Structured Outputs in LLMs: JSON Mode, Function Calling, and Schema Validation
Introduction
Language models are built to generate fluent text. They are excellent at writing, explaining, and conversing. But most production applications do not want a paragraph of prose, they want data they can parse, store in a database, or pass to another system.
Consider these real use cases: extracting a customer's name, email, and order number from a support email; parsing an invoice to return a list of line items with prices; classifying a support ticket into a predefined category; or generating an API call from a natural language command. All of these require the LLM to produce output in a specific, machine-readable format, typically JSON.
The problem is that LLMs are probabilistic text generators, not compilers. If you just ask "return the answer as JSON", the model might produce valid JSON most of the time, but in production, even a 2% failure rate means thousands of errors per day. This article explains why free-form output is unreliable and walks through four progressively more robust approaches to structured output.
Problem Statement
When you ask a language model to produce JSON, it generates text token by token. It does not have an internal JSON parser checking its work, it is predicting the next most probable token based on everything before it. This leads to a consistent set of failure modes that break downstream systems.
Syntax errors appear as missing commas, unmatched brackets, or trailing commas in positions where the JSON specification forbids them. Type mismatches occur when the model returns a string where your schema expects an integer. Required fields get silently omitted when the model decides they are not relevant to the input. Extra fields get added that your strict parser rejects. And the model frequently wraps the JSON in explanatory prose, "Here is the JSON you requested:", which breaks any parser that expects the response to start with a curly brace.
Each approach described below adds a stronger guarantee at the cost of some additional implementation effort.
Core Concepts and Terminology
| Term | Definition |
|---|---|
| Structured output | LLM-generated content formatted in a machine-readable schema such as JSON, XML, or a custom grammar. |
| JSON mode | An API feature that constrains the model's output to be syntactically valid JSON at every generation step. |
| Function calling / tool use | An API feature where you define a typed schema and the model fills it in, guaranteeing required fields and types. |
| Schema | A formal specification of the fields, types, and constraints a structured output must satisfy. |
| Logits | The raw probability scores the model assigns to each possible next token before generation. |
| Constrained decoding | A technique that filters or masks invalid token choices at each generation step to enforce a format constraint. |
| Grammar-constrained generation | Using a formal grammar (such as BNF) to define all valid output shapes, masking any token that would violate the grammar. |
| Pydantic | A Python library for data validation that checks whether a parsed structure matches a typed schema and raises detailed errors when it does not. |
| Retry loop with feedback | A pattern where a validation failure is fed back to the model as context so it can correct its output on the next attempt. |
| Hallucinated value | A value that is syntactically correct but semantically wrong, for example, a fabricated email address or invented product code. |
How It Works
Think of the problem like ordering from a restaurant. A zero-shot text prompt is like telling the waiter "bring me something healthy", you might get what you want, or you might not. JSON mode is like handing over a form with checkboxes, the waiter must fill in all boxes, but the choices within each box are still theirs. Function calling is like giving the waiter a typed order slip with mandatory fields, they cannot leave the table without filling in every required item. Grammar-constrained generation is like a vending machine, the interface only allows valid selections in the first place.
- Prompt engineering: You describe the desired JSON structure in plain language and show the model an example schema. The model tries to comply but has no mechanical enforcement. This works for simple cases and prototyping but fails unpredictably at scale.
- JSON mode: The API filters the model's token probability distribution at each step, zeroing out any token that would produce syntactically invalid JSON at the current position. The output is always parseable JSON. However, the model still decides what keys and values to include, it can omit required fields or use wrong types. JSON mode solves "will this parse?" but not "is this correct?" Note: OpenAI offers two related but distinct features here — JSON mode (guarantees syntactically valid JSON but not schema compliance) and Structured Outputs (released August 2024, guarantees full schema compliance via constrained decoding). Anthropic's API achieves structured output through tool use (function calling), which constrains the model to return a defined JSON schema; there is no separate JSON mode toggle as in OpenAI's API.
- Function calling: You define a function schema with typed parameters and required fields, analogous to a typed form the model must fill in. The API uses constrained decoding to ensure the output satisfies the schema, including type constraints and required fields. This is the gold standard for production reliability.
- Schema validation with Pydantic: Even after function calling, the model can produce values that are structurally valid but semantically wrong, a fabricated email, an invented ID. Pydantic validates the parsed output against business logic rules and raises detailed errors if any field fails. These errors can be fed back to the model for a corrective retry.
- Grammar-constrained generation: For self-hosted models, you define a formal grammar that specifies every valid token sequence for your output format. The inference engine masks any token that would leave a valid parse path, making syntax errors physically impossible. This is the most powerful approach but requires control over the inference engine.
Practical Example
A B2B SaaS company receives hundreds of support emails daily. They want to automatically extract the sender's name, email address, phone number, and a one-sentence description of their issue, then insert this into a CRM. They need this to work reliably at scale, a 2% parse failure rate would mean dozens of dropped leads per day.
Using function calling, they define an extraction schema with four typed fields, marking name and email as required. The API guarantees the output always contains at least those two fields in the correct format. They layer Pydantic on top to validate that the email field actually contains a valid email address (catching cases where the model extracts a company domain instead of a personal address). A retry loop with error feedback handles the remaining edge cases. Their end-to-end parse success rate goes from roughly 94% with prompt engineering to over 99.5% with function calling plus validation.
Advantages
- Pipeline reliability: Downstream systems, databases, APIs, dashboards, can depend on a consistent data shape rather than building fragile fallback parsers for every possible malformed response.
- Error transparency: When Pydantic validation fails, it produces a specific, human-readable error message that identifies exactly which field failed and why. This makes debugging fast and makes retry prompts informative.
- Agentic system foundations: Function calling is the mechanism that allows LLMs to trigger real-world actions, API calls, database writes, tool invocations. Reliable structured output is a prerequisite for any agentic workflow.
- Reduced post-processing cost: Without structured output guarantees, teams write extensive parsing logic with fallbacks for malformed responses. Function calling eliminates most of this boilerplate.
Limitations and Trade-offs
- Syntax correctness is not semantic correctness. JSON mode and function calling guarantee structural validity. They do not prevent the model from fabricating values that look correct, a plausible-sounding but invented email address will pass all format checks.
- Function calling requires API support. Older models and many open-source models lack native function calling. For these, you are back to prompt engineering plus manual parsing.
- Grammar-constrained generation requires infrastructure control. This approach only works for self-hosted or local models where you control the inference engine. It is not available through standard cloud APIs.
- Schema complexity adds brittleness. Very complex nested schemas with many conditional fields can confuse the model, causing it to fill required fields with placeholder values rather than extracting genuine information.
- Retry loops increase cost and latency. Feeding validation errors back to the model for correction adds API calls and introduces variable response times. For real-time applications, this is a meaningful trade-off.
Common Mistakes
- Trusting JSON mode alone for production. JSON mode guarantees parseable output, not correct output. Teams that stop at JSON mode still face missing required fields and wrong data types in their data pipelines.
- Skipping Pydantic validation after function calling. Function calling enforces schema shape but cannot prevent the model from hallucinating values. Always validate extracted data against your real data sources and business logic before writing to a database or calling an API.
- Not versioning schemas. Schema changes, adding a new required field, renaming a key, silently break downstream consumers if not tracked and communicated. Treat your extraction schemas like a versioned API contract.
- Using constrained syntax as a security substitute. Grammar-constrained generation enforces output shape only. A syntactically valid SQL query can still contain an injection attack. Always use parameterized queries and dedicated input sanitization independently of structured output validation.
- Allowing user input to modify schemas. If users can control function definitions or JSON schemas, an attacker can inject schema modifications that expose sensitive fields or bypass validation rules. Schemas must be developer-controlled.
Best Practices
- Use function calling for any production workflow where data will be written to a database or used to trigger an action.
- Always layer Pydantic validation on top of function calling, validate before using the data, not just before parsing it.
- Implement retry loops with error feedback. Feed the specific Pydantic validation error back to the model in the next turn rather than simply retrying with the same prompt.
- Keep schemas as simple as the task allows. Start with flat structures and add nesting only when the data genuinely requires it.
- Test your schemas against edge cases: empty inputs, fields that do not exist in the source text, inputs in unexpected languages, and ambiguous or contradictory data.
- Log all structured outputs for debugging and auditing. Include the raw model output before parsing so you can diagnose failures retrospectively.
- Monitor extraction success rates as a production metric. A drop in success rate is often an early signal of a model update or input distribution shift before it manifests as visible errors.
- Version your schemas and treat schema changes as breaking API changes that require consumer notification.
Comparison: Structured Output Methods
| Method | Syntax Guarantee | Schema Enforcement | Required Fields Guaranteed | Ease of Use | Best For |
|---|---|---|---|---|---|
| Prompt Engineering | No | No | No | Very easy | Prototyping and simple cases |
| JSON Mode | Yes | No | No | Easy | General JSON output with no strict schema |
| Function Calling | Yes | Yes | Yes | Medium | Production apps, agentic systems, critical data pipelines |
| Pydantic Validation | Post-check | Yes | Yes | Medium | Business logic validation layered on any method above |
| Grammar-Constrained Generation | Yes (hard) | Yes | Yes | Hard | Self-hosted models, non-JSON formats, zero-error requirements |
FAQ
If function calling guarantees the schema, why do I still need Pydantic?
Function calling guarantees the structure of the output, the fields are present, the types match the schema. It does not guarantee that the values are correct. The model can fabricate an email address that is syntactically valid but does not belong to the person in question, invent a product code that does not exist in your catalog, or fill in a required field with a placeholder value. Pydantic lets you add business logic checks, "this email must be a real address", "this product ID must exist in our database", that API-level schema enforcement cannot provide.
When should I use grammar-constrained generation instead of function calling?
Grammar-constrained generation is primarily for two scenarios. First, when you are running a self-hosted open-source model and have control over the inference engine, tools like llama.cpp and the Outlines library make this accessible. Second, when your output format is not JSON at all. If you need the model to produce SQL, a domain-specific language, or a custom structured format, a grammar gives you precise control that function calling cannot provide.
How many retries should I allow in a retry loop?
Two to three retries is the common production standard. Beyond three attempts, the model has typically demonstrated that it cannot extract the required information from the input text, continuing to retry adds cost without meaningfully increasing success rates. After three failures, it is better to route the input to a human review queue or return a structured error to the calling system rather than looping indefinitely.
Does JSON mode work with all LLM providers?
No. JSON mode is available in OpenAI's API using the response_format parameter, and in several other major providers. Anthropic's API does not have a dedicated JSON mode toggle; structured output is achieved through tool use (function calling), which constrains the model to return a defined JSON schema. Many open-source models and smaller API providers do not support constrained output natively. For models without such support, prompt engineering combined with robust post-processing and retry logic is the fallback approach.
Can I use structured output for classification tasks?
Yes, and it is often the best approach. Rather than asking the model to return a category name as free text, which can produce spelling variations, extra whitespace, or synonyms, you define a function schema with an enum field listing the valid categories. The model is constrained to pick exactly one of the allowed values. This is far more reliable than string-matching on free-text responses.
References
- OpenAI (2023). Function Calling and JSON Mode. platform.openai.com
- Willard, B. T., & Louf, R. (2023). Efficient Guided Generation for Large Language Models. arXiv:2307.09702.
- Pydantic Documentation. Data Validation for Python
- Outlines. Structured Text Generation Library
- Chase, H. (2023). LangChain Output Parsers. python.langchain.com
Key Takeaways
- JSON mode guarantees syntactically valid output but not schema compliance. For production reliability, use function calling, which enforces required fields and types at the API level.
- Always layer Pydantic validation on top of function calling: the model can still produce structurally valid but semantically wrong values, fabricated emails, invented IDs, that only business logic checks can catch.
- Implement retry loops with error feedback. Feeding the specific Pydantic validation failure back to the model in the next turn dramatically improves success rates compared to blind retries.
- For self-hosted models or non-JSON formats like SQL or custom domain-specific languages, grammar-based constrained generation provides hard syntax guarantees, but remember that syntactic validity is not the same as safe input; always use parameterized queries separately.
- Version your schemas and treat them like API contracts. Unannounced schema changes silently break downstream consumers in ways that are difficult to diagnose.
- Monitor extraction success rates as a live production metric. A declining success rate often signals a model update or input distribution shift before it appears as visible application errors.
Related Articles