Ryan de Melo

Getting JSON Out of LLMs Without Crying

The first time a structured-output feature went down in production, it was not because the model returned broken JSON. The JSON parsed fine. Every field was present, every type was correct. The model had cheerfully invoiced a customer in a currency that does not exist, with a tax rate it invented, and the shape was so clean that nothing downstream blinked. We found out three days later from finance.

That is the whole problem in one sentence. Getting a model to emit well-formed JSON is mostly solved now. Getting it to emit true JSON is not, and no amount of constrained decoding will solve it for you.

Let me walk the honest hierarchy, because every team I talk to skips straight to the wrong rung.

Rung one: ask nicely, and don’t trust it

The oldest move is to put the schema in the prompt and beg. “Respond with JSON matching this shape, no prose, no markdown fences.” It works often enough to demo and fails often enough to page you. The model wraps the JSON in a code fence one time in fifty. It adds a chatty preamble. It trails a comment. It decides today is the day for single quotes.
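To make those failure modes concrete, here is a small sketch that feeds a strict parser the kinds of replies described above. The reply strings are made up for illustration, not captured from a real model.

```python
import json

FENCE = "`" * 3  # built this way so the example does not contain a literal fence

# Hypothetical replies, one per failure mode described above.
replies = {
    "clean": '{"amount_cents": 1200}',
    "fenced": FENCE + 'json\n{"amount_cents": 1200}\n' + FENCE,
    "preamble": 'Sure! Here is the JSON:\n{"amount_cents": 1200}',
    "single_quotes": "{'amount_cents': 1200}",
    "trailing_comment": '{"amount_cents": 1200} // total in cents',
}


def parses(text: str) -> bool:
    """Return True only if the whole reply is valid JSON, nothing more."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


results = {name: parses(text) for name, text in replies.items()}
for name, ok in results.items():
    print(f"{name:18} parses={ok}")
```

Only the clean reply survives. Each of the others carries the same data and still breaks a strict parser, which is the argument for not relying on the prompt for format.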

If prompting is your only layer, you end up writing a parser that strips fences, hunts for the first {, balances braces by hand, and slowly turns into the saddest state machine you have ever maintained. I have written that parser. Twice. Do not write that parser.

The lesson is not “prompt better.” The lesson is that prompting controls content and is hopeless at guaranteeing format. Use it for the former. Get the format guarantee somewhere stronger.

Rung two: make the format impossible to get wrong

This is where 2024 is genuinely better than 2023. You have real options for forcing valid structure instead of hoping for it.

Function calling (now sold as tool calls) lets you hand the model a JSON Schema and get back arguments shaped to it; unless your provider enforces that schema during decoding, treat conformance as likely, not guaranteed. JSON mode on the chat endpoint constrains output to syntactically valid JSON and knows nothing about your fields. And underneath the strict versions of both, constrained decoding masks the token logits at each step so the model can only sample tokens that keep the output grammatically legal. If you self-host, you can drive this directly: a grammar or a Pydantic schema, and the sampler physically cannot emit a token that breaks it.

That last one is the real magic, and it is worth understanding what it does and does not buy you. Constrained decoding makes malformed JSON impossible. The output will parse (as long as generation is not cut off by a token limit mid-object). The output will match your types. It changes nothing about whether the values are correct, because masking logits enforces grammar, not truth. A grammar that says amount is a number is perfectly satisfied by -1 for a payment.
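A toy version of one decoding step makes this easier to see. This is a sketch under heavy assumptions: the five-token vocabulary, the scores, and the allowed set are invented, and real implementations work on tensors over the full vocabulary with a grammar automaton deciding what is allowed.

```python
import math


def mask_logits(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Set every disallowed token's logit to -inf so it can never be chosen."""
    return {
        token: (score if token in allowed else -math.inf)
        for token, score in logits.items()
    }


def greedy_pick(logits: dict[str, float]) -> str:
    """Pick the highest-scoring token (greedy decoding, for simplicity)."""
    return max(logits, key=logits.get)


# Toy step: the output so far is '{"amount": ', so the grammar only
# allows tokens that can start a number. All values here are invented.
logits = {"Sure": 3.1, '"': 2.0, "-1": 1.5, "42": 1.2, "}": 0.3}
allowed_by_grammar = {"-1", "42"}

unconstrained = greedy_pick(logits)
constrained = greedy_pick(mask_logits(logits, allowed_by_grammar))

print(unconstrained)  # the chatty token wins without the mask
print(constrained)    # a legal number wins with it, even a negative one
```

The mask removes the chatty token and the stray quote, and it is entirely happy with the negative number. Grammar-legal and correct are different properties.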

So pick the strongest format guarantee your stack supports and stop hand-parsing. Here is the boring scaffold I actually reach for, with the schema written as the contract it is:

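from pydantic import BaseModel, field_validator

# example allowlist for illustration; load the real one from your own config
SUPPORTED_CURRENCIES = {"USD", "EUR", "GBP"}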

class Invoice(BaseModel):
    customer_id: str
    currency: str          # ISO 4217, validated below, NOT trusted from the model
    amount_cents: int      # cents, because float money is how you get sued
    line_items: list[str]

    @field_validator("currency")
    @classmethod
    def known_currency(cls, v):
        # the model will happily invent "USDX" and mean it sincerely
        if v not in SUPPORTED_CURRENCIES:
            raise ValueError(f"unsupported currency: {v}")
        return v

    @field_validator("amount_cents")
    @classmethod
    def sane_amount(cls, v):
        # negative or absurd amounts are a tell that the model lost the plot.
        # the cap is 10,000,000 cents, i.e. 100,000.00 in major units
        if v <= 0 or v > 100_000_00:
            raise ValueError(f"amount out of bounds: {v}")
        return v

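Notice what the schema is doing. The types come from the schema itself, enforced by tool calls or constrained decoding (plain JSON mode guarantees syntax only, not your field types). The validators are mine, and they encode things the grammar cannot: this currency is real, this amount is plausible. That distinction is the entire post. (Whether the id actually exists takes a lookup, which is the next rung.)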

Rung three: validate against reality, not just types

Here is the part nobody tells you. A schema that only checks types is a smoke detector with the battery taken out. It will pass the exact outputs that hurt you.

Type-valid and wrong is the dominant failure mode in structured output, and it is more dangerous than malformed JSON, not less, precisely because it parses. Malformed JSON throws an exception and you handle it. Confident wrong JSON sails through every layer that only asks “does this parse” and lands in a database, an email, a wire transfer.

So validation has to reach past shape into meaning. Does this customer_id exist in our system, or did the model hallucinate a plausible-looking string? Is this currency one we actually support? Does the sum of line items match the total the model also produced? (Models are bad at arithmetic and worse when they have to keep two numbers consistent.) These are not schema checks. They are business checks, and they are the ones that catch the invented invoice.

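def validate_semantics(inv: Invoice) -> list[str]:
    # customer_exists and price_of are your own lookups (database, price catalog).
    # price_of should return integer cents so the comparison below is exact.
    problems = []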
    if not customer_exists(inv.customer_id):
        problems.append(f"unknown customer {inv.customer_id}")
    # the classic: model emits line items AND a total, and they disagree.
    # trust neither; recompute and demand they agree.
    computed = sum(price_of(item) for item in inv.line_items)
    if computed != inv.amount_cents:
        problems.append(f"total {inv.amount_cents} != sum of items {computed}")
    return problems

If validate_semantics returns nothing, you have something worth acting on. If it returns problems, you do not throw the model output away yet. You give it one chance to fix itself.
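If you want to poke at that check without a database, here is a self-contained sketch. Everything in it beyond validate_semantics itself is a stand-in I made up: the dataclass replaces the Pydantic model, and KNOWN_CUSTOMERS and PRICES_CENTS are in-memory fakes for the real lookups.

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    """Simplified stand-in for the Pydantic model; types only, no validators."""
    customer_id: str
    currency: str
    amount_cents: int
    line_items: list[str]


# Hypothetical in-memory fakes for the real customer table and price catalog.
KNOWN_CUSTOMERS = {"cust_001"}
PRICES_CENTS = {"widget": 1500, "gasket": 250}


def customer_exists(customer_id: str) -> bool:
    return customer_id in KNOWN_CUSTOMERS


def price_of(item: str) -> int:
    # Raises KeyError for an unknown item; a real catalog lookup would
    # want to report that as a validation problem instead.
    return PRICES_CENTS[item]


def validate_semantics(inv: Invoice) -> list[str]:
    problems = []
    if not customer_exists(inv.customer_id):
        problems.append(f"unknown customer {inv.customer_id}")
    computed = sum(price_of(item) for item in inv.line_items)
    if computed != inv.amount_cents:
        problems.append(f"total {inv.amount_cents} != sum of items {computed}")
    return problems


good = Invoice("cust_001", "USD", 1750, ["widget", "gasket"])
bad = Invoice("cust_999", "USD", 2000, ["widget", "gasket"])

print(validate_semantics(good))
print(validate_semantics(bad))
```

Both objects are type-valid. Only the second one fails, and it fails on exactly the two things a type check cannot see: an id that matches nothing and a total that does not add up.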

Rung four: repair, with a tight leash

Repair is sending the bad output back with the specific complaint and asking for a corrected version. It works surprisingly well, because the model is usually close, not lost. The trick is to feed it the exact validator errors, not a vague “that was wrong,” and to cap the attempts hard.

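class StructuredOutputError(Exception):
    """Raised when no valid invoice comes back within the repair budget."""

# call_model is your own client wrapper; it should return the raw JSON string.
def extract_invoice(prompt, max_repairs=2):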
    messages = [{"role": "user", "content": prompt}]
    for attempt in range(max_repairs + 1):
        raw = call_model(messages, response_format=Invoice)  # constrained decode
        try:
            inv = Invoice.model_validate_json(raw)
            problems = validate_semantics(inv)
        except ValueError as e:
            problems = [str(e)]
            inv = None
        if not problems:
            return inv
        # hand back the precise grievance so the model can target it.
        # vague feedback gets vague fixes.
        messages.append({"role": "assistant", "content": raw})
        messages.append({"role": "user",
            "content": f"That failed validation: {problems}. Fix only these and resend."})
    # out of retries. do NOT return a best guess. fail loud.
    raise StructuredOutputError(f"gave up after {max_repairs} repairs: {problems}")

Two repairs. Maybe three. With max_repairs=2 that is three calls in total, and if the model cannot produce a valid invoice in three tries, a fourth is not going to find enlightenment; it is just going to spend tokens and add latency while the user waits. (I have watched an uncapped repair loop retry forty times on a malformed request because the prompt was the problem, not the model. The loop cannot fix your prompt.)
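The cap is easy to test if you swap the model for a script. This is a minimal sketch, not the production loop: scripted_model is a fake I invented that replays canned replies, and the validator checks a single made-up currency rule so the example stays short.

```python
import json


class StructuredOutputError(Exception):
    pass


def validate(raw: str) -> list[str]:
    """Toy validator: parse the reply, then apply one business rule."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e.msg}"]
    if not isinstance(data, dict):
        return ["expected a JSON object"]
    if data.get("currency") not in {"USD", "EUR"}:
        return [f"unsupported currency: {data.get('currency')}"]
    return []


def extract(call_model, prompt: str, max_repairs: int = 2) -> dict:
    messages = [{"role": "user", "content": prompt}]
    problems: list[str] = []
    for _ in range(max_repairs + 1):  # one first try plus max_repairs repairs
        raw = call_model(messages)
        problems = validate(raw)
        if not problems:
            return json.loads(raw)
        messages.append({"role": "assistant", "content": raw})
        messages.append({"role": "user",
            "content": f"That failed validation: {problems}. Fix only these and resend."})
    raise StructuredOutputError(f"gave up after {max_repairs} repairs: {problems}")


def scripted_model(replies: list[str]):
    """Fake model: returns canned replies in order, repeating the last one."""
    calls = {"n": 0}

    def call(messages):
        reply = replies[min(calls["n"], len(replies) - 1)]
        calls["n"] += 1
        return reply

    return call, calls


# Case 1: the model fixes itself on the first repair.
fixes_itself, calls_a = scripted_model(['{"currency": "USDX"}', '{"currency": "USD"}'])
result = extract(fixes_itself, "Invoice for ACME")

# Case 2: the model never fixes it, so the loop must stop at the cap.
never_fixes, calls_b = scripted_model(['{"currency": "USDX"}'])
try:
    extract(never_fixes, "Invoice for ACME")
    gave_up = False
except StructuredOutputError:
    gave_up = True

print(result, calls_a["n"], gave_up, calls_b["n"])
```

The second case is the one worth keeping as a regression test. It pins the number of model calls, so nobody can quietly remove the cap.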

Rung five: fail loudly, on purpose

This is the rung people are most tempted to skip, because failing loud feels like shipping something broken. It is the opposite. The worst outcome in a structured-output pipeline is not an error. It is a confident, well-formed, semantically wrong object that no one notices until it has done damage. The exception is the feature working. It means your validators caught what the format guarantee could not, and refused to let garbage through.

So when you run out of repairs, raise. Log the raw output, the schema, and the validator errors, because that triple is exactly what you need to debug it later. Surface a clean “I could not produce a reliable result” to the user. Never paper over it with a default object or a partial parse. A wrong invoice is worse than no invoice.
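One way to make that triple hard to lose is to carry it on the exception itself. The sketch below is an assumption about shape, not a prescription: the logger name, the field names, and the respond helper are all mine.

```python
import json
import logging

logger = logging.getLogger("structured_output")  # logger name is an assumption


class StructuredOutputError(Exception):
    """Carries the debugging triple: raw output, schema, validator errors."""

    def __init__(self, raw_output: str, schema: dict, validator_errors: list[str]):
        super().__init__(f"no reliable result: {validator_errors}")
        self.raw_output = raw_output
        self.schema = schema
        self.validator_errors = validator_errors


def fail_loud(raw_output: str, schema: dict, validator_errors: list[str]):
    """Log the triple as one JSON line, then raise. Never returns normally."""
    logger.error("structured output failed: %s", json.dumps({
        "raw_output": raw_output,
        "schema": schema,
        "validator_errors": validator_errors,
    }))
    raise StructuredOutputError(raw_output, schema, validator_errors)


def respond(produce) -> dict:
    """Boundary handler: turn the exception into a clean user-facing message.

    Note what is absent: no default object, no partial parse.
    """
    try:
        return {"ok": True, "invoice": produce()}
    except StructuredOutputError:
        return {"ok": False, "message": "I could not produce a reliable result."}


def broken_pipeline():
    # Hypothetical pipeline that has run out of repairs.
    fail_loud('{"currency": "USDX"}', {"type": "object"},
              ["unsupported currency: USDX"])


reply = respond(broken_pipeline)
print(reply)
```

The user sees a short refusal, the log line holds everything needed to reproduce the failure, and nothing downstream ever receives an invoice-shaped object.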

The thing to actually take away

Every layer here exists because the one below it cannot do the job alone. Prompting shapes content and guarantees nothing about format. Constrained decoding guarantees format and knows nothing about truth. Validation checks truth and cannot fix what it finds. Repair fixes what it can and has to know when to quit. And failing loud is what keeps the wrong answers out of the systems that matter.

The single tell of a team that has not run this in production is parsing model output with a regex. If you are reaching for re.search to pull JSON out of a model’s reply, you are on rung zero, fighting a battle that constrained decoding already won for free. Climb the ladder. Then write the validators, because the model will hand you a beautiful, valid, completely fabricated invoice, and it will do it with total confidence, and the JSON will be perfect.

