Getting JSON Out of LLMs Without Crying

The horror story everyone reaches for is the model returning broken JSON. The time it actually bit me, the JSON was immaculate. Our structured-output feature went down, and every field was present, every type was correct, and the model had cheerfully invoiced a customer in a currency that does not exist, at a tax rate it invented. The shape was so clean that nothing downstream blinked. We heard about it three days later, from finance.

Getting a model to emit well-formed JSON is mostly solved now. Getting it to emit true JSON is not, and no amount of constrained decoding will solve it for you.

The honest hierarchy has five rungs, and every team I talk to skips straight to the wrong one.

Rung one: ask nicely, and don’t trust it

The oldest move is to put the schema in the prompt and beg. “Respond with JSON matching this shape, no prose, no markdown fences.” It works often enough to demo and fails often enough to page you. The model wraps the JSON in a code fence one time in fifty, adds a chatty preamble, trails a comment, or decides today is the day for single quotes.

If prompting is your only layer, you end up writing a parser that strips fences, hunts for the first {, balances braces by hand, and slowly turns into the saddest state machine you have ever maintained. I have written that parser twice, and the second time cured me of writing it a third.

The fix is not to prompt better. Prompting controls content and is hopeless at guaranteeing format, so use it for the former and get your format guarantee from something stronger.

Rung two: make the format impossible to get wrong

This is where 2024 is genuinely better than 2023. You have real options for forcing valid structure instead of hoping for it.

Function calling (now sold as tool calls) lets you hand the model a JSON Schema and get back arguments shaped to it; whether conformance is guaranteed or merely likely depends on whether your provider enforces the schema strictly. JSON mode on the chat endpoint constrains output to syntactically valid JSON and promises nothing about which keys show up. The strongest version of the idea is constrained decoding, which masks the token logits at each step so the model can only sample tokens that keep the output grammatically legal. If you self-host, you can drive this directly: a grammar or a Pydantic schema, and the sampler physically cannot emit a token that breaks it.

That last one is the real magic, and it is worth understanding what it does and does not buy you. Constrained decoding makes malformed JSON impossible, short of the response being cut off at the token limit. The output parses and the types match, and none of that touches whether the values are correct, because masking logits enforces grammar, not truth. A grammar that says amount is a number is perfectly satisfied by -1 for a payment.
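To make the masking concrete, here is a toy sketch in plain Python. Everything in it is invented for illustration: the handful of whole-string tokens, the hand-written allowed_next grammar, and fake_logits standing in for a model. Real samplers work over a full tokenizer and a compiled grammar, so treat this as the idea, not an implementation.

```python
import json
import math

# Hypothetical toy grammar for one fragment: {"amount": <optional minus><digits>}
def allowed_next(prefix: list[str]) -> set[str]:
    """Return the tokens that keep the output grammatically legal."""
    if not prefix:
        return {"{"}
    transitions = {
        "{": {'"amount"'},
        '"amount"': {":"},
        ":": {"-", "7"},
        "-": {"7"},
        "7": {"7", "}"},
        "}": set(),  # object closed, nothing may follow
    }
    return transitions[prefix[-1]]

def fake_logits(prefix: list[str]) -> dict[str, float]:
    """Stand-in for a model: it would love to say 'hello' at every step."""
    return {"hello": 5.0, "-": 4.0, "}": 3.5, "7": 3.0,
            "{": 1.0, '"amount"': 1.0, ":": 1.0}

def mask(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    # Illegal tokens get -inf, so they can never win the argmax.
    return {tok: (s if tok in allowed else -math.inf) for tok, s in logits.items()}

def constrained_greedy(logit_fn, max_steps: int = 16) -> str:
    out: list[str] = []
    for _ in range(max_steps):
        allowed = allowed_next(out)
        if not allowed:
            break
        masked = mask(logit_fn(out), allowed)
        out.append(max(masked, key=masked.get))
    return "".join(out)

text = constrained_greedy(fake_logits)
payment = json.loads(text)  # parses cleanly
```

The masked sampler never emits hello, however much the fake model prefers it, and the result parses. It also encodes a negative payment, which the grammar has no opinion about.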

Pick the strongest format guarantee your stack supports and stop hand-parsing. The boring scaffold I actually reach for looks like this, with the schema written as the contract it is:

from pydantic import BaseModel, field_validator

# Illustrative allow-list; in practice load the currencies you actually support.
SUPPORTED_CURRENCIES = {"USD", "EUR", "GBP"}

class Invoice(BaseModel):
    customer_id: str
    currency: str          # ISO 4217, validated below, NOT trusted from the model
    amount_cents: int      # cents, because float money is how you get sued
    line_items: list[str]

    @field_validator("currency")
    @classmethod
    def known_currency(cls, v: str) -> str:
        # the model will happily invent "USDX" and mean it sincerely
        if v not in SUPPORTED_CURRENCIES:
            raise ValueError(f"unsupported currency: {v}")
        return v

    @field_validator("amount_cents")
    @classmethod
    def sane_amount(cls, v: int) -> int:
        # negative or absurd amounts are a tell that the model lost the plot
        # upper bound is $100,000.00, written as dollars_cents
        if v <= 0 or v > 100_000_00:
            raise ValueError(f"amount out of bounds: {v}")
        return v

Notice the division of labor. The types come from the schema, enforced by strict tool calls or constrained decoding (plain JSON mode only promises syntax); the validators are mine, encoding what the grammar cannot: this currency is real, this amount is plausible. Whether the id actually exists takes a lookup, which belongs to the next rung. That line between format and meaning is what the rest of this comes down to.

Rung three: validate against reality, not just types

A schema that only checks types is a smoke detector with the battery taken out. It will pass the exact outputs that hurt you.

Type-valid and wrong is the dominant failure mode in structured output, and it is more dangerous than malformed JSON, not less, precisely because it parses. Malformed JSON throws an exception and you handle it. Confident wrong JSON sails through every layer that only asks “does this parse” and lands in a database, an email, a wire transfer.
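A small sketch of that failure, using only the standard library. The payload, the field list, and the KNOWN_CURRENCIES set are all made up for illustration; the point is only that a check which stops at keys and types waves the fabricated invoice through.

```python
import json

# Hypothetical shape-only check: required keys present, types correct, nothing more.
EXPECTED_TYPES = {
    "customer_id": str,
    "currency": str,
    "amount_cents": int,
    "tax_rate": float,
}

def shape_ok(payload: dict) -> bool:
    return (payload.keys() == EXPECTED_TYPES.keys()
            and all(isinstance(payload[k], t) for k, t in EXPECTED_TYPES.items()))

# Well-formed, fully typed, and invented from top to bottom.
raw = ('{"customer_id": "cus_8841", "currency": "USDX", '
       '"amount_cents": 129900, "tax_rate": 0.4375}')
payload = json.loads(raw)

passes_shape = shape_ok(payload)

# Illustrative allow-list; a real system would consult its own records.
KNOWN_CURRENCIES = {"USD", "EUR", "GBP"}
passes_meaning = payload["currency"] in KNOWN_CURRENCIES
```

Here passes_shape is true and passes_meaning is false. Any pipeline that only asks the first question ships this object.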

Validation has to reach past shape into meaning. Does this customer_id exist in our system, or did the model hallucinate a plausible-looking string? Is this currency one we actually support? Does the sum of line items match the total the model also produced? (Models are bad at arithmetic and worse when they have to keep two numbers consistent.) These are business checks, not schema checks, and they are the ones that catch the invented invoice.

def validate_semantics(inv: Invoice) -> list[str]:
    # customer_exists and price_of stand for your own lookups
    # (a customer table, a price catalog in cents); they are not library calls.
    problems: list[str] = []
    if not customer_exists(inv.customer_id):
        problems.append(f"unknown customer {inv.customer_id}")
    # the classic: model emits line items AND a total, and they disagree.
    # trust neither; recompute and demand they agree.
    computed = sum(price_of(item) for item in inv.line_items)
    if computed != inv.amount_cents:
        problems.append(f"total {inv.amount_cents} != sum of items {computed}")
    return problems

If validate_semantics returns nothing, you have something worth acting on. If it returns problems, you do not throw the model output away yet. You give it one chance to fix itself.

Rung four: repair, with a tight leash

Repair is sending the bad output back with the specific complaint and asking for a corrected version. It works surprisingly well, because the model is usually close, not lost. The trick is to feed it the exact validator errors, not a vague “that was wrong,” and to cap the attempts hard.

class StructuredOutputError(Exception):
    """Raised when the repair budget runs out."""

def extract_invoice(prompt, max_repairs=2):
    # call_model stands for your own client wrapper; it is assumed to
    # return the raw JSON string produced under the Invoice schema.
    messages = [{"role": "user", "content": prompt}]
    problems = []
    for attempt in range(max_repairs + 1):  # one initial try plus max_repairs repairs
        raw = call_model(messages, response_format=Invoice)  # constrained decode
        try:
            inv = Invoice.model_validate_json(raw)
            problems = validate_semantics(inv)
        except ValueError as e:  # pydantic's ValidationError subclasses ValueError
            problems = [str(e)]
            inv = None
        if not problems:
            return inv
        # hand back the precise grievance so the model can target it.
        # vague feedback gets vague fixes.
        messages.append({"role": "assistant", "content": raw})
        messages.append({"role": "user",
            "content": f"That failed validation: {problems}. Fix only these and resend."})
    # out of retries. do NOT return a best guess. fail loud.
    raise StructuredOutputError(f"gave up after {max_repairs} repairs: {problems}")

Cap it at two repairs, three at the outside. Two repairs means three tries in total, and if the model cannot produce a valid invoice in three tries, a fourth is not going to find enlightenment; it will just spend tokens and add latency while the user waits. (I once watched an uncapped repair loop retry forty times on a malformed request, because the prompt was the problem, not the model, and a loop cannot fix your prompt.)
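To see the budget behave, here is a stripped-down stand-in for the loop with the model stubbed out. The names run_with_repairs, GaveUp, stubborn_model, and corrigible_model are hypothetical, and the stubs return plain dicts where a real model would return text. It is only meant to show that the cap holds.

```python
class GaveUp(Exception):
    """Raised when the repair budget is spent (hypothetical name)."""

def run_with_repairs(call, validate, max_repairs: int = 2):
    feedback = None
    problems: list[str] = []
    for _ in range(max_repairs + 1):  # one initial try plus max_repairs repairs
        output = call(feedback)
        problems = validate(output)
        if not problems:
            return output
        feedback = f"That failed validation: {problems}. Fix only these and resend."
    raise GaveUp(f"gave up after {max_repairs} repairs: {problems}")

def validate(output: dict) -> list[str]:
    if output["currency"] == "USD":
        return []
    return [f"unsupported currency: {output['currency']}"]

calls = []

def stubborn_model(feedback):
    # Never listens: same wrong answer no matter what it is told.
    calls.append(feedback)
    return {"currency": "USDX"}

def corrigible_model(feedback):
    # Wrong on the first try, right once it has seen the complaint.
    return {"currency": "USD"} if feedback else {"currency": "USDX"}

try:
    run_with_repairs(stubborn_model, validate)
    gave_up = False
except GaveUp:
    gave_up = True

fixed = run_with_repairs(corrigible_model, validate)
```

The stubborn stub is called exactly three times and then the loop raises; the corrigible one is fixed on its first repair. Nothing in the loop can turn the first case into the second, which is the argument for the cap.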

Rung five: fail loudly

This is the rung people are most tempted to skip, because failing loud feels like shipping something broken, and it is exactly backwards. The worst outcome in a structured-output pipeline is not an error but a confident, well-formed, semantically wrong object that no one notices until it has done damage. An exception here means the feature is working: your validators caught what the format guarantee could not, and refused to let garbage through.

When you run out of repairs, raise. Log the raw output, the schema, and the validator errors, because that triple is exactly what you need to debug it later. Surface a clean “I could not produce a reliable result” to the user. Never paper over it with a default object or a partial parse. A wrong invoice is worse than no invoice.
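One way this could look, as a sketch using only the standard library. The fields on StructuredOutputError, the logger name, and the fail_loudly and user_facing helpers are assumptions made for the example, not a prescribed interface.

```python
import json
import logging

logger = logging.getLogger("structured_output")

class StructuredOutputError(Exception):
    """Carries the triple needed to debug the failure later."""
    def __init__(self, raw: str, schema: dict, errors: list[str]):
        super().__init__(f"structured output failed validation: {errors}")
        self.raw = raw
        self.schema = schema
        self.errors = errors

def fail_loudly(raw: str, schema: dict, errors: list[str]):
    # Log the raw output, the schema, and the validator errors together.
    logger.error("structured output rejected: %s",
                 json.dumps({"raw": raw, "schema": schema, "errors": errors}))
    raise StructuredOutputError(raw, schema, errors)

def user_facing(produce) -> dict:
    # No default object, no partial parse: either a validated result or a clean refusal.
    try:
        return {"ok": True, "invoice": produce()}
    except StructuredOutputError:
        return {"ok": False, "message": "I could not produce a reliable result."}

def doomed():
    fail_loudly('{"currency": "USDX"}', {"type": "object"},
                ["unsupported currency: USDX"])

response = user_facing(doomed)
```

The response carries no invoice key at all when validation fails, so nothing downstream can mistake a refusal for a result.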

Why the ladder needs all five rungs

Every layer exists because the one beneath it cannot finish the job alone. Prompting shapes content and promises nothing about format. Constrained decoding delivers that format and stays blind to truth. Validation is where truth gets checked, though checking is all it can do; it never repairs what it finds. Repair fixes what it can, inside the two or three tries you allow it. Under all of them sits the loud failure, the net that keeps a wrong answer out of the systems that would act on it.

The clearest sign a team has never run this in production is a regex parsing model output. If you reach for re.search to pull JSON out of a model’s reply, you are on rung zero, fighting a battle constrained decoding already won for free. Above that rung sit the validators, and they are not optional, because right now, somewhere, a model is composing a beautiful, valid, completely fabricated invoice, in a currency that does not exist, at a tax rate it invented, and every field will be present, and every type will be correct, and the JSON will be perfect, and we will hear about it three days later, from finance.