Ryan de Melo

Evals Are the New Unit Tests (And You're Not Writing Them)

Ask a team shipping an LLM feature how they know a change made things better. You will hear some version of “we tried a few prompts and it looked good.” That is the whole test suite. That is vibe-checking, and we would fire someone for doing the equivalent to a payments service.

I have watched smart teams put real money behind a model output and then change the prompt on a Friday because someone had a hunch, with nothing to catch the regression. They would never merge a change to the billing path without a test going red first. But the model is new and squishy and nobody told them what the test even looks like here, so they ship on feeling. Then a prompt tweak that fixed one case quietly breaks nine others, and they find out from a support ticket three weeks later.

Here is the claim. An eval is a unit test for a probabilistic function. You have written thousands of unit tests. You can write these. The reason you haven’t is that nobody has shown you the boring version, so you assumed it needed a platform.

The trap is that it always looks fine

The reason vibe-checking survives is that an LLM output almost always looks plausible. A broken function throws. A broken SQL query returns an error. A broken prompt returns a confident, fluent, well-formatted paragraph that happens to be wrong, and your eye slides right over it because it reads like the right answer.

So you tweak the prompt to fix the one case a stakeholder complained about. It works. You ship. What you cannot see, because you only checked the one case, is that the same tweak shifted the model’s behavior on a dozen inputs you didn’t look at. There was no red bar. There is never a red bar unless you build one.

(The dirty part is that the demo always works. The demo is three inputs the author already knows the model handles. Production is the long tail of inputs nobody rehearsed.)

The boring version that actually ships

You do not need a framework. You need a list of real cases, a way to grade an output, and a number that has to not go down. That is it.

Start with the cases. Pull them from real usage, not your imagination. Twenty is enough to begin, and twenty real ones beat two hundred synthetic ones. Each case is an input and some statement of what a good output must contain.

# evals/cases.py
# These are real questions pulled from the support log, with the
# ground truth a human actually confirmed. Not made up. The whole
# value is that these inputs really happened and really broke.

CASES = [
    {
        "id": "refund-window-eu",
        "input": "How long do I have to request a refund in Germany?",
        # must mention the 14-day statutory window, must NOT invent a number
        "must_include": ["14 day"],
        "must_not_include": ["30 day", "60 day"],
    },
    {
        "id": "refund-after-use",
        "input": "Can I get a refund after I've used the product?",
        # the honest answer is "it depends, here is the rule," not a flat yes
        "must_include": ["depends"],
        "must_not_include": ["yes, you can always"],
    },
]

Most of your grading does not need another model. A surprising amount of real eval work is substring checks, regex, “does it parse as JSON,” “is the refund amount within tolerance,” “did it refuse when it should have refused.” These checks are deterministic and free, and the grader itself never flakes; any run-to-run variation comes from the model’s output, not from the check. Reach for them first.
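
A few of those checks, sketched as standalone graders. This is a minimal illustration, not a library: the function names, the default tolerance, and the refusal phrases are all assumptions you would replace with your own.

```python
import json
import re

def parses_as_json(output):
    """True if the output is valid JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True

def matches_pattern(output, pattern):
    """True if the regex appears anywhere in the output (case-insensitive)."""
    return re.search(pattern, output, flags=re.IGNORECASE) is not None

def amount_within_tolerance(output, expected, tolerance=0.01):
    """True if the first number in the output is within tolerance of expected.

    Deliberately naive: it does not handle thousands separators or currencies.
    """
    match = re.search(r"\d+(?:\.\d+)?", output)
    if match is None:
        return False
    return abs(float(match.group()) - expected) <= tolerance

# Hypothetical refusal markers; use the phrases your app actually produces.
REFUSAL_MARKERS = ("i can't help with that", "i cannot help with that")

def refused(output):
    """True if the output contains one of the known refusal phrases."""
    text = output.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)
```

A pattern such as `r"\b14[- ]days?\b"` accepts both “14 days” and “14-day”, which is the kind of variation a plain substring check tends to miss.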

# evals/run.py
from cases import CASES
from app import answer  # the function you actually ship

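def normalize(text):
    # lowercase and treat hyphens as spaces, so "14-day" matches "14 day"
    # and a forbidden "30 day" still catches "30-day"
    return text.lower().replace("-", " ")

def grade(output, case):
    text = normalize(output)
    for phrase in case.get("must_include", []):
        if normalize(phrase) not in text:
            return False, f"missing required phrase: {phrase!r}"
    for phrase in case.get("must_not_include", []):
        if normalize(phrase) in text:
            return False, f"contained forbidden phrase: {phrase!r}"
    return True, "ok"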

def main():
    passed, failures = 0, []
    for case in CASES:
        output = answer(case["input"])
        ok, reason = grade(output, case)
        if ok:
            passed += 1
        else:
            failures.append((case["id"], reason))

    total = len(CASES)
    print(f"{passed}/{total} passed")
    for cid, reason in failures:
        print(f"  FAIL {cid}: {reason}")

    # fail the build. an eval you can ignore is a comment, not a test.
    if failures:
        raise SystemExit(1)

if __name__ == "__main__":
    main()

Wire that into CI so it runs on every change to the prompt and to the app code. Now a prompt change that breaks the refund cases turns the build red, same as any other regression, and the person who broke it finds out in four minutes instead of three weeks.
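
Failing on any single failure is the right gate once the suite is green. If you are starting from a suite that is not there yet, one way to get a number that cannot go down is a ratchet: store the best pass rate so far and fail when a run comes in below it. A minimal sketch, assuming a hypothetical baseline file whose name and JSON shape are made up for illustration:

```python
import json
from pathlib import Path

def check_against_baseline(passed, total, baseline_path):
    """Compare this run's pass rate to the stored baseline.

    Returns (ok, message). The file layout is an assumption of this sketch.
    """
    rate = passed / total if total else 0.0
    path = Path(baseline_path)
    # No baseline file yet means any result is accepted as the first baseline.
    baseline = json.loads(path.read_text())["pass_rate"] if path.exists() else 0.0
    if rate < baseline:
        return False, f"pass rate fell from {baseline:.2%} to {rate:.2%}"
    # Ratchet: record the new high-water mark so it cannot slip back later.
    path.write_text(json.dumps({"pass_rate": rate}))
    return True, f"pass rate {rate:.2%} (baseline was {baseline:.2%})"
```

In CI the updated baseline only sticks if you commit the file or keep it as an artifact; how you do that depends on your setup.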

The cases are the asset. Every time a real output is wrong in production, you do not just patch it. You add it to CASES first, watch the eval go red, then fix it and watch it go green. That is how the suite grows to cover the actual shape of your traffic instead of the shape you imagined.

Where the model has to grade the model

Some things you cannot check with a substring. Tone, whether an answer is grounded in the provided context, whether a summary kept the load-bearing facts. For those you hand the output to another model and ask it to judge. LLM-as-judge. It works better than it has any right to, and it will also lie to you, so here is the honest part.

The judge is biased toward longer answers and toward outputs that look like its own writing. It is not deterministic, so the same output can score 4 one run and 5 the next. And it has no idea what your business considers correct unless you tell it, in detail, with a rubric.

JUDGE_PROMPT = """You are grading a support answer. Be strict.
Score 1-5 ONLY on whether the answer is grounded in the provided
policy text. Ignore tone, length, and how confident it sounds.

Policy: {policy}
Answer: {answer}

Return JSON: {{"score": <int>, "reason": "<one sentence>"}}.
A claim not supported by the policy text is an automatic 1."""
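
Getting from that prompt to a usable score takes a little plumbing. A minimal sketch, assuming a hypothetical `call_model` function (prompt string in, reply string out) that stands in for whatever client you use. Treating an unparseable reply as a 1 is a choice made for this sketch, so a broken judge fails loudly instead of being skipped.

```python
import json

def judge_once(call_model, prompt_template, policy, answer):
    """Run one judge call and return (score, reason).

    call_model is hypothetical: it takes a prompt string and returns the
    model's reply as a string.
    """
    prompt = prompt_template.format(policy=policy, answer=answer)
    reply = call_model(prompt)
    try:
        data = json.loads(reply)
        score = int(data["score"])
        reason = str(data.get("reason", ""))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # An unparseable verdict counts as the lowest score, not a skipped case.
        return 1, f"unparseable judge reply: {reply!r}"
    if not 1 <= score <= 5:
        return 1, f"score out of range: {score}"
    return score, reason
```

Passing the model call in as an argument keeps the function testable with a fake judge, which matters when you want to test the plumbing without paying for model calls.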

Three rules keep the judge honest. Score one narrow thing per call, never “overall quality,” because overall quality is where the bias hides. Calibrate it against fifteen or twenty examples you graded by hand, and if the judge disagrees with you, fix the rubric until it agrees, because a judge you have not calibrated is just a second opinion you didn’t ask for. And run it more than once on the cases that matter, then take the worst score, not the average. You care about the floor.
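
Two of those rules are a few lines each. A minimal sketch, where `judge` is any zero-argument callable that returns a score, and the function names and the `tolerance` parameter are assumptions made for illustration:

```python
def worst_of(judge, n=3):
    """Call judge() n times and return the lowest score seen."""
    return min(judge() for _ in range(n))

def agreement_rate(human_scores, judge_scores, tolerance=0):
    """Fraction of hand-graded examples where the judge lands within tolerance."""
    if len(human_scores) != len(judge_scores):
        raise ValueError("score lists must be the same length")
    if not human_scores:
        raise ValueError("need at least one graded example")
    hits = sum(
        abs(h - j) <= tolerance for h, j in zip(human_scores, judge_scores)
    )
    return hits / len(human_scores)
```

With hand grades of `[1, 5, 3, 4]` and judge scores of `[1, 5, 2, 4]`, the exact agreement rate is 0.75. How high it needs to be before you trust the judge is your call; the point is to measure it before you rely on it.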

Treat the judge as a noisy smoke alarm, not a certified inspector. It is very good at catching the answer that is obviously ungrounded. It is bad at the difference between a 4 and a 5, so do not gate releases on small score moves you cannot reproduce.

The whole thing fits in two files and an afternoon. You already believe an untested payments change is reckless. An untested prompt sitting in front of your customers is the same bet, made with the same money, and right now you are making it on a hunch. What is your refund-window case, and why isn’t it red yet?

