Evals Are the New Unit Tests (And You're Not Writing Them)

The prompt change that kicked all this off went in on a Friday, on somebody’s hunch. When I asked the team how they knew it had made anything better, the best answer anyone gave me was “we tried a few prompts and it looked good.” That was the entire test suite: vibe-checking, which on a payments service would get someone fired.

These were not careless engineers, just people who had real money riding on the model’s output and would never have merged a change to the billing path without watching a test go red first. The model was new and squishy, though, and nobody had told them what a test even looks like here, so they shipped on feeling. The tweak that quieted one complaint broke nine other cases nobody was watching, and the team heard about it from a support ticket three weeks later.

An eval is just a unit test for a probabilistic function. You have written thousands of those, which means you can already write these; the only reason you haven’t is that nobody showed you the boring version, so you assumed the job needed a platform.

The trap is that it always looks fine

Vibe-checking survives because an LLM output almost always looks plausible. A broken function throws; a broken SQL query returns an error. A broken prompt hands you a confident, fluent, well-formatted paragraph that happens to be wrong, and your eye slides right over it because it reads like the right answer.

You tweak the prompt to fix the one case a stakeholder complained about, it works, and you ship. What you cannot see, because the one case is all you checked, is that the same tweak shifted the model’s behavior on a dozen inputs you never looked at. There was no red bar, because there is never a red bar here unless you build one.

(This is why the demo always works. A demo is three inputs the author already knows the model handles; production is the long tail nobody rehearsed.)

The boring version that actually ships

You do not need a framework, only three things: a list of real cases, a way to grade an output, and a number that is not allowed to go down.

Start with the cases, and pull them from real usage rather than your imagination. Twenty is enough to begin, and twenty real ones beat two hundred synthetic ones. Each case is an input plus some statement of what a good output has to contain.

# evals/cases.py
# These are real questions pulled from the support log, with the
# ground truth a human actually confirmed. Not made up. The whole
# value is that these inputs really happened and really broke.

CASES = [
    {
        "id": "refund-window-eu",
        "input": "How long do I have to request a refund in Germany?",
        # must mention the 14-day statutory window, must NOT invent a number
        "must_include": ["14 day"],
        "must_not_include": ["30 day", "60 day"],
    },
    {
        "id": "refund-after-use",
        "input": "Can I get a refund after I've used the product?",
        # the honest answer is "it depends, here is the rule," not a flat yes
        "must_include": ["depends"],
        "must_not_include": ["yes, you can always"],
    },
]

Most of your grading does not need another model at all. A surprising amount of real eval work is substring checks, regex, “does it parse as JSON,” “is the refund amount within tolerance,” “did it refuse when it should have refused.” Those checks are deterministic and free, and the grader itself never flakes, so reach for them before anything fancier.

# evals/run.py
from cases import CASES
from app import answer  # the function you actually ship

def normalize(s):
    # lowercase and treat hyphens as spaces, so "14-day" matches "14 day"
    return s.lower().replace("-", " ")

def grade(output, case):
    text = normalize(output)
    for phrase in case.get("must_include", []):
        if normalize(phrase) not in text:
            return False, f"missing required phrase: {phrase!r}"
    for phrase in case.get("must_not_include", []):
        if normalize(phrase) in text:
            return False, f"contained forbidden phrase: {phrase!r}"
    return True, "ok"

def main():
    passed, failures = 0, []
    for case in CASES:
        output = answer(case["input"])
        ok, reason = grade(output, case)
        if ok:
            passed += 1
        else:
            failures.append((case["id"], reason))

    total = len(CASES)
    print(f"{passed}/{total} passed")
    for cid, reason in failures:
        print(f"  FAIL {cid}: {reason}")

    # fail the build. an eval you can ignore is a comment, not a test.
    if failures:
        raise SystemExit(1)

if __name__ == "__main__":
    main()
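The harness only does phrase checks. The other deterministic checks named earlier (regex, JSON parsing, numeric tolerance, refusal detection) are each a few lines. Here is a minimal sketch; the four function names, the refusal markers, and the number format are all assumptions made up for illustration, not taken from any real system.

```python
import json
import re

# Hypothetical refusal markers; replace with phrases your model really uses.
REFUSAL_MARKERS = ("i can't help with", "i cannot help with", "i'm not able to")

def matches_pattern(output, pattern):
    """True if the regex appears anywhere in the output, ignoring case."""
    return re.search(pattern, output, flags=re.IGNORECASE) is not None

def parses_as_json(output):
    """True if the whole output is valid JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True

def amount_within_tolerance(output, expected, tolerance=0.01):
    """True if the first number in the output is within tolerance of expected."""
    # Drop thousands separators so "1,249.50" reads as one number.
    match = re.search(r"\d+(?:\.\d+)?", output.replace(",", ""))
    if match is None:
        return False
    return abs(float(match.group()) - expected) <= tolerance

def refused(output):
    """True if the output contains any of the refusal markers."""
    # Fold curly apostrophes to straight ones before matching.
    text = output.lower().replace("\u2019", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)
```

Each one returns a plain bool, so any of them can be called from grade alongside the phrase checks. The tolerance check reads only the first number in the output, which is enough for a one-amount answer and wrong for anything richer.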

Wire run.py into CI so it runs on every change to the prompt and to the app code. A prompt change that breaks the refund cases now turns the build red, the same as any other regression, and whoever broke it finds out in four minutes instead of three weeks.
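Failing on any failure is the strictest form of “a number that is not allowed to go down”: the number is the pass count and the floor is all of them. If you adopt the suite while a few cases are still red, one variation is a ratchet that fails the build only when the pass rate drops below the best one recorded. This is a sketch of that idea, not part of the harness above; the baseline file, its "pass_rate" key, and both function names are hypothetical.

```python
import json
from pathlib import Path

def load_baseline(path):
    """Return the stored pass rate, or 0.0 if no baseline exists yet."""
    p = Path(path)
    if not p.exists():
        return 0.0
    return json.loads(p.read_text())["pass_rate"]

def check_against_baseline(passed, total, path):
    """Return (ok, pass_rate); ok is False when the pass rate dropped."""
    pass_rate = passed / total if total else 0.0
    if pass_rate < load_baseline(path):
        # Regression: leave the stored baseline untouched.
        return False, pass_rate
    # Ratchet: the stored number only ever moves up.
    Path(path).write_text(json.dumps({"pass_rate": pass_rate}))
    return True, pass_rate
```

The baseline file has to persist between runs for this to mean anything, which usually means committing it next to the cases. Once the suite is fully green, the plain “any failure fails the build” rule is simpler and harder to fool.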

The cases themselves are the real asset here. When a real output is wrong in production, you don’t just patch it; you add it to CASES first, watch the eval go red, then fix it and watch it go green. The suite grows that way, until it covers the actual shape of your traffic instead of the shape you imagined.

Where the model has to grade the model

Some things a substring cannot catch: tone, whether an answer is grounded in the provided context, whether a summary kept the load-bearing facts. For those, you hand the output to another model and ask it to judge, a trick people call LLM-as-judge. It works better than it has any right to, and it will also lie to you.

The judge leans toward longer answers and toward outputs that look like its own writing. It is not deterministic, so the same output can score 4 on one run and 5 on the next. Worse, it has no idea what your business considers correct unless you tell it, in detail, with a rubric.

JUDGE_PROMPT = """You are grading a support answer. Be strict.
Score 1-5 ONLY on whether the answer is grounded in the provided
policy text. Ignore tone, length, and how confident it sounds.

Policy: {policy}
Answer: {answer}

Return JSON: {{"score": <int>, "reason": "<one sentence>"}}.
A claim not supported by the policy text is an automatic 1."""
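A prompt like that still needs a few lines around it to fill in the fields and read the JSON back. Here is a rough sketch, assuming a hypothetical call_model function (prompt in, reply text out) that wraps whatever model client you use. The function name judge and the choice to score a malformed reply as 1 are also assumptions, picked so that a broken judge fails loudly rather than passing quietly.

```python
import json

# Shortened stand-in so the sketch runs on its own; use the real JUDGE_PROMPT.
JUDGE_PROMPT = (
    "Policy: {policy}\nAnswer: {answer}\n"
    'Return JSON: {{"score": <int>, "reason": "<one sentence>"}}.'
)

def judge(policy, answer, call_model):
    """Fill in the prompt, call the model, and return (score, reason).

    call_model is a hypothetical callable (prompt: str) -> str.
    Unparseable or out-of-range replies are scored 1.
    """
    prompt = JUDGE_PROMPT.format(policy=policy, answer=answer)
    raw = call_model(prompt)
    try:
        data = json.loads(raw)
        score = int(data["score"])
        reason = str(data.get("reason", ""))
    except (ValueError, KeyError, TypeError):
        return 1, f"unparseable judge reply: {raw!r}"
    if not 1 <= score <= 5:
        return 1, f"score out of range: {score}"
    return score, reason
```

Passing call_model in as an argument keeps the model client out of the grading logic, so the parsing and range checks can themselves be tested with a fake that returns canned replies.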

Three rules keep the judge honest. First, score one narrow thing per call, never “overall quality,” because overall quality is where the bias hides. Second, calibrate it against fifteen or twenty examples you graded by hand; if the judge disagrees with you, fix the rubric until it agrees, because a judge you have not calibrated is just a second opinion you didn’t ask for. Third, run it more than once on the cases that matter and take the worst score rather than the average, because the floor is what you care about.
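The second and third rules are small enough to write down. Here is a minimal sketch, assuming scores are plain integers and that score_once is whatever zero-argument function runs your judge once on a given case; both function names are made up for illustration.

```python
def worst_of(score_once, runs=3):
    """Call score_once() several times and return the lowest score."""
    return min(score_once() for _ in range(runs))

def agreement_rate(human_scores, judge_scores, tolerance=0):
    """Fraction of cases where the judge is within tolerance of the human."""
    if len(human_scores) != len(judge_scores):
        raise ValueError("score lists must be the same length")
    if not human_scores:
        raise ValueError("need at least one hand-graded example")
    hits = sum(
        1 for h, j in zip(human_scores, judge_scores) if abs(h - j) <= tolerance
    )
    return hits / len(human_scores)
```

For calibration, run the judge over your hand-graded examples and look at agreement_rate before you trust it on anything else. What rate counts as “agrees” is your call; the point is to have a number to move while you fix the rubric.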

Treat the judge as a noisy smoke alarm, not a certified inspector. It reliably catches the answer that is obviously ungrounded, and it is hopeless at the gap between a 4 and a 5, so do not gate releases on small score moves you cannot reproduce.

The whole thing fits in two files and an afternoon. You already treat an untested payments change as reckless, and an untested prompt in front of your customers is the same bet, the same money, placed on the same hunch. With the harness in place, the next Friday change trips a red bar in four minutes, and the support ticket three weeks out never gets written.