Agents Are Coming. Most Demos Are Lying.

On stage, someone types a fuzzy request in plain English. The agent thinks out loud, calls a tool, calls another, books the thing, files the thing, and announces success with a little flourish. The room nods, and someone asks how soon this can be in production.

The demo always works, and that is the first thing you should distrust about it. I have watched a dozen agent demos this year and every one followed the same arc. I sit there doing the arithmetic nobody on stage is doing, because the demo ran one happy path one time, and production is the same path run ten thousand times against a world that pushes back.

An agent is a loop, and a loop multiplies its weakest step. If each step in a chain succeeds ninety-five times out of a hundred, that sounds fine until you put eight of them in a row, where the odds of a clean run drop to about two in three. A third of your runs fail somewhere, and they do not fail politely: they fail in the middle, three tool calls deep, having already moved something in the real world that now has to be moved back.
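The arithmetic is short enough to check yourself. Here is a minimal sketch, assuming every step succeeds independently with the same probability; real chains are messier, and the function name is only illustrative.

```python
def clean_run_probability(per_step_success: float, steps: int) -> float:
    """Chance that every step in a chain succeeds.

    Assumes steps fail independently and share one success rate,
    which is a simplification of how real agent runs behave.
    """
    return per_step_success ** steps


if __name__ == "__main__":
    for steps in (1, 3, 8, 20):
        p = clean_run_probability(0.95, steps)
        print(f"{steps:>2} steps: {p:.1%} of runs finish clean")
```

At eight steps this gives roughly 66 percent, the "two in three" above, and at twenty steps it falls to roughly 36 percent.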

Compounding error is the whole problem

Single-shot LLM calls are forgiving. You ask a question, you get an answer, and if it is wrong you see it and try again, so the blast radius stays at one response.

An agent is different in kind, not degree. It takes its own previous output as the next input, and a small wrong turn early gets amplified by everything downstream that trusts it. The model misreads a date, decides the invoice is overdue, drafts a dunning email, queues it, and updates a status field, all internally consistent and all built on the one bad read at step one. None of the later steps catch the error because none of them were checking the premise; they were executing it.

This is what the demos hide by being short: three steps rarely compound enough to break, and the interesting work is never three steps.

There is no recovery, only retry

Watch what an agent does when a step fails, and you learn how good it actually is.

Mostly what it does is try again, the same approach, maybe slightly reworded, because the only recovery strategy it has is “the model will figure it out this time.” Sometimes that works; more often it loops, burning tokens and minutes, until it hits a step cap and gives up, or worse, talks itself into a fabricated success and reports done. A wrong step does not announce itself, and the agent cannot reliably tell good steps from bad ones, because if it could, it would not have taken the bad one. (The same flaw that caused the error is the one you are asking to detect it.)

Real recovery means knowing what state you are in, what changed, and how to undo it. Agents in late 2024 are weak at all three. They have no durable sense of state beyond what fits in the context window, and the context window is a goldfish.
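To make "what changed, and how to undo it" concrete, here is one possible shape for it: an append-only journal on disk, outside the context window, where every completed action is recorded with enough detail to reverse it. This is a sketch under assumptions, not a description of any existing framework. `ActionJournal`, `UNDO_HANDLERS`, and the JSON-lines file format are all hypothetical names and choices made for illustration.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical registry: action name -> function that reverses that action.
UNDO_HANDLERS: dict[str, Callable[[dict], None]] = {}


class ActionJournal:
    """Append-only log of completed actions, stored on disk."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def record(self, action: str, args: dict) -> None:
        # Write one JSON object per line after the action has succeeded.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps({"action": action, "args": args}) + "\n")

    def entries(self) -> list[dict]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def rollback(self) -> int:
        """Undo recorded actions newest-first; return how many were undone."""
        done = self.entries()
        for entry in reversed(done):
            UNDO_HANDLERS[entry["action"]](entry["args"])
        self.path.unlink(missing_ok=True)
        return len(done)
```

The design choice worth noticing is that the journal stores the previous value, not just the new one, so the undo handler does not have to ask the model what the world used to look like.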

The loop has a bill, and it is not small

Cost and latency are where the demo really lies, because a demo is free and you are not watching the clock.

Every step in the loop is a model call, and every model call drags the entire growing transcript along with it: the conversation, the tool definitions, and the results of every prior call are all resent on each turn. A chain that runs twelve steps is not twelve cheap calls but twelve increasingly expensive ones, because the prompt grows with each turn, and the token bill scales with how stuck the agent got. Put that behind a user who is waiting, and “it works” turns into “it works if you are patient and rich.”
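A back-of-envelope version of that bill, as a sketch: the token counts below are invented for illustration, and the model ignores output tokens and any prompt caching a provider might offer, both of which change the real number.

```python
def total_prompt_tokens(base_tokens: int, tokens_per_step: int, steps: int) -> int:
    """Prompt tokens summed over a run that resends the whole transcript.

    base_tokens: system prompt plus tool definitions, sent on every turn.
    tokens_per_step: assumed transcript growth per completed step.
    """
    return sum(base_tokens + tokens_per_step * turn for turn in range(steps))


if __name__ == "__main__":
    # Illustrative numbers only: 2,000 fixed tokens, 1,500 added per step.
    for steps in (3, 12):
        print(steps, "steps:", total_prompt_tokens(2_000, 1_500, steps), "prompt tokens")
```

With these made-up numbers, three steps cost 10,500 prompt tokens and twelve steps cost 123,000: four times the steps, nearly twelve times the tokens.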

I ran the numbers on an internal automation we were tempted to hand to an agent. It was correct slightly more often than the scripted version and cost an order of magnitude more per run, with a latency distribution that had a long, ugly tail. We kept the script.

Tools fail silently, and the agent believes them

This one is subtle and it bit us hardest.

When a tool call fails, it does not always throw. Sometimes it returns an empty list, a stale cache, or a default object that looks real. A human reads [] and thinks “that is suspicious, let me check.” The agent reads [] and concludes there are zero matching records, which is a perfectly valid conclusion from the data it was handed, and proceeds confidently to do the wrong thing for an excellent reason. The model is only as honest as its tools, and most tools were written to be called by code that checks return values, not by a model that takes them at face value.

So you wrap every tool in a layer that makes failure loud, distinguishes “no results” from “the query errored,” and refuses to let a malformed result pass as a normal one. That wrapping is most of the real engineering; the agent, as always, is the easy part.
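A minimal sketch of such a wrapper follows, assuming tools are plain Python callables that are supposed to return a list of records. `ToolError`, `ToolResult`, and `call_loudly` are hypothetical names. One caveat: a wrapper like this can only make loud the failures it can see. A tool that swallows its own exception and returns [] still has to be fixed at the source.

```python
from dataclasses import dataclass
from typing import Any, Callable


class ToolError(Exception):
    """Raised when a tool call fails or returns something malformed."""


@dataclass(frozen=True)
class ToolResult:
    status: str    # "ok" or "empty", never used for failures
    records: list


def call_loudly(tool: Callable[..., Any], *args: Any, **kwargs: Any) -> ToolResult:
    name = getattr(tool, "__name__", repr(tool))
    try:
        raw = tool(*args, **kwargs)
    except Exception as exc:
        # "The query errored" becomes an exception, not an empty result.
        raise ToolError(f"{name} raised {exc!r}") from exc
    if not isinstance(raw, list):
        # None, a default object, or any other shape is refused outright.
        raise ToolError(f"{name} returned {type(raw).__name__}, expected list")
    # A real, successful query with zero rows is labeled explicitly.
    return ToolResult(status="ok" if raw else "empty", records=raw)
```

The agent then sees either a labeled result or an error it has to handle, and an "empty" status can be surfaced in the transcript as "the query ran and matched nothing" instead of a bare [].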

Where they actually earn their keep

I am skeptical without being dismissive, because there are shapes of problem where agents already pay for themselves today, and they all look alike once you notice the pattern: the chain is short, the tools are bounded, a human owns the judgment call, and every action is reversible.

The safe version is drafting instead of sending. An agent that reads a support ticket, pulls the relevant history, and writes a suggested reply for a human to approve is genuinely useful, because the human is the send button and a bad draft costs nothing. The same holds for triage and routing, where the worst case is a misfiled ticket someone reassigns, and for code that an agent proposes as a pull request, never a merge, because the diff is a checkpoint and the tests are a gate. It holds for anything where the agent gathers and proposes and a person commits the irreversible step.
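In code, "the human is the send button" can be as small as a flag the agent has no way to set. The sketch below assumes a support-reply workflow. `Draft`, `approve`, `send`, and the `deliver` callback are hypothetical names, and a real system would enforce the boundary with permissions, not convention.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Draft:
    ticket_id: str
    body: str
    approved: bool = False


def approve(draft: Draft) -> Draft:
    """Meant to be reachable only from the human review step, not the agent's tools."""
    return Draft(draft.ticket_id, draft.body, approved=True)


def send(draft: Draft, deliver: Callable[[str, str], None]) -> None:
    """Perform the irreversible step, but only for an approved draft."""
    if not draft.approved:
        raise PermissionError("draft has not been approved by a human")
    deliver(draft.ticket_id, draft.body)
```

The agent's tool list would include something that produces a `Draft` and nothing that calls `approve`, so the worst it can do is write a bad suggestion.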

That is a narrower promise than the demo makes, and it is also the version that ships.

What I am watching for

Trustworthy agents will not come from a smarter model. A smarter model raises the per-step success rate, which helps, but it does not change the shape of the problem, because you will just point it at longer chains and arrive at the same failure math. What changes the shape is durable state, real recovery, and tools that fail loudly. The unglamorous infrastructure, in other words, which is the same lesson every previous wave of automation taught us.

When someone shows me an agent that handles its own failures gracefully, that knows what it changed and can put it back, that gets cheaper to run as the task gets clearer, I will be the first to retire this post. I have not seen that agent yet, and so far all I have seen is a lot of demos.