An agent issued a refund to the wrong account in a staging run, and the only reason it didn’t reach production was that staging pointed at a sandbox ledger. The logs said it did exactly what it was told. It found a customer, matched an invoice, called the refund tool, and reported success. Every step was reasonable. The composition was wrong, and nothing in the system was built to catch a wrong composition of right steps.
That is the whole problem with agents that act on real systems. Each individual decision looks defensible in the trace. The failure lives in the sequence, and the sequence is exactly the part you handed to a model that will, a few percent of the time, do something fluent and confident and wrong.
So when people ask how I let an agent touch money, the honest answer is that I don’t let it. I let it ask. Autonomy is a dial, not a switch, and you earn each click of it with evidence, not optimism.
The tool surface is the blast radius
The first mistake I see is handing an agent a client object. Someone wraps the whole payments SDK, or the whole internal API, gives the agent the credentials, and writes a system prompt that says “only issue refunds when appropriate.” That prompt is not a control. It is a suggestion to a system that does not reliably follow suggestions.
The tool surface you expose is the maximum damage the agent can do, full stop. If delete_customer is reachable, assume it will eventually get called on the wrong customer. The fix is boring and it works: expose an allowlist of narrow, single-purpose tools, not a transport to the underlying API.
# The agent gets THESE four functions. Not the payments client.
# Every tool here is single-purpose and validates its own inputs,
# because the model is not the thing I trust to validate them.
AGENT_TOOLS = {
    "lookup_invoice": lookup_invoice,            # read-only
    "lookup_customer_balance": lookup_balance,   # read-only
    "propose_refund": propose_refund,            # writes a PROPOSAL, not a refund
    "list_refund_reasons": list_reasons,         # read-only
}

# Notice what is missing: there is no issue_refund, no charge_card,
# no adjust_ledger. The agent literally cannot name those operations.
# Irreversible actions don't live in the agent's vocabulary at all.
The agent can read freely. Reads are cheap to get wrong. The one write it can reach does not move money; it creates a proposal. The thing that actually moves money lives behind a gate the agent cannot open on its own, and that gate is the next piece.
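The allowlist only helps if the dispatch layer fails closed on anything outside it. A minimal sketch of that, assuming the model's tool calls arrive as a name plus a dict of arguments; `dispatch`, `ToolNotAllowed`, and the stub tool bodies are hypothetical names for illustration, not the real tools:

```python
from decimal import Decimal, InvalidOperation


class ToolNotAllowed(Exception):
    """Raised when the model names a tool outside the allowlist."""


def lookup_invoice(invoice_id: str) -> dict:
    # Stub: a real version would query the billing system (read-only).
    return {"invoice_id": invoice_id, "amount": "40.00", "currency": "USD"}


def propose_refund(invoice_id: str, amount: str, reason: str) -> dict:
    # The tool validates its own inputs; it does not trust the model.
    try:
        value = Decimal(amount)
    except InvalidOperation:
        raise ValueError(f"amount is not a number: {amount!r}")
    if not value.is_finite() or value <= 0:
        raise ValueError("amount must be a positive, finite number")
    # Writes a proposal record only. No money moves here.
    return {"kind": "proposal", "invoice_id": invoice_id,
            "amount": str(value), "reason": reason}


# Hypothetical two-tool allowlist, trimmed for the example.
AGENT_TOOLS = {
    "lookup_invoice": lookup_invoice,
    "propose_refund": propose_refund,
}


def dispatch(tool_name: str, arguments: dict):
    """Route a model tool call. Unknown names fail closed."""
    tool = AGENT_TOOLS.get(tool_name)
    if tool is None:
        raise ToolNotAllowed(tool_name)
    return tool(**arguments)
```

With this shape, a call to `dispatch("issue_refund", ...)` raises `ToolNotAllowed` no matter what the prompt said, because there is nothing for the name to resolve to.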
A proposal is not an action
Here is the part nobody tells you about approval gates. The point is not the human clicking yes. The point is forcing the agent to commit to a fully specified action before anyone, human or policy, decides on it. A vague intent cannot be checked. A concrete proposal can.
So the irreversible path always runs in two beats. The agent produces a proposal: a typed, complete description of what it wants to happen, with the exact amount, the exact account, the reason, and the idempotency key already chosen. Then a separate checkpoint decides whether that specific proposal proceeds. The agent never holds both halves.
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class RefundProposal:
    invoice_id: str
    amount: Decimal
    currency: str
    reason: str
    idempotency_key: str  # chosen at proposal time so a retry can't double-pay


def with_checkpoint(action_name, policy):
    """Wrap an irreversible action so it cannot fire without clearing policy.
    The agent calls the wrapped version. It never sees the raw effect."""
    def wrapper(proposal, ctx):
        decision = policy.evaluate(action_name, proposal, ctx)
        audit.record(action_name, proposal, decision, ctx)  # before anything fires
        if decision.outcome == "auto_approved":
            return _commit(action_name, proposal, ctx)
        if decision.outcome == "needs_human":
            # Park it. The agent run ENDS here for this branch.
            # A human resolves it out of band; we don't block a token loop
            # waiting on a person to wake up.
            return parked(decision.review_id, proposal)
        # Default is no. An action that isn't explicitly allowed is denied,
        # not allowed-because-nothing-said-no.
        return denied(decision.reason)
    return wrapper
The shape that matters: deny is the default. If the policy has nothing to say about a proposal, the proposal does not proceed. I have watched teams build the inverse, where the gate only stops things it was told to stop, and every new action type ships wide open until someone notices. Build the gate so silence means no.
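To make "silence means no" concrete, here is a small sketch, assuming policy rules are registered per action type. The `RULES` registry, the `close_account` action, and the 20-unit threshold are all invented for the example; the point is only that an action with no rule, or a rule that returns something unexpected, lands in the deny branch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decision:
    outcome: str
    reason: Optional[str] = None


def refund_rule(proposal: dict) -> Decision:
    # Hypothetical threshold, purely illustrative.
    if proposal["amount"] <= 20:
        return Decision("auto_approved")
    return Decision("needs_human")


# Only actions with a registered rule can ever be approved.
RULES = {"refund": refund_rule}


def evaluate(action_name: str, proposal: dict) -> Decision:
    rule = RULES.get(action_name)
    if rule is None:
        # The policy has nothing to say, so the answer is no.
        return Decision("denied", reason="no_policy_for_action")
    return rule(proposal)


def gate(action_name: str, proposal: dict) -> str:
    decision = evaluate(action_name, proposal)
    if decision.outcome == "auto_approved":
        return "committed"
    if decision.outcome == "needs_human":
        return "parked"
    # Everything else, including outcomes nobody anticipated, is denied.
    return "denied"
```

A new action type shipped without a rule is denied on its first call, which is the failure you want: loud, early, and harmless.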
What the policy actually checks
The policy is where your real rules live, and they are dumber than people expect. Thresholds, allowlists, rate limits. You do not need a model to decide whether to escalate; you need a model to do the messy work of reading invoices, and arithmetic to decide whether a human looks at the result.
def evaluate(self, action, proposal, ctx):
    # Anything above the line, or anything touching an account that's
    # been flagged, goes to a person. This is not the agent's call.
    # Checked first, so a small refund on a flagged account can't
    # slip through the auto-approve branch.
    if proposal.amount > self.auto_limit or ctx.account_flagged:
        return Decision("needs_human", review_id=open_review(proposal))
    # Small, reversible-ish amounts the agent has earned the right to auto-run.
    if proposal.amount <= self.auto_limit and ctx.refunds_today < self.daily_cap:
        return Decision("auto_approved")
    # Tripped a rate limit? Stop. A correct agent doesn't issue forty
    # refunds in a minute; a looping one does, and that's exactly when
    # you want the brakes to be mechanical, not advisory.
    return Decision("denied", reason="rate_limit")
The auto-approve limit starts at zero. On day one every proposal goes to a human, and you watch. When you have a few hundred proposals that the policy would have auto-approved at a slightly higher limit in a dry run, and a human agreed with every one, you raise the limit a notch. That is what earning a click of the dial looks like. Not a vibe that the agent “seems reliable now,” but a stack of decisions you checked.
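The evidence check behind a click of the dial can itself be mechanical. A rough sketch, assuming you keep each reviewed dry-run proposal as an `(amount, human_agreed)` pair; the function name and the `min_samples=300` default are assumptions standing in for "a few hundred":

```python
from decimal import Decimal


def limit_is_supported(reviewed, candidate_limit, min_samples=300):
    """Return True only if the record backs raising the limit.

    reviewed: list of (amount, human_agreed) pairs from dry-run
    proposals that a person actually looked at.
    """
    # Only proposals the candidate limit would have auto-approved count.
    in_scope = [agreed for amount, agreed in reviewed
                if amount <= candidate_limit]
    # Enough samples, and not a single disagreement among them.
    return len(in_scope) >= min_samples and all(in_scope)


# 300 small refunds a human agreed with, plus one larger one they rejected.
reviewed = [(Decimal("5.00"), True)] * 300 + [(Decimal("80.00"), False)]
```

Here `limit_is_supported(reviewed, Decimal("10.00"))` is `True`, while the same call with `Decimal("100.00")` is `False`, because the rejected 80.00 proposal falls inside the wider limit.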
Dry-run is not a testing afterthought
Every tool that has an effect gets a simulate mode wired in from the start, not bolted on later. In simulate mode the proposal runs the full path, policy and all, and the commit step returns what it would have done instead of doing it. This is how you let an agent run a thousand times against real data and produce zero real effects, which is the only honest way to measure it before you trust it.
def _commit(action, proposal, ctx):
    if ctx.mode == "simulate":
        # Full path, real validation, real policy, no money moves.
        # The output is identical in shape to a real commit so the
        # agent can't tell the difference and behave differently.
        # (SimulatedResult has to expose the same fields the real
        # result does, or this guarantee is fiction.)
        return SimulatedResult(action, proposal, would_have="committed")
    return effects[action](proposal)  # the real, irreversible thing
The detail that makes this work is that the agent gets back a result identical in shape to the real one. If simulate returns a different structure, the agent learns to behave differently in simulation than in production, and your whole evaluation is measuring a system you will never actually run. (I have made this mistake. The agent was a model citizen in dry-run and a menace the day we flipped the flag, because dry-run had quietly been an easier game.)
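One way to keep the two shapes from drifting apart is to return the same result type from both paths and strip the simulation marker before the agent sees it. This is a sketch under that assumption; `CommitResult`, `commit`, and `agent_view` are hypothetical names, and the "live" mode string is invented for the example:

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class CommitResult:
    action: str
    idempotency_key: str
    status: str
    simulated: bool  # internal bookkeeping; never shown to the agent


def commit(action, idempotency_key, mode, effect):
    """Run the effect only in live mode; return the same type either way."""
    if mode == "live":
        effect(idempotency_key)  # the real, irreversible thing
    elif mode != "simulate":
        # An unrecognised mode must not fall through to a real effect.
        raise ValueError(f"unknown mode: {mode!r}")
    return CommitResult(action, idempotency_key, "committed",
                        simulated=(mode == "simulate"))


def agent_view(result: CommitResult) -> dict:
    """What goes back into the agent's context: identical in both modes."""
    view = asdict(result)
    view.pop("simulated")
    return view
```

The `simulated` flag still lands in your own records, so you can tell the runs apart even though the agent cannot.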
The audit trail is the product
Notice that the wrapper called audit.record before any branch acted on the decision. That ordering is deliberate. You log the proposal and the decision first, then you act, so that if the act crashes mid-flight you still have a record of intent. An audit trail that only captures successes is a trail that goes dark exactly when you need it.
What goes in it is every tool call, with its inputs, the policy decision, who or what approved it, and the idempotency key that ties a proposal to its eventual effect. When something goes wrong at two in the morning, and it will, the question is always the same: what did the agent try to do, what stopped it or didn’t, and why. If you cannot answer that from the log in one pass, you do not have an audit trail, you have a hope.
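As an illustration of intent-first logging keyed by the idempotency key, here is a sketch that uses an in-memory list as a stand-in for a real append-only store. `AuditLog`, `run_with_audit`, and the event names are assumptions made for the example, not a description of any particular system:

```python
import json
import time


class AuditLog:
    """Append-only log of JSON lines (in memory here, for illustration)."""

    def __init__(self):
        self._lines = []

    def record(self, event, idempotency_key, **fields):
        entry = {"ts": time.time(), "event": event,
                 "idempotency_key": idempotency_key, **fields}
        # default=str keeps non-JSON values (e.g. Decimal) loggable.
        self._lines.append(json.dumps(entry, sort_keys=True, default=str))

    def history(self, idempotency_key):
        """Everything that happened to one proposal, in order."""
        entries = (json.loads(line) for line in self._lines)
        return [e for e in entries if e["idempotency_key"] == idempotency_key]


def run_with_audit(log, key, proposal, decision, effect):
    # Intent and decision are written BEFORE the effect is attempted.
    log.record("intent", key, proposal=proposal, decision=decision)
    if decision != "auto_approved":
        return None
    try:
        result = effect(proposal)
    except Exception as exc:
        # A crash mid-flight still leaves a trace tied to the same key.
        log.record("failed", key, error=repr(exc))
        raise
    log.record("committed", key, result=result)
    return result
```

Calling `log.history(key)` then answers the two-in-the-morning question for one proposal in a single pass: what was intended, what the policy decided, and whether the effect committed or failed.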
The trail is also what lets you raise the dial honestly. Every limit I have ever relaxed, I relaxed because the log told me a story I could defend to someone whose money was on the line. Reads first, then proposals, then small auto-approved actions, then larger ones, each step backed by a pile of recorded decisions that went the way they should have.
An agent that can read everything and propose anything and commit nothing without clearing a gate is not a crippled agent. It is the only kind I would put near a ledger. The autonomy comes later, one click at a time, and every click has a paper trail behind it. If you cannot point at the evidence for the last click, you turned the dial too far.