Agentic Workflows Need Guardrails, Not Vibes

The log showed a clean run: the agent found a customer, matched an invoice, called the refund tool, and reported success. Every line of it was true, and the money had still gone to the wrong account. The only reason it never reached production is that the staging run pointed at a sandbox ledger. Each step was reasonable on its own, but the composition was wrong, and nothing in the system was built to catch a wrong composition of right steps.

That is the whole problem with agents that act on real systems. Each individual decision looks defensible in the trace. The failure lives in the sequence, and the sequence is what you handed to a model that will, a few percent of the time, do something fluent and confident and wrong.

When people ask how I let an agent touch money, the honest answer is that I don’t. I let it ask. Autonomy is a dial rather than a switch, and you earn each click of it with evidence instead of optimism.

The tool surface is the blast radius

The first mistake I see is handing an agent a client object. Someone wraps the whole payments SDK, or the whole internal API, gives the agent the credentials, and writes a system prompt that says “only issue refunds when appropriate.” That prompt is not a control, just a suggestion aimed at a system that does not reliably follow suggestions.

The tool surface you expose is the maximum damage the agent can do, full stop. If delete_customer is reachable, assume it will eventually get called on the wrong customer. The fix is boring and it works: expose an allowlist of narrow, single-purpose tools, not a transport to the underlying API.

# The agent gets THESE four functions. Not the payments client.
# Every tool here is single-purpose and validates its own inputs,
# because the model is not the thing I trust to validate them.

AGENT_TOOLS = {
    "lookup_invoice": lookup_invoice,        # read-only
    "lookup_customer_balance": lookup_balance, # read-only
    "propose_refund": propose_refund,         # writes a PROPOSAL, not a refund
    "list_refund_reasons": list_reasons,      # read-only
}

# Notice what is missing: there is no issue_refund, no charge_card,
# no adjust_ledger. The agent literally cannot name those operations.
# Irreversible actions don't live in the agent's vocabulary at all.

The agent can read freely, because a wrong read costs little. The one write it can reach does not move money, it creates a proposal. Actually moving money happens behind a gate the agent cannot open on its own, and that gate is the next piece.

A proposal is not an action

Approval gates get misread as being about a human clicking yes. What they actually buy you is the agent committing to a fully specified action before anyone, human or policy, decides on it. A vague intent gives a reviewer nothing to check; a concrete proposal gives them everything.

The irreversible path always runs in two beats. First the agent produces a proposal: a typed, complete description of what it wants to happen, with the exact amount, the exact account, the reason, and the idempotency key already chosen. A separate checkpoint then decides whether that specific proposal proceeds, and the agent never holds both halves.

from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class RefundProposal:
    invoice_id: str
    amount: Decimal
    currency: str
    reason: str
    idempotency_key: str   # chosen at proposal time so a retry can't double-pay

def with_checkpoint(action_name, policy):
    """Wrap an irreversible action so it cannot fire without clearing policy.
    The agent calls the wrapped version. It never sees the raw effect."""
    def wrapper(proposal, ctx):
        decision = policy.evaluate(action_name, proposal, ctx)

        audit.record(action_name, proposal, decision, ctx)  # before anything fires

        if decision.outcome == "auto_approved":
            return _commit(action_name, proposal, ctx)

        if decision.outcome == "needs_human":
            # Park it. The agent run ENDS here for this branch.
            # A human resolves it out of band; we don't block a token loop
            # waiting on a person to wake up.
            return parked(decision.review_id, proposal)

        # Default is no. An action that isn't explicitly allowed is denied,
        # not allowed-because-nothing-said-no.
        return denied(decision.reason)
    return wrapper

The default has to be deny: if the policy has nothing to say about a proposal, the proposal does not proceed. I have watched teams build the inverse, where the gate only stops things it was told to stop, and every new action type ships wide open until someone notices, all because silence was allowed to mean yes.

What the policy actually checks

The policy is where your real rules live, and they are dumber than people expect. Thresholds, allowlists, rate limits. You do not need a model to decide whether to escalate; you need a model to do the messy work of reading invoices, and arithmetic to decide whether a human looks at the result.

def evaluate(self, action, proposal, ctx):
    # Small, reversible-ish amounts the agent has earned the right to auto-run.
    if proposal.amount <= self.auto_limit and ctx.refunds_today < self.daily_cap:
        return Decision("auto_approved")

    # Anything above the line, or anything touching an account that's
    # been flagged, goes to a person. This is not the agent's call.
    if proposal.amount > self.auto_limit or ctx.account_flagged:
        return Decision("needs_human", review_id=open_review(proposal))

    # Tripped a rate limit? Stop. A correct agent doesn't issue forty
    # refunds in a minute; a looping one does, and that's exactly when
    # you want the brakes to be mechanical, not advisory.
    return Decision("denied", reason="rate_limit")

The auto-approve limit starts at zero. On day one every proposal goes to a human, and you watch. When you have a few hundred proposals the policy auto-approved in a dry run and a human agreed with every one, you raise the limit a notch. Earning a click of the dial is exactly this: a stack of decisions you checked, not a vibe that the agent “seems reliable now.”

Dry-run is not a testing afterthought

Every tool that has an effect gets a simulate mode wired in from the start, not bolted on later. In simulate mode the proposal runs the full path, policy and all, and the commit step returns what it would have done instead of doing it. This is how you let an agent run a thousand times against real data and produce zero real effects, which is the only honest way to measure it before you trust it.

def _commit(action, proposal, ctx):
    if ctx.mode == "simulate":
        # Full path, real validation, real policy, no money moves.
        # The output is identical in shape to a real commit so the
        # agent can't tell the difference and behave differently.
        return SimulatedResult(action, proposal, would_have="committed")
    return effects[action](proposal)  # the real, irreversible thing

This works only because the agent gets back a result identical in shape to the real one. If simulate returns a different structure, the agent learns to behave differently in simulation than in production, and your whole evaluation is measuring a system you will never actually run. (I have made this mistake: the agent was a model citizen in dry-run and a menace the day we flipped the flag, because dry-run had been an easier game all along.)

The audit trail is the product

The wrapper calls audit.record before any branch decides anything, and the order matters. You log the proposal and the decision first, then you act, so that if the act crashes mid-flight you still have a record of intent. An audit trail that only captures successes goes dark exactly when you need it.

What goes in it is every tool call, with its inputs, the policy decision, who or what approved it, and the idempotency key that ties a proposal to its eventual effect. When something goes wrong at two in the morning, and it will, the question is always the same: what did the agent try to do, what stopped it or didn’t, and why. If you cannot answer those three from the log in one pass, the log is decoration and you are flying on hope.

The trail is also what lets you raise the dial honestly. Every limit I have ever relaxed, I relaxed because the log told me a story I could defend to someone whose money was on the line. Reads first, then proposals, then small auto-approved actions, then larger ones, each step backed by a pile of recorded decisions that went the way they should have.

An agent that can read everything, propose anything, and commit nothing without clearing a gate is the only kind I would put near a ledger, and nothing about that feels crippled to me. The autonomy comes later, one click at a time, each click with a paper trail behind it. I do not turn the dial past the last click I can still point at the evidence for; once I cannot, it has already gone too far.