A finance team I worked with had receivables in the tens of millions sitting in a state nobody wanted to name. Not bad debt. Not disputed. Just stuck. The cash had arrived, or some of it had, and nobody could line up which payment settled which invoice without a person opening six tabs and reading a remittance advice that a human in another company had typed into the body of an email.
That is the actual problem. Not the model, not the agents, not the retrieval. The problem is that the truth about a single payment is scattered across an invoice in one system, a contract clause in a PDF, a bank remittance file in a fixed-width format from 1998, and a one-line email that says “paid the March stuff less the credit, see attached.” Reconciliation is a reading comprehension task that finance has been doing by hand because the data was never clean enough to automate the dumb way.
So we built something to read it. RAG over the documents, anomaly detection to find the mismatches, an agent to propose a resolution and draft the cash-flow note, and a human who approves anything that touches money. The interesting parts are not the ones that sound interesting.
The shape of it
Everything before the approval gate is reversible and cheap. Everything after it moves money, so a person signs.
Two ledgers disagree: what we invoiced and what the bank says cleared. A normalization job lifts every payment, invoice, contract, and remittance email into one store with consistent keys. A matcher tries the easy cases first (exact reference, exact amount) and routes the rest to retrieval. For a stuck item, the agent pulls every document that plausibly touches that customer and that amount, reasons about what reconciles to what, and writes a proposed resolution plus a plain-language narration of the cash effect. Then it stops. A human reads the proposal, sees the evidence inline, and approves or rejects. Only on approval does anything post to the ledger.
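The easy-cases-first routing can be sketched in a few lines. This is a minimal illustration, not the matcher we ran: the `Invoice` and `Payment` shapes, their field names, and the `route` function are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Invoice:
    invoice_id: str
    customer_id: str
    amount_cents: int


@dataclass(frozen=True)
class Payment:
    payment_id: str
    customer_id: str
    amount_cents: int
    reference: Optional[str]  # whatever the customer typed on the wire


def route(payment, open_invoices):
    """Return ("matched", invoice) for the easy case, else ("needs_retrieval", None)."""
    candidates = [
        inv for inv in open_invoices
        if inv.customer_id == payment.customer_id
        and inv.invoice_id == payment.reference
        and inv.amount_cents == payment.amount_cents
    ]
    # Auto-match only when exactly one invoice agrees on reference AND amount.
    if len(candidates) == 1:
        return "matched", candidates[0]
    return "needs_retrieval", None


invoices = [
    Invoice("INV-1001", "cust-7", 125_000),
    Invoice("INV-1002", "cust-7", 48_050),
]
easy = Payment("pay-1", "cust-7", 125_000, "INV-1001")
stuck = Payment("pay-2", "cust-7", 170_000, "march stuff less credit")
```

Here `route(easy, invoices)` matches the first invoice directly, while `stuck` falls through to retrieval because neither its reference nor its amount lines up with a single invoice.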
The agent never moves money. It writes a recommendation and shows its work. That single rule decided most of the architecture.
What was genuinely hard
The remittance data. I cannot overstate this. The model that reads a contract clause and decides whether a two percent early-payment discount applies is doing something almost easy compared to parsing what a customer’s accounts-payable clerk actually sent us. We saw remittance advice as PDF tables, as scanned images of PDF tables, as Excel attachments with merged cells, as plain text in an email body, and in one memorable case as a photo of a printout taken at an angle. The information was all there. It was just never in the same place twice.
RAG earned its keep here, but not the textbook version. Chunking a contract is fine. Chunking a remittance email is a trap, because the meaning lives in the relationship between the line items, and if you split them you lose the thing you came for. So remittances do not get chunked. Each one becomes a single structured record with the raw text attached, and retrieval pulls the whole record or none of it. The retrieval index is over the messy stuff (emails, contract clauses, dispute notes), and the structured ledgers stay in Postgres where joins and money belong. Embeddings are for finding the relevant paragraph. They are not for deciding what equals what. The moment a number matters, it comes from the database, not the vector.
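A rough sketch of the "whole record or none of it" rule, under stated assumptions: `RemittanceRecord`, `RemittanceLine`, `index_text`, and `amounts_from_ledger` are hypothetical names, and a plain dict stands in for the Postgres lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemittanceLine:
    reference: str  # reference as the customer wrote it, unverified
    note: str       # free text, e.g. "less credit"


@dataclass(frozen=True)
class RemittanceRecord:
    record_id: str
    customer_id: str
    raw_text: str                      # kept verbatim so the approver can read it
    lines: tuple[RemittanceLine, ...]  # line items stay together, never split

    def index_text(self) -> str:
        """The single blob that gets embedded: the whole remittance or nothing."""
        return "\n".join(
            [self.raw_text] + [f"{ln.reference} {ln.note}" for ln in self.lines]
        )


def amounts_from_ledger(record, ledger_cents):
    """Amounts come from the ledger lookup, never from retrieved text.

    `ledger_cents` is a dict standing in for a database query. References the
    ledger does not know map to None and are left for the agent or a person.
    """
    return {ln.reference: ledger_cents.get(ln.reference) for ln in record.lines}


record = RemittanceRecord(
    record_id="rem-1",
    customer_id="cust-7",
    raw_text="paid the March stuff less the credit, see attached",
    lines=(
        RemittanceLine("INV-1001", "paid in full"),
        RemittanceLine("PO 4471", "less credit"),
    ),
)
ledger_cents = {"INV-1001": 125_000, "INV-1002": 48_050}
amounts = amounts_from_ledger(record, ledger_cents)
```

The embedding only ever sees `index_text()`. The number attached to `INV-1001` comes from the ledger, and the unrecognized `PO 4471` comes back as `None` instead of a guess.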
The second hard thing was matching across systems that were never designed to be matched. The invoice ID in our ledger is not the reference the customer put on their wire. The customer batched four invoices into one payment, applied a credit against one of them, and short-paid another over a delivery dispute nobody logged. A naive matcher gives up. The agent’s job is to hold the partial evidence and propose the most likely allocation, with a confidence and a reason, the way a senior person in the finance team would. The win was not getting it right every time. The win was getting the obvious cases right cheaply and handing the human a clean starting point on the hard ones instead of a blank screen.
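To make the allocation problem concrete, here is a brute-force stand-in for the arithmetic half of it. It is not the agent and not our production logic: `candidate_allocations`, its arguments, and the ranking rule are assumptions for illustration, and the exhaustive search is only sensible for a handful of open invoices.

```python
from itertools import combinations


def candidate_allocations(payment_cents, invoices, credits_cents):
    """Enumerate ways a single payment could cover a set of open invoices.

    invoices: dict of invoice_id -> open amount in cents
    credits_cents: known credit amounts the customer might have netted off
    Returns (invoice_ids, credit_cents, shortfall_cents) tuples, best first.
    """
    results = []
    ids = sorted(invoices)
    for size in range(1, len(ids) + 1):
        for combo in combinations(ids, size):
            billed = sum(invoices[i] for i in combo)
            for credit in [0, *credits_cents]:
                shortfall = billed - credit - payment_cents
                # Skip allocations that would leave unexplained extra cash.
                if shortfall >= 0:
                    results.append((combo, credit, shortfall))
    # Prefer the smallest unexplained shortfall, then the fewest moving parts.
    results.sort(key=lambda c: (c[2], c[1], len(c[0])))
    return results


open_invoices = {
    "INV-1": 100_000,
    "INV-2": 50_000,
    "INV-3": 75_000,
    "INV-4": 20_000,
}
candidates = candidate_allocations(240_000, open_invoices, credits_cents=[5_000])
best = candidates[0]
```

With four invoices totalling 245,000 cents and a known 5,000-cent credit, the top candidate applies the payment across all four with the credit and no shortfall. The runner-up is the same four invoices with no credit and a 5,000-cent short-pay, which is exactly the kind of choice the approver should see side by side.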
Here is the part nobody tells you. The hard cases are not where the value is. The value is the long tail of boring matches that were stuck only because no human had time to open the tabs. The agent clears those in bulk. The genuinely ambiguous allocations still go to a person, and they should, because those are the ones where being confidently wrong costs you a customer.
What was boring but essential
Idempotency. Every proposal carries a deterministic key derived from the items it touches: the customer, the invoice IDs, the payment ID, the resolution type, and the amount. Run the agent twice on the same stuck item and you get the same proposal key, and the system refuses to create a second one. This sounds like a footnote. It is the thing that lets you re-run the whole pipeline after a crash, or after you improved a prompt, without doubling up on resolutions or, worse, double-posting to a ledger. An agent that loops and calls tools will get interrupted. Plan for the retry, not the happy path.
    import hashlib

    # `agent` and `store` are application objects defined elsewhere.

    def resolution_key(proposal):
        # Same stuck item + same fix must always hash the same, so a retry
        # (or a re-run after we tweak the prompt) can't create a second
        # proposal or, god forbid, post the cash twice.
        parts = (
            proposal.customer_id,
            tuple(sorted(proposal.invoice_ids)),
            proposal.payment_id,
            proposal.resolution_type,      # apply | short_pay | credit | writeoff
            round(proposal.amount_cents),  # cents, never floats near money
        )
        return hashlib.sha256(repr(parts).encode()).hexdigest()


    def propose(stuck_item, evidence):
        draft = agent.draft_resolution(stuck_item, evidence)
        key = resolution_key(draft)
        existing = store.get_proposal(key)
        if existing:
            return existing  # already proposed; do not duplicate
        return store.put_proposal(key, draft, status="awaiting_approval")
The other boring essential is the audit trail, and in regulated finance it is not optional. Every proposal records exactly which documents the agent retrieved, the matcher’s score, the model’s stated reasoning, who approved it, and when. Not a log line. A first-class record you can put in front of an auditor and a regulator. When the agent proposes applying a payment across three invoices with a credit, the approver sees the remittance email, the contract clause that justifies the credit, and the model’s allocation, all on one screen. The human is not rubber-stamping a black box. They are checking a worked answer against the evidence the system already pulled.
That is the whole design philosophy, really. The agent does the reading and the arithmetic of finding candidates. The human keeps the judgment and the signature. Drawing that line in the right place is most of the work.
    # `store`, `ledger`, `audit`, and the two exceptions are defined elsewhere.

    def post_resolution(key, approver):
        proposal = store.get_proposal(key)
        if proposal.status != "awaiting_approval":
            raise StaleProposal(key)  # someone got here first
        if not approver.can_post(proposal.amount_cents):
            raise NeedsHigherApproval(proposal.amount_cents)
        with ledger.transaction() as txn:
            txn.apply(proposal.entries, idempotency_key=key)  # key again here
            audit.record(
                proposal=proposal,
                retrieved_docs=proposal.evidence_ids,  # what the agent read
                model_reasoning=proposal.reasoning,
                approved_by=approver.id,
            )
        store.mark_posted(key)
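What passing the key to the ledger is meant to buy can be shown with an in-memory stand-in. `InMemoryLedger` and its `apply` method are hypothetical and exist only for this sketch. A real ledger would more likely enforce the same rule with a unique constraint on the key inside the posting transaction.

```python
class InMemoryLedger:
    """Toy ledger: a posting whose key has already been seen is a no-op."""

    def __init__(self):
        self.entries = []        # posted (account, amount_cents) pairs
        self._seen_keys = set()  # idempotency keys already applied

    def apply(self, entries, idempotency_key):
        if idempotency_key in self._seen_keys:
            return False         # retry after a crash: nothing posts twice
        self.entries.extend(entries)
        self._seen_keys.add(idempotency_key)
        return True


ledger = InMemoryLedger()
entries = [("cash", 240_000), ("accounts_receivable", -240_000)]
first = ledger.apply(entries, idempotency_key="abc123")
retry = ledger.apply(entries, idempotency_key="abc123")
```

The second call returns `False` and leaves the ledger with the original two entries, which is the behavior a retried approval depends on.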
What I would tell someone building this
Do not start with the agent. Start with the boring half. Get the normalization, the idempotency keys, and the audit record right while the matcher is still dumb if-statements, because those are the parts that are expensive to retrofit and impossible to fake in front of a regulator. The agent is the easy thing to add last, on top of a system that already cannot double-post and already remembers everything it did.
And resist the urge to let the agent close the loop. Every quarter someone asks why a person still has to approve, since the agent is right most of the time. Most of the time is exactly the problem. The cases where it is wrong are the cases where money goes to the wrong place and a customer relationship pays for it. The human in the loop is not there because the agent is bad. The human is there because the downside is asymmetric, and no eval number I can show you changes that math.