The demo took an afternoon. The production system took four months. That gap is the whole story of retrieval-augmented generation in 2023, and most of the people posting screenshots have not crossed it yet.
I have been building an internal answer tool over a few hundred thousand documents: policies, runbooks, support history, the institutional memory that usually lives in three people’s heads. The pitch writes itself. Ask a question in plain language, get an answer grounded in the actual documents, with citations. Everyone nods in the room. Then you go build it and discover that the language model was never the hard part.
Here is the part nobody tells you. When the answers are wrong, the instinct is to blame the model, swap in the bigger one, and wait for the bill. Nine times out of ten the model did its job. It answered the question using the context it was handed, and the context was garbage. Bad answers are a retrieval problem wearing a model costume.
The shape of it
There is no framework I trust enough yet to hide this from me, so I wrote it out by hand. Two paths that share one index.
The offline path ingests documents, cleans them, splits them into chunks, embeds each chunk, and writes the vectors to a store. The online path embeds the incoming question, pulls the nearest chunks, stuffs them into a prompt, and asks the model to answer using only what it was given. Both paths embed text the same way and point at the same index. If they ever drift apart, retrieval quietly rots and you find out from an angry user, not a test.
Chunking is the whole game
The naive move is to split every document into fixed 1,000-character windows and call it done. It works in the demo. It falls apart on real documents, because a fixed window cuts sentences in half, separates a heading from the thing it heads, and strands a table’s numbers from the column names that give them meaning.
What worked was splitting on structure first and size second. Respect the document’s own boundaries, then pack up to a budget, and let consecutive chunks overlap a little so an idea that straddles a boundary survives in at least one piece.
def chunk(blocks, target=900, overlap=150):
# blocks = the document already split on its own structure:
# headings, paragraphs, list items. We pack those, we don't shred them.
chunks, buf, size = [], [], 0
for block in blocks:
if size + len(block) > target and buf:
chunks.append(" ".join(buf))
# carry the tail forward so a thought spanning a boundary
# still lands intact in the next chunk
tail, kept = [], 0
for b in reversed(buf):
if kept + len(b) > overlap:
break
tail.insert(0, b); kept += len(b)
buf, size = tail, kept
buf.append(block); size += len(block)
if buf:
chunks.append(" ".join(buf))
return chunks
That overlap looks like a rounding detail. It was worth more to answer quality than any prompt I wrote. (I spent two weeks tuning prompts before I admitted the prompts were fine.)
Retrieval, and the thing nearest-neighbor won’t tell you
Embed the chunks, store the vectors, and at query time pull the closest ones by cosine similarity. Standard. The trap is treating the top-k list as truth. Vector search always returns something. Ask it about a topic you have no documents on and it will hand back your five least-irrelevant chunks with total confidence, and the model will dutifully write an answer on top of nothing.
So I keep the similarity scores and refuse to pass context that clears no bar.
def retrieve(question, k=6, floor=0.78):
q = embed(question)
hits = index.search(q, k=k) # [(chunk_id, score), ...]
keep = [h for h in hits if h.score >= floor]
if not keep:
# better to say we don't know than to ground a confident answer
# in our five least-irrelevant paragraphs
return None
return [store[h.chunk_id] for h in keep]
When retrieve returns nothing, the tool says it has no good source and stops. Users forgive “I don’t have a confident answer for that.” They do not forgive a fluent, cited, completely wrong one. The second kind erodes trust in everything else the tool says, including the answers that were right.
What I would tell someone starting today
You do not need a dedicated vector database to begin. A few hundred thousand chunks sit comfortably in Postgres with a vector extension, and staying there meant one fewer system to operate while I was still learning what the workload even was. Move when the numbers tell you to move, not when a launch post does.
Spend your time on evals before you spend it on models. I kept a few hundred real questions with answers I trusted, and every change ran against them. Without that, “it feels better” is the only feedback you get, and it lies. With it, I could see a chunking tweak lift answer quality more than swapping to the newest model did, at none of the cost.
And write it by hand the first time. The frameworks will get good. When they do, you will actually understand what they are doing for you, and more usefully, what they are quietly doing to you.