Building a RAG Pipeline Before LangChain Was Cool

Retrieval, not the model, decides whether a RAG system is any good. The demo took an afternoon; the production system took four months, and that gap is the whole story of retrieval-augmented generation in 2023. Most of the people posting screenshots have not crossed it yet.

I have been building an internal answer tool over a few hundred thousand documents: policies, runbooks, support history, the institutional memory that usually lives in three people’s heads. The pitch writes itself: ask a question in plain language, get an answer grounded in the actual documents, with citations. Everyone in the room nods, and building it is where you find out the language model was never the hard part.

When the answers are wrong, the instinct is to blame the model, swap in the bigger one, and wait for the bill. Nine times out of ten the model did its job: it answered the question using the context it was handed, and the context was garbage. Bad answers are a retrieval problem wearing a model costume.

The shape of it

There is no framework I trust enough yet to hide this from me, so I wrote it out by hand: two paths that share one index.

[Figure: RAG pipeline, hand-rolled in 2023. An offline ingest job and an online query path share one vector store.]

The offline path ingests documents, cleans them, splits them into chunks, embeds each chunk, and writes the vectors to a store. The online path embeds the incoming question, pulls the nearest chunks, stuffs them into a prompt, and asks the model to answer using only what it was given. Both paths embed text the same way and point at the same index. If they ever drift apart, retrieval rots and you find out from an angry user, not a test.
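One way to make that drift fail loudly instead of silently is to stamp the index with a fingerprint of the embedding setup at ingest time and check it on the query path. This is a minimal sketch, not the code I shipped: EmbedConfig, fingerprint, and check_index are hypothetical names, and the fields are only my guess at what usually changes the vectors.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EmbedConfig:
    # Hypothetical fields: anything that changes the vectors belongs here.
    model: str
    dimensions: int
    normalize: bool
    preprocessing: str  # e.g. a version tag for the cleaning step


def fingerprint(config: EmbedConfig) -> str:
    """Stable short hash of the embedding setup, stored next to the index."""
    payload = json.dumps(asdict(config), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def check_index(index_fingerprint: str, query_config: EmbedConfig) -> None:
    """Raise before searching if the query path embeds differently from ingest."""
    current = fingerprint(query_config)
    if current != index_fingerprint:
        raise RuntimeError(
            f"embedding config mismatch: index={index_fingerprint} query={current}"
        )
```

The ingest job would write fingerprint(config) alongside the vectors, and the query service would call check_index at startup, so a mismatch shows up as a failed deploy rather than as worse answers.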

Chunking is the whole game

The naive move is to split every document into fixed 1,000-character windows and call it done. That works in the demo and falls apart on real documents, where a fixed window cuts sentences in half, separates a heading from the section it introduces, and strands a table’s numbers from the column names that give them meaning.
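To make the failure concrete, here is the naive splitter on a toy document. The function name and the sample text are made up for illustration, and the window is shrunk to 40 characters so the damage fits on one screen.

```python
def fixed_windows(text: str, size: int) -> list[str]:
    """Naive splitter: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


doc = "Refund policy\nRefunds are issued within 30 days. Contact support to begin."
pieces = fixed_windows(doc, 40)
# The first cut lands after "within ", so the heading and half a sentence
# end up in one piece and "30 days" starts the next piece with no context.
```

Nothing is lost, since the pieces still concatenate back to the original, but neither piece answers "how long do refunds take?" on its own.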

What worked was splitting on structure first and size second. Respect the document’s own boundaries, then pack up to a budget, and let consecutive chunks overlap a little so an idea that straddles a boundary survives in at least one piece.

def chunk(blocks, target=900, overlap=150):
    # blocks = the document already split on its own structure:
    # headings, paragraphs, list items. We pack those, we don't shred them.
    chunks, buf, size = [], [], 0
    for block in blocks:
        if size + len(block) > target and buf:
            chunks.append(" ".join(buf))
            # carry the tail forward so a thought spanning a boundary
            # still lands intact in the next chunk
            tail, kept = [], 0
            for b in reversed(buf):
                if kept + len(b) > overlap:
                    break
                tail.insert(0, b)
                kept += len(b)
            buf, size = tail, kept
        buf.append(block)
        size += len(block)
    if buf:
        # flush whatever is left after the last block
        chunks.append(" ".join(buf))
    return chunks

That overlap looks like a rounding detail, and it moved answer quality more than any prompt I ever wrote. (I spent two weeks tuning prompts before I admitted the prompts were fine.)
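The chunk function assumes someone has already split the document on its structure. A rough sketch of that step for plain text follows. It rests on two heuristics that are my assumptions here, not a general rule: blank lines separate blocks, and a single short line with no closing punctuation is a heading. The name split_blocks is hypothetical.

```python
import re


def split_blocks(text: str, heading_max: int = 80) -> list[str]:
    """Split plain text into structural blocks for a packer like chunk().

    Assumptions: blank lines separate blocks, and a single short line with
    no closing punctuation is a heading. A heading is glued to the block
    that follows it so the two can never land in different chunks.
    """
    raw = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    blocks, pending_heading = [], None
    for block in raw:
        is_heading = (
            "\n" not in block
            and len(block) <= heading_max
            and not block.endswith((".", "!", "?", ":"))
        )
        if is_heading:
            # stack consecutive headings until a body block arrives
            if pending_heading is None:
                pending_heading = block
            else:
                pending_heading = pending_heading + "\n" + block
            continue
        if pending_heading is not None:
            block = pending_heading + "\n" + block
            pending_heading = None
        blocks.append(block)
    if pending_heading is not None:
        blocks.append(pending_heading)  # trailing heading with no body
    return blocks
```

Real documents need a format-aware version of this (HTML headings, Markdown lists, table rows), but the contract stays the same: hand the packer units that already make sense on their own.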

Retrieval, and what nearest-neighbor won’t tell you

Embed the chunks, store the vectors, and at query time pull the closest ones by cosine similarity; that part is standard. The trap is treating the top-k list as truth, because vector search always returns something. Ask it about a topic you have no documents on and it will hand back your five least-irrelevant chunks with total confidence, and the model will dutifully write an answer on top of nothing.
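A toy version makes the trap visible. This is a brute-force sketch over made-up two-dimensional vectors, nothing like a production index, and the chunk ids are invented for the example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], vectors: dict[str, list[float]], k: int):
    """Score every stored vector against the query and return the k best."""
    scored = [(chunk_id, cosine(query, vec)) for chunk_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-d "embeddings"; real ones have hundreds of dimensions.
vectors = {
    "vpn-setup": [1.0, 0.0],
    "vpn-reset": [0.9, 0.1],
    "expenses": [0.8, 0.3],
}
off_topic = [0.0, 1.0]  # stands in for a question the corpus does not cover
hits = top_k(off_topic, vectors, k=2)
```

The search still returns two hits for the off-topic query. Only the scores, both far below anything you would call a match, reveal that nothing in the store is actually close.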

My defense is to keep the similarity scores and refuse to pass along any context that fails to clear a minimum score.

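def retrieve(question, k=6, floor=0.78):
    # embed, index, and store are the same embedder, vector index, and
    # chunk store that the ingest path writes to.
    q = embed(question)
    hits = index.search(q, k=k)              # each hit has .chunk_id and .score
    # score is cosine similarity, so higher means closer
    keep = [h for h in hits if h.score >= floor]
    if not keep:
        # better to say we don't know than to ground a confident answer
        # in our five least-irrelevant paragraphs
        return None
    return [store[h.chunk_id] for h in keep]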

When retrieve returns nothing, the tool says it has no good source and stops. Users forgive “I don’t have a confident answer for that,” but they will not forgive a fluent, cited, completely wrong one. That second kind erodes trust in everything else the tool says, including the answers that were right.
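Wired together, the online path is short. The sketch below is an illustration under assumptions: build_prompt and answer are hypothetical names, the prompt wording is mine, and retrieve and generate are passed in as plain callables so nothing here depends on a particular model API.

```python
from typing import Callable, Optional

NO_ANSWER = "I don't have a confident answer for that."


def build_prompt(question: str, chunks: list[str]) -> str:
    """Number the chunks so the model can cite them as [1], [2], ..."""
    sources = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "Answer the question using only the sources below. "
        "Cite sources by number. If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )


def answer(
    question: str,
    retrieve: Callable[[str], Optional[list[str]]],
    generate: Callable[[str], str],
) -> str:
    """Return a grounded answer, or a refusal when retrieval found nothing."""
    chunks = retrieve(question)
    if not chunks:
        # No chunk cleared the floor: stop here and never call the model.
        return NO_ANSWER
    return generate(build_prompt(question, chunks))
```

The refusal happens before the model is called, which also means an unanswerable question costs nothing in tokens.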

What I would tell someone starting today

You do not need a dedicated vector database to begin. A few hundred thousand chunks sit comfortably in Postgres with a vector extension, and staying there meant one fewer system to operate while I was still learning what the workload even was. Move when the numbers tell you to move, not when a launch post does.
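For a sense of how little that takes, here is a sketch assuming the extension is pgvector, which is my assumption for the example. The table name, column names, and the 768 dimension are hypothetical. The block only builds the SQL and its parameters; running it needs a database and a driver that accepts %(name)s placeholders, such as psycopg.

```python
# Assumes the pgvector extension; table and column names are hypothetical.
SCHEMA = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id        bigserial PRIMARY KEY,
    body      text NOT NULL,
    embedding vector(768)  -- assumed size; must match the embedder's output
);
"""

# <=> is pgvector's cosine distance operator, so similarity = 1 - distance.
NEAREST = """
SELECT id, body, 1 - (embedding <=> %(q)s::vector) AS score
FROM chunks
ORDER BY embedding <=> %(q)s::vector
LIMIT %(k)s;
"""


def to_vector_literal(values: list[float]) -> str:
    """Format a Python list in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def nearest_params(query_embedding: list[float], k: int = 6) -> dict:
    """Parameters for NEAREST: the query vector as text plus the row limit."""
    return {"q": to_vector_literal(query_embedding), "k": k}
```

The score column plays the same role as the similarity score in retrieve, so the floor check can stay in application code while Postgres does the ordering.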

Spend your time on evals before you spend it on models. I kept a few hundred real questions with answers I trusted, and every change ran against them. Without that harness, “it feels better” is the only feedback you get, and it lies; with it, I could watch a chunking tweak lift answer quality more than swapping to the newest model did, at none of the cost.
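The smallest useful harness checks retrieval alone: for each saved question, did the source a trusted answer depends on show up in what was retrieved? The sketch below is a proxy for answer quality rather than a measure of it, and it is an illustration, not the harness described above. EvalCase and retrieval_hit_rate are hypothetical names, and retrieve_ids stands in for whatever returns document ids for a question.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalCase:
    question: str
    expected_source: str  # id of the document a trusted answer relies on


def retrieval_hit_rate(
    cases: list[EvalCase],
    retrieve_ids: Callable[[str], Optional[list[str]]],
) -> float:
    """Fraction of questions whose expected source appears in the retrieved ids."""
    if not cases:
        return 0.0
    hits = 0
    for case in cases:
        found = retrieve_ids(case.question) or []  # None counts as a miss
        if case.expected_source in found:
            hits += 1
    return hits / len(cases)
```

Run it before and after every change to chunking, the floor, or the embedder, and keep the number next to the commit. A change that feels better but lowers the hit rate deserves a second look.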

Write it by hand the first time anyway. The frameworks will get good, and when they do, I will understand what they are doing for me and, more usefully, what they are quietly doing to me.