Ryan de Melo

Your RAG Is Bad Because Your Chunking Is Bad

A year ago I wrote that bad answers are a retrieval problem wearing a model costume. I stand by it more now than I did then. What I got wrong was thinking retrieval was mostly about the search. It is mostly about what you put in the index, and most of you are putting in garbage.

I have spent the last year watching teams ship RAG, get mediocre answers, and reach for the same lever every time: a bigger model. GPT-4 instead of 3.5. A reranker. A fancier prompt with seven instructions about citing sources. None of it moves the needle much, because the model is reading chunks that were never going to answer the question. You can’t reason your way out of a context window full of half-sentences and orphaned table rows.

Here is the part nobody tells you. The single biggest quality jump I have shipped in the last year was not a model swap or a reranker. It was deleting my character-count splitter and writing one that respected the document.

What naive chunking actually does to a document

Take a typical internal doc. A policy page with an H2 heading, three paragraphs under it, then a table of fee tiers. Now run the splitter every tutorial hands you: fixed 1,000-character windows, maybe 200 of overlap.

Watch what happens. The heading “Refund eligibility by tier” lands at the tail of chunk three, and the table it introduces starts in chunk four with no heading attached. The table itself gets sliced between rows, so chunk four ends with “Gold | 14 days |” and chunk five opens with “30 days | no fee”, its column names stranded back in chunk four. Someone asks “how long do Gold customers have to request a refund,” and the embedding for that question matches the heading chunk, which no longer contains the answer. The model then confidently tells them something it half-invented.

That is not a model failure. You handed it a riddle.
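To see the damage concretely, here is a minimal sketch of the fixed-window splitter described above, run on a made-up policy page. The sample text, the name `naive_split`, and the tiny window size (picked so the effect shows on a short sample) are all assumptions for illustration, not anything from a real pipeline.

```python
def naive_split(text, size=1000, overlap=200):
    """Fixed-size character windows with overlap, blind to structure."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


# Hypothetical policy page: a heading, a paragraph, then a fee table.
DOC = (
    "## Refund eligibility by tier\n"
    "\n"
    "Refund windows depend on the customer's tier at purchase time.\n"
    "\n"
    "| Tier | Window | Fee |\n"
    "| Silver | 7 days | 5% |\n"
    "| Gold | 14 days | no fee |\n"
    "| Platinum | 30 days | no fee |\n"
)

# A tiny window so the shredding is visible on a short sample.
chunks = naive_split(DOC, size=80, overlap=10)

for n, chunk in enumerate(chunks):
    print(f"--- chunk {n} ---")
    print(chunk)

# Count chunks that contain table rows but not the column names.
orphaned = [c for c in chunks if "days" in c and "| Tier | Window | Fee |" not in c]
print(f"{len(orphaned)} of {len(chunks)} chunks hold rows with no column header")
```

Print the chunks and read the last one cold: it has the Gold and Platinum rows and nothing that says what the columns mean.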

Fix one: split on structure, then size

I split on the document’s own boundaries first, and only pack to a size budget second. Headings stay attached to the section they head. Tables stay whole. List items don’t get cut mid-thought. This is the same instinct as the hand-rolled splitter from last year, but a year of production taught me the boundaries matter more than the byte budget, so I split harder on structure and I let chunks vary in size.

def structure_split(doc, soft_cap=1200, hard_cap=2000):
    # doc is already parsed into typed blocks: ("heading", text, level),
    # ("para", text), ("table", rows), ("list", items). Parse FIRST.
    # I would rather have a 1800-char chunk that holds a whole table
    # than a tidy 1000-char one that cuts the table in half.
    chunks, buf, size, current_heading = [], [], 0, None

    def with_heading(body):
        # prepend the section heading to every chunk under it, so the
        # heading's meaning rides along even if it's three chunks deep
        return f"{current_heading}\n\n{body}" if current_heading else body

    def flush():
        # emit whatever is buffered as one chunk, then reset the buffer
        nonlocal buf, size
        if buf:
            chunks.append(with_heading("\n".join(buf)))
        buf, size = [], 0

    for block in doc:
        kind = block[0]
        if kind == "heading":
            flush()
            current_heading = block[1]
            continue
        rendered = render_block(block)  # helper not shown: tables -> markdown, lists -> bullets
        # a table or an oversized block goes in whole as its own chunk,
        # even past the soft cap; hard_cap is never used to cut a block
        if kind == "table" or len(rendered) > hard_cap:
            flush()
            chunks.append(with_heading(rendered))
            continue
        if buf and size + len(rendered) > soft_cap:
            flush()
        buf.append(rendered)
        size += len(rendered)
    flush()
    return chunks

The two things that earned their keep here: every chunk carries its section heading, and tables go in as one unit. Before this, “how long do Gold customers have to request a refund” pulled back the heading chunk and a stranded row, and the answer was wrong about half the time on my eval set. After, the same question pulls one chunk that holds the heading and the full fee table, and it answers correctly nearly every time. Same model. Same prompt. Same embeddings. I just stopped shredding the documents.
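The splitter calls `render_block`, which the post never defines. One way it might look, as a sketch only: it assumes table blocks carry a list of rows with the header row first, and list blocks carry a list of strings. Both shapes are my guess at the block format, not something the splitter pins down.

```python
def render_block(block):
    """Render one typed block to plain text (hypothetical helper)."""
    kind = block[0]
    if kind == "table":
        header, *rows = block[1]  # assumes the first row is the header
        lines = [
            "| " + " | ".join(str(c) for c in header) + " |",
            "| " + " | ".join("---" for _ in header) + " |",
        ]
        lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
        return "\n".join(lines)
    if kind == "list":
        return "\n".join(f"- {item}" for item in block[1])
    return block[1]  # "para" and anything else: the text as-is


table = ("table", [["Tier", "Window", "Fee"],
                   ["Gold", "14 days", "no fee"]])
print(render_block(table))
```

Rendering the table as markdown keeps the column names in the same chunk as every row, which is the whole point of keeping tables whole.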

Fix two: metadata is for filtering, not decoration

Most teams store a chunk and its vector and nothing else. Then they wonder why a question about the 2024 fee schedule pulls back a chunk from the 2021 one that happens to embed close. Embeddings don’t know what year it is. They don’t know which product line, which region, which document is superseded. You have to tell them, and the place to tell them is metadata you can filter on before the vector search ever runs.

def index_chunk(text, source_doc, heading=None, parent_id=None):
    # heading and parent_id belong to the chunk, not the document, so the
    # splitter passes them in; retrieve_parents reads parent_id back later
    embed_and_store(
        text=text,
        vector=embed(text),
        meta={
            "doc_id":      source_doc.id,
            "doc_type":    source_doc.type,          # "policy", "runbook", "faq"
            "product":     source_doc.product,
            "region":      source_doc.region,        # filter SG users off ID docs
            "effective":   source_doc.effective_date,
            "superseded":  source_doc.superseded,    # never retrieve a dead version
            "heading":     heading,
            "parent_id":   parent_id,
        },
    )

# query time: cut the haystack BEFORE you search it
def retrieve(question, ctx, k=6):
    pre = {"superseded": False, "region": ctx.region}
    if ctx.product:
        pre["product"] = ctx.product
    return index.search(embed(question), k=k, where=pre)

A metadata filter is the cheapest retrieval win in RAG and almost nobody bothers. Cutting superseded documents alone removed a whole class of confidently-wrong answers, the kind where the tool cites a real document that stopped being true eighteen months ago. (The effective date also lets me sort ties toward the newer doc, which matters more than it sounds.)
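The `index.search(..., where=pre)` call above stands in for whatever filter syntax your vector store offers. To show the idea without tying it to a product, here is a rough in-memory sketch. The record layout, the name `filtered_search`, and the hand-written two-dimensional vectors are all assumptions; a real store does the same thing with real embeddings and an index.

```python
import math
from datetime import date


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(records, query_vec, k, where):
    """Drop records that fail the metadata filter, then rank the survivors."""
    candidates = [
        r for r in records
        if all(r["meta"].get(key) == val for key, val in where.items())
    ]
    # Rank by similarity; on a tie, prefer the newer effective date.
    candidates.sort(
        key=lambda r: (cosine(query_vec, r["vector"]), r["meta"]["effective"]),
        reverse=True,
    )
    return candidates[:k]


# Toy records: 2-d vectors standing in for real embeddings.
records = [
    {"text": "2021 fee schedule", "vector": [1.0, 0.0],
     "meta": {"superseded": True, "region": "SG", "effective": date(2021, 1, 1)}},
    {"text": "2024 fee schedule", "vector": [0.9, 0.1],
     "meta": {"superseded": False, "region": "SG", "effective": date(2024, 1, 1)}},
    {"text": "2024 fee schedule (ID)", "vector": [0.9, 0.1],
     "meta": {"superseded": False, "region": "ID", "effective": date(2024, 1, 1)}},
]

hits = filtered_search(records, [1.0, 0.0], k=2,
                       where={"superseded": False, "region": "SG"})
print([h["text"] for h in hits])
```

In this toy data the superseded 2021 record is the closest vector to the query, so without the filter it would rank first. With the filter it never enters the candidate set.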

Fix three: parent/child retrieval

There is a real tension in chunk size. Small chunks embed precisely, because a tight chunk about one thing matches a question about that thing cleanly. Big chunks give the model room to actually reason, because the answer usually needs the surrounding paragraph, not just the matching sentence. You want both, and you can have both.

Embed small. Return big. Index a precise child chunk for matching, but keep a pointer to its parent section, and when a child hits, hand the model the parent.

def retrieve_parents(question, ctx, k=8, max_parents=4):
    children = retrieve(question, ctx, k=k)
    # dedupe up to parents: five matching children from one section
    # should give the model that section once, not five overlapping slivers
    parent_ids, seen = [], set()
    for c in children:
        pid = c.meta["parent_id"]
        if pid not in seen:
            seen.add(pid)
            parent_ids.append(pid)
    # assuming retrieve returns children best-first, the first
    # max_parents ids are the top-ranked sections
    return [parent_store[pid] for pid in parent_ids[:max_parents]]

This fixed a failure I kept seeing where the right answer existed but the model couldn’t assemble it, because the matching child chunk had the fact but not the qualifying sentence next to it. “Refunds within 14 days” is true and useless without the next line that says “for unused services only.” The child matched. The parent carried the caveat.
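The retrieval side only works if the index was built with the parent pointer in place. A minimal sketch of that build step, assuming sections arrive as (heading, body) pairs and that one sentence per child is fine-grained enough. The name `build_parent_child`, the `sec-N` ids, and the regex sentence split are illustrative choices, not the production code.

```python
import re


def build_parent_child(sections):
    """sections: list of (heading, body) pairs. Returns (parent_store, children)."""
    parent_store, children = {}, []
    for n, (heading, body) in enumerate(sections):
        parent_id = f"sec-{n}"
        parent_store[parent_id] = f"{heading}\n\n{body}"
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", body.strip()):
            if sentence:
                children.append({
                    "text": f"{heading}: {sentence}",
                    "meta": {"parent_id": parent_id},
                })
    return parent_store, children


sections = [("Refund eligibility",
             "Refunds within 14 days. For unused services only.")]
parent_store, children = build_parent_child(sections)

match = children[0]  # pretend this child won the vector search
print(match["text"])
print(parent_store[match["meta"]["parent_id"]])
```

The child that matches carries only the 14-day fact. Following its `parent_id` brings back the section with the caveat attached.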

The boring parts that actually moved quality

None of this is clever. Parse the document into real blocks before you split, so the splitter knows what a heading and a table are (this is most of the work, and it is genuinely tedious). Keep tables whole. Glue headings to their sections. Store metadata you can filter on. Embed small, return big. That’s the list.
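For a sense of what “parse first” means, here is a rough sketch that turns a small markdown subset into the typed blocks the splitter expects. It is an assumption-heavy toy: it only knows `#` headings, pipe tables, `-` or `*` bullets, and paragraphs, and the name `parse_blocks` is mine. Real documents (HTML, PDF, Word) need a real parser, which is where the tedium lives.

```python
def parse_blocks(markdown):
    """Parse a small markdown subset into typed blocks (a sketch, not a full parser)."""
    blocks, para, table, items = [], [], [], []

    def close_open_blocks():
        # Emit whichever multi-line block is in progress, then reset it.
        if para:
            blocks.append(("para", " ".join(para)))
            para.clear()
        if table:
            blocks.append(("table", list(table)))
            table.clear()
        if items:
            blocks.append(("list", list(items)))
            items.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped:
            close_open_blocks()
        elif stripped.startswith("#"):
            close_open_blocks()
            level = len(stripped) - len(stripped.lstrip("#"))
            blocks.append(("heading", stripped.lstrip("#").strip(), level))
        elif stripped.startswith("|"):
            if para or items:
                close_open_blocks()
            cells = [c.strip() for c in stripped.strip("|").split("|")]
            # Skip the |---|---| separator row under the header.
            if not all(set(c) <= set("-: ") for c in cells):
                table.append(cells)
        elif stripped.startswith(("- ", "* ")):
            if para or table:
                close_open_blocks()
            items.append(stripped[2:].strip())
        else:
            if table or items:
                close_open_blocks()
            para.append(stripped)
    close_open_blocks()
    return blocks


sample = """## Refund eligibility by tier

Refund windows depend on tier.

| Tier | Window |
| --- | --- |
| Gold | 14 days |

- Unused services only
- Request in writing
"""

for block in parse_blocks(sample):
    print(block)
```

Once the document is in this shape, the splitter can see where a table starts and ends instead of guessing from character counts.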

I spent the early part of last year assuming the next model would fix my retrieval. It didn’t, and it won’t, because the next model is still going to read whatever you put in front of it. The leverage is upstream, in the unglamorous parsing and splitting nobody demos. Bigger models made bad chunking cheaper to tolerate. They did not make it stop being the reason your RAG is bad.

So before you open the pricing page to budget for the bigger model, go read ten chunks your system actually retrieved last week. Read them the way the model has to, with no memory of the document they came from. If you can’t answer the question from those ten chunks, neither can it. Fix that first.
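If you want a number to go with the reading, a crude check is whether the expected answer text shows up anywhere in what was retrieved. This is a sketch under assumptions: the case format, the name `answerable_rate`, and the sample data are invented here, and substring matching is a blunt proxy that does not replace reading the chunks yourself.

```python
def answerable_rate(cases):
    """cases: dicts with 'question', 'expected', and 'chunks' (retrieved texts)."""
    misses = []
    for case in cases:
        haystack = " ".join(case["chunks"]).lower()
        if case["expected"].lower() not in haystack:
            misses.append(case["question"])
    hit_rate = 1 - len(misses) / len(cases) if cases else 0.0
    return hit_rate, misses


# Hypothetical log of what retrieval returned for two questions.
cases = [
    {"question": "How long do Gold customers have to request a refund?",
     "expected": "14 days",
     "chunks": ["Refund eligibility by tier\n\n| Gold | 14 days | no fee |"]},
    {"question": "Is there a fee for Platinum refunds?",
     "expected": "no fee",
     "chunks": ["Refund eligibility by tier", "| Platinum | 30 days |"]},
]

hit_rate, misses = answerable_rate(cases)
print(f"answerable from retrieved chunks: {hit_rate:.0%}")
for question in misses:
    print("could not answer:", question)
```

Every question in the miss list is one the model had no chance on, whatever model it is.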

