Why Your RAG System Works in the Demo and Fails in Production
The gap between a RAG demo that wows a room and a RAG system that survives real users is almost never the model. It's retrieval, evaluation, and the boring data plumbing nobody budgets for.
Every team we meet has the same story: a retrieval-augmented generation prototype that answered ten hand-picked questions flawlessly, then quietly fell apart the week it touched real users. The model didn't get dumber. The conditions changed, and the demo was never testing the things that actually break.
The demo is a rigged game
A demo is a curated environment. You picked the documents, you picked the questions, and you subconsciously phrased those questions using the same vocabulary that appears in the source text. Of course retrieval works. The embedding of "What is our refund window?" sits right next to a paragraph that literally says "refund window." Cosine similarity has an easy day.
Production is an adversarial environment by accident. Users ask "can I get my money back after 40 days" when the policy never uses the word "money" or the number 40. They paste error logs, they ask multi-part questions, they reference "the thing we discussed last quarter." The retrieval step that looked solid was solving a much easier problem than the one your users bring.
Retrieval is the bottleneck, not generation
Most teams obsess over the LLM and prompt, then treat retrieval as a solved library call. The opposite is true: retrieval is the part that needs the attention. If the right chunk isn't in the context window, no amount of prompt engineering or model upgrades will save you. The model can only reason over what you hand it, and a confident answer built on the wrong three chunks is worse than no answer.
The failure modes are specific and measurable. Watch for these:
- Chunking that splits a table from its header, or an answer from its qualifying sentence, so the retrieved fragment is technically relevant but semantically incomplete.
- Pure vector search missing exact-match terms — product SKUs, error codes, person names — where a keyword (BM25) index would have nailed it. Hybrid search is not optional for most real corpora.
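- Top-k set too low, so the answer sits in the chunk ranked sixth but only the top four reach the model.
- Embedding a 2000-token chunk and a 6-word query into the same space and expecting them to be directly comparable. A long chunk's vector blends many topics, so a short, specific query can score poorly against it even when the answer is inside.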
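One common way to combine a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below is a technique we are swapping in for illustration, not the only way to do hybrid search. It assumes you already have two ranked lists of chunk IDs; the IDs, the list contents, and the constant k=60 (a conventional default) are placeholders.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    so a chunk ranked well by either retriever rises toward the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical outputs of a vector index and a BM25 index for one query.
vector_hits = ["c7", "c2", "c9", "c4"]
keyword_hits = ["c4", "c7", "c1"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Chunks that both retrievers agree on (c7 and c4 here) end up ahead of chunks that only one of them found, which is usually the behavior you want when exact-match terms and paraphrases both matter.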
You shipped without an eval set, so you're flying blind
The single biggest difference between teams whose RAG works and teams whose RAG embarrasses them is whether they have a labeled evaluation set. Not vibes. A few hundred real questions, each paired with the documents that should answer it and a reference answer to grade against.
With that set you can measure the two things that matter separately: retrieval quality (did the right chunk make it into context, measured with recall@k and MRR) and answer quality (given the right context, did the model answer correctly and without inventing facts). When something regresses, you immediately know which half broke. Without it, every bug report is a guessing game and every "fix" is a coin flip that might make things worse somewhere you aren't looking.
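The retrieval half of that measurement is small enough to sketch. The code below assumes each eval example is a dict with a "question" and a list of "relevant_ids", and that your retriever is a function returning ranked chunk IDs. Those field names and the fake retriever are ours, for illustration only.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunk IDs found in the top-k retrieved."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(eval_set, retrieve, k=5):
    """Average recall@k and MRR of a retriever over a labeled eval set."""
    if not eval_set:
        return {"recall@k": 0.0, "mrr": 0.0}
    recalls, rrs = [], []
    for example in eval_set:
        retrieved = retrieve(example["question"])
        recalls.append(recall_at_k(retrieved, example["relevant_ids"], k))
        rrs.append(reciprocal_rank(retrieved, example["relevant_ids"]))
    n = len(eval_set)
    return {"recall@k": sum(recalls) / n, "mrr": sum(rrs) / n}


# Hypothetical eval set and a canned retriever standing in for the real one.
eval_set = [
    {"question": "q1", "relevant_ids": ["a"]},
    {"question": "q2", "relevant_ids": ["b", "c"]},
]
canned = {"q1": ["x", "a", "y"], "q2": ["b", "z", "w"]}

metrics = evaluate(eval_set, lambda q: canned[q], k=2)
print(metrics)
```

Run this on every retrieval change. Answer quality needs its own grading step on top, but a drop in recall@k tells you the problem is upstream of the model before you read a single generated answer.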
Hallucination is a retrieval symptom as often as a model one
Teams reach for "the model hallucinates" as a diagnosis when the real story is that retrieval returned weak or empty context and the model gamely filled the vacuum. An LLM handed irrelevant chunks will still try to be helpful — that helpfulness is your hallucination.
The fixes are mostly upstream of the model. Give it an explicit escape hatch: instruct it to answer only from the provided context and to say it doesn't know otherwise, and actually reward that behavior in your evals. Add a relevance gate that drops chunks below a similarity threshold, on the principle that passing nothing is better than passing garbage. And ground answers with citations back to source chunks, both so users can verify and so you can audit which retrievals led to which claims.
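Here is a minimal sketch of the gate, the escape hatch, and citation-friendly context in one place. It assumes similarity scores where higher means more relevant. The ScoredChunk type, the refusal wording, and the 0.75 threshold are all placeholders of ours; a real threshold should be calibrated against your eval set.

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    chunk_id: str
    text: str
    score: float  # similarity; higher is assumed to mean more relevant


REFUSAL = "I don't know based on the provided documents."


def gate(chunks, min_score=0.75):
    """Keep only chunks at or above the similarity threshold."""
    return [c for c in chunks if c.score >= min_score]


def build_prompt(question, chunks, min_score=0.75):
    """Return a grounded prompt, or None if no chunk passes the gate.

    On None, the caller should return REFUSAL without calling the model.
    """
    kept = gate(chunks, min_score)
    if not kept:
        return None
    # Prefix each chunk with its ID so the model can cite it.
    context = "\n\n".join(f"[{c.chunk_id}] {c.text}" for c in kept)
    return (
        "Answer using only the context below, and cite the bracketed chunk IDs "
        "that support each claim. If the context does not contain the answer, "
        f"reply exactly: {REFUSAL}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


chunks = [
    ScoredChunk("c1", "Refunds are accepted within 30 days of purchase.", 0.82),
    ScoredChunk("c2", "Our office is closed on public holidays.", 0.41),
]
prompt = build_prompt("Can I get my money back after 40 days?", chunks)
print(prompt)
```

Returning None when nothing passes the gate keeps the refusal deterministic: the model never sees an empty or irrelevant context, so it has nothing to fill in.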
The data is alive and your index is a snapshot
Demos use a frozen corpus. Production data mutates: docs get edited, policies change, new products launch, old pages get deprecated. The day your knowledge base says one thing and your vector index still remembers last month's version, your RAG system is confidently wrong with full citations — the worst possible failure because it looks trustworthy.
Treat the index as a pipeline, not a one-time job. You need incremental re-indexing on document change, a strategy for deletes (a removed doc must leave the index, not linger), and metadata like timestamps and source so you can filter stale or out-of-scope content. This is unglamorous data engineering, and it is where most production RAG systems silently rot.
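One simple way to drive incremental re-indexing is to store a content hash per document at index time and compare it against the current corpus. The sketch below only plans the work. It assumes documents are available as a mapping from ID to text, and it leaves the actual upsert and delete calls to whatever vector store you use. The function names and document IDs are hypothetical.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(current_docs, indexed_hashes):
    """Decide which documents to re-embed and which to remove.

    current_docs:   {doc_id: text} as the knowledge base stands now
    indexed_hashes: {doc_id: hash} recorded when each doc was last indexed
    """
    # New or edited docs: no stored hash, or the stored hash no longer matches.
    to_upsert = sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    # Docs that were removed at the source must also leave the index.
    to_delete = sorted(
        doc_id for doc_id in indexed_hashes if doc_id not in current_docs
    )
    return to_upsert, to_delete


# Hypothetical state: one edited policy, one retired page, one new launch.
indexed = {
    "refund-policy": content_hash("Refunds within 30 days."),
    "old-pricing": content_hash("Legacy pricing page."),
    "faq": content_hash("Frequently asked questions."),
}
current = {
    "refund-policy": "Refunds within 45 days.",
    "faq": "Frequently asked questions.",
    "new-product": "Launch announcement.",
}

to_upsert, to_delete = plan_sync(current, indexed)
print(to_upsert, to_delete)
```

Storing the hash alongside a timestamp and source in chunk metadata also gives you the filter fields mentioned above at no extra cost.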
What to actually build first
Build the evaluation harness before you tune anything. Then make retrieval hybrid and observable — log every query, the chunks it returned, their scores, and the final answer, so you can replay failures instead of reproducing them by hand. Add the relevance gate and citations. Only after all that does it make sense to argue about which model or chunk size is best, because now you can prove it.
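Observability can start as one JSON line per request. The sketch below assumes chunks arrive as (chunk ID, score) pairs and writes to any file-like object; the record fields are our choice, not a standard. An in-memory buffer stands in for a real log file.

```python
import io
import json
import time
import uuid


def log_trace(log_file, query, chunks, answer):
    """Append one retrieval trace as a JSON line and return its ID."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        "chunks": [{"id": cid, "score": score} for cid, score in chunks],
        "answer": answer,
    }
    log_file.write(json.dumps(record) + "\n")
    return record["trace_id"]


def load_traces(log_file):
    """Read traces back so failed queries can be replayed."""
    return [json.loads(line) for line in log_file if line.strip()]


# In-memory buffer standing in for an append-only log file.
buffer = io.StringIO()
trace_id = log_trace(
    buffer,
    query="can I get my money back after 40 days",
    chunks=[("c1", 0.82), ("c4", 0.77)],
    answer="Refunds are accepted within 30 days [c1].",
)

buffer.seek(0)
traces = load_traces(buffer)
print(traces[0]["query"], traces[0]["chunks"])
```

Because each trace stores the query and the retrieved chunk IDs, a bad answer from last Tuesday can be fed straight back through the retriever and compared against what it returns today.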
None of this is exotic. It's just the part that doesn't demo well, so it gets deferred — and deferring it is exactly why the thing that wowed the room falls over in week one.
The bottom line: a RAG demo tests whether your model can talk; production tests whether your retrieval, evaluation, and data pipeline can keep it honest. Build the eval set first, make retrieval hybrid and observable, give the model permission to say "I don't know," and treat your index as a living pipeline. The model was never the hard part.