
Why your RAG system isn't shipping (and what's wrong with the eval)

Most RAG demos miss production for the same reasons: weak eval, no citation enforcement, and a vector database chosen for slideware. A practical checklist before the first deploy.

The two-day demo

A RAG demo takes about two days for a competent engineer. Pick a vector database, embed your documents with text-embedding-3-large, retrieve top-k, stuff into the prompt, generate an answer. The demo answers questions. Stakeholders are pleased. Roadmaps are drawn.
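For the record, that pipeline in full. A minimal sketch, assuming the openai Python package, an API key in the environment, and a corpus small enough to brute-force in memory; the model names are illustrative:

```python
import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-large", input=text)
    return resp.data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def answer(query: str, docs: list[str], k: int = 5) -> str:
    # Brute-force top-k over in-memory embeddings; the demo's vector
    # database does the same job, just faster and with persistence.
    doc_vecs = [embed(d) for d in docs]
    q = embed(query)
    scored = sorted(zip(docs, doc_vecs), key=lambda p: cosine(q, p[1]), reverse=True)
    context = "\n---\n".join(d for d, _ in scored[:k])
    prompt = f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content
```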

Three months later, the system is not in production. Here are the most common reasons we see, and what to fix before the next attempt.

1. Your eval is anecdotal

The hardest part of shipping RAG is not the architecture; it is convincing yourself the system is good enough. Most teams stop at "we tried it on a few questions and it answered well." That is not an eval.

A working eval has:

  • A golden set — questions with known answers, curated by domain experts, covering the actual distribution of queries you expect in production.
  • Retrieval metrics — hit rate at k, MRR, NDCG against the golden set (see the sketch below).
  • Generation metrics — faithfulness, answer relevancy via LLM-as-judge.
  • CI integration — every PR runs the eval, regressions block merge.

If you cannot block a deploy on a number, you do not have an eval. You have a demo.
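
A minimal sketch of the first two retrieval metrics, assuming a golden set whose items carry the chunk ids a correct retrieval must surface, plus a `retrieve(question, k)` function of your own that returns ranked chunk ids:

```python
from dataclasses import dataclass

@dataclass
class GoldenItem:
    question: str
    relevant_ids: set[str]  # chunk ids a correct retrieval must surface

def hit_rate_at_k(golden_set: list[GoldenItem], retrieve, k: int = 10) -> float:
    # Fraction of questions with at least one relevant chunk in the top k.
    hits = sum(
        1 for item in golden_set
        if any(cid in item.relevant_ids for cid in retrieve(item.question, k))
    )
    return hits / len(golden_set)

def mrr(golden_set: list[GoldenItem], retrieve, k: int = 10) -> float:
    # Mean reciprocal rank of the first relevant chunk.
    total = 0.0
    for item in golden_set:
        for rank, cid in enumerate(retrieve(item.question, k), start=1):
            if cid in item.relevant_ids:
                total += 1.0 / rank
                break
    return total / len(golden_set)
```

Wire the outputs into CI with hard thresholds; a number below threshold fails the build.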

2. Citations are decorative

Production RAG must cite. Not as a feature — as a hard constraint that the prompt enforces and the eval checks. If your system outputs an answer without a source span, that is a failure mode you log, not a "minor edge case."

Citation enforcement is a prompt-layer concern, not a UI-layer concern. The model emits citations, the post-processor validates that every claim has a source span in the retrieved chunks, and the system refuses to return answers that fail validation.
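
A sketch of that post-processing gate, under the assumption that the model is prompted to emit claim/citation pairs and that citations reference retrieved chunks by id:

```python
from dataclasses import dataclass

@dataclass
class Citation:
    chunk_id: str
    span: str  # verbatim text the claim rests on

@dataclass
class Claim:
    text: str
    citation: Citation | None

class CitationError(Exception):
    pass

def validate_citations(claims: list[Claim], retrieved: dict[str, str]) -> None:
    # Every claim must cite a span that literally appears in a retrieved chunk.
    for claim in claims:
        c = claim.citation
        if c is None:
            raise CitationError(f"uncited claim: {claim.text!r}")
        chunk = retrieved.get(c.chunk_id)
        if chunk is None or c.span not in chunk:
            raise CitationError(f"citation does not match retrieved text: {claim.text!r}")
```

The refusal is the point: an answer that fails validation never reaches the user.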

3. The vector database was chosen for the demo

Picking Pinecone because the docs are friendly is reasonable for a demo. It is not reasonable for production under a bank's compliance posture.

The right vector database is the one your operations team can run, with the filtering capability your queries require, at the scale your corpus demands. For most enterprise builds we ship today, that is Qdrant. For teams that already run Postgres at scale and have a corpus under ~10M chunks, it is pgvector. Pinecone is sometimes the right call when the team already pays for it and migration cost outweighs the benefit.
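
To make "filtering capability" concrete, a hedged sketch of a filtered query with the qdrant-client package; the collection name and payload field are illustrative, and the API surface shifts between client versions:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")

query_vector = [0.0] * 3072  # stand-in; use your real query embedding

hits = client.search(
    collection_name="docs",  # illustrative collection name
    query_vector=query_vector,
    # Metadata filtering at query time: restrict the search to one
    # department's corpus instead of filtering after the fact.
    query_filter=Filter(
        must=[FieldCondition(key="department", match=MatchValue(value="legal"))]
    ),
    limit=10,
)
```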

4. There is no reranker

Vector retrieval alone is not good enough for production RAG. Cross-encoder reranking lifts precision at the top of the result set in ways that retrieval scores cannot approximate. Skipping the reranker is the single most common cause of "the system retrieves the right document but the model picks the wrong chunk."

This is non-negotiable in our builds. Cohere Rerank, BGE-reranker, or a fine-tuned cross-encoder: the exact choice is an engineering decision, but a reranker is always in the pipeline.
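
A minimal reranking sketch, assuming the sentence-transformers package and the openly published BGE reranker weights:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large")

def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    # Score each (query, chunk) pair jointly; a cross-encoder reads both
    # texts together, which is what lifts precision over embedding scores.
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda p: p[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```

Retrieve wide (say, top 50 from the vector database), then rerank down to the handful of chunks the model actually sees.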

5. Observability stops at "the API responded"

When the answer is wrong, you need to walk back through the trace: what was the query, what was retrieved, what were the rerank scores, what did the model see, what did it generate, what did it cite. If your observability stops at HTTP-level traces, debugging goes from minutes to days.

We default to Langfuse or a custom OTEL pipeline that captures every retrieval, every rerank, every generation, every cost.
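
A hedged sketch of the custom-OTEL flavor; the span and attribute names are illustrative, and the retrieve, rerank, and generate callables stand in for your own pipeline stages:

```python
from opentelemetry import trace

tracer = trace.get_tracer("rag-pipeline")

def traced_answer(query: str, retrieve, rerank, generate) -> str:
    # One root span per query, one child span per stage, so a bad answer
    # can be replayed: query -> retrieved chunks -> rerank scores -> output.
    with tracer.start_as_current_span("rag.query") as root:
        root.set_attribute("rag.query.text", query)

        with tracer.start_as_current_span("rag.retrieve") as span:
            chunks = retrieve(query)
            span.set_attribute("rag.retrieve.count", len(chunks))

        with tracer.start_as_current_span("rag.rerank") as span:
            scored = rerank(query, chunks)  # expected: list of (chunk, score)
            span.set_attribute("rag.rerank.top_score", float(scored[0][1]))

        with tracer.start_as_current_span("rag.generate") as span:
            text, cost_usd = generate(query, [c for c, _ in scored])
            span.set_attribute("rag.generate.cost_usd", cost_usd)
            return text
```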

The checklist

Before the next production attempt, every box ticked:

  • [ ] Golden set exists, curated by domain experts.
  • [ ] Retrieval evals (hit rate, MRR, NDCG) run in CI.
  • [ ] Generation evals (faithfulness, relevancy) run in CI.
  • [ ] Citation enforcement at the prompt layer, validated at post-processing.
  • [ ] Vector database chosen for production constraints, not demo experience.
  • [ ] Cross-encoder reranker in the pipeline.
  • [ ] Per-query trace observability with retrieval scores and costs.

If any box is empty, the system is not ready for the production review. The fastest path back to ready is usually a focused engagement — discovery first, then build.

Next step

Talk to an engineer, not a salesperson.

30 minutes. No slides. Bring an architecture, a stalled roadmap, or a vendor proposal you want a second opinion on. We'll tell you what we'd do.