Building RAG systems that actually work in production

Most RAG systems fail at retrieval, not generation. The language model can only be as good as the documents you give it. Learn how chunking strategy, hybrid search, and proper evaluation determine whether your RAG system works in production or joins the 70% that fail.

Key takeaways

  • RAG fails at retrieval, not generation - Up to 70% of production RAG systems fail because they retrieve poor quality documents, not because the language model is inadequate
  • Chunking strategy determines retrieval quality - How you break documents into pieces has more impact than which embedding model you choose, with semantic chunking outperforming fixed-size approaches by 40%
  • Hybrid search beats pure semantic search - Combining keyword search with semantic search catches exact matches that embeddings miss, especially for technical terms and code
  • Measure retrieval before generation - Track precision and recall on retrieved documents separately from response quality to identify where systems actually break

Everyone building a RAG system starts with the language model. Wrong place to start.

The model can only work with what you retrieve. I keep seeing teams spend weeks fine-tuning prompts and tweaking generation parameters when their real problem is simpler: they’re feeding the model garbage. Research from teams at Docker, CircleCI, and Reddit found that data quality problems account for most production failures, not model limitations.

Your RAG system is a retrieval system first, a generation system second.

Why most RAG fails at retrieval

The first mistake when building a RAG system: focusing on the language model instead of document quality. You chunk your documents without thinking about semantic boundaries. You embed them with whatever model is popular. You throw them in a vector database and hope similarity search finds the right stuff.

It doesn’t.

Up to 70% of RAG systems fail in production despite working fine in demos. The difference between demo and production? Your demo uses clean, carefully formatted test documents. Production data is messy. PDFs with tables. Legal documents with nested clauses. Code with inconsistent formatting. Support tickets with terrible grammar.

When you apply simple fixed-size chunking to this reality, you get nonsensical pieces. A chunk that starts mid-sentence and ends mid-thought. A table split across three chunks with no context. Critical information separated from the question it answers.

The embedding model can’t save you from bad chunks. Garbage in, garbage out holds especially true for RAG.

The chunking decision that matters most

Research on chunking strategies shows semantic chunking outperforming fixed-size approaches by roughly 40% in retrieval accuracy. But semantic chunking takes more work.

You need to understand document structure. Respect paragraph boundaries. Keep related concepts together. Preserve enough context so each chunk makes sense alone.

Start with roughly 250 tokens per chunk - that’s about 1000 characters. Not because this is optimal, but because it’s a sensible baseline for testing. Too small and you lose context. Too large and you retrieve irrelevant information alongside what you actually need.
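
If you want a concrete starting point, here is a minimal sketch of that baseline: fixed-size chunking by character count, with roughly 1,000 characters standing in for 250 tokens. The small overlap is an illustrative choice, not a recommendation from the research above.

```python
def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Naive baseline: ~1,000-character windows (about 250 tokens) with a small overlap."""
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```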

More important than chunk size: preserve semantic completeness. A chunk should contain a complete thought. If you’re chunking technical documentation, break at section boundaries. If you’re processing legal documents, respect clause structure. For support tickets, keep the question and answer together.
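
Here is a hedged sketch of what that looks like for markdown-style documentation: split at section headings first, and only fall back to paragraph boundaries when a section exceeds the size budget. The regex and function names are illustrative, not a reference implementation.

```python
import re

def semantic_chunks(doc: str, max_chars: int = 1000) -> list[str]:
    """Split on section headings, then on paragraphs, so each chunk stays a complete thought."""
    sections = re.split(r"\n(?=#{1,6} )", doc)  # split before markdown headings
    chunks = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section.strip())
            continue
        current = ""
        for paragraph in section.split("\n\n"):
            # Close the current chunk before it exceeds the budget, never mid-paragraph.
            if current and len(current) + len(paragraph) > max_chars:
                chunks.append(current.strip())
                current = ""
            current += paragraph + "\n\n"
        if current.strip():
            chunks.append(current.strip())
    return [chunk for chunk in chunks if chunk]
```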

Teams building production RAG systems learned this the hard way. They started with simple chunking, watched retrieval quality crater with real data, then spent months rebuilding with document-aware strategies.

Hybrid search beats semantic-only

Pure semantic search misses exact matches. Ask about “BM25 algorithm” and semantic search might return documents about ranking methods without mentioning BM25 specifically. Ask for a specific error code and you might get general troubleshooting instead of the exact error.

Hybrid search combines semantic and keyword approaches, using something like BM25 for keyword matching alongside vector similarity. The combination catches both: conceptually similar content through embeddings, exact terminology through keywords.

This matters most for technical content. Code snippets. Error messages. Product names. Acronyms. The stuff where exact matching matters more than semantic similarity.

Implementation is straightforward. Run both searches in parallel, then combine results using reciprocal rank fusion or score normalization. Research shows hybrid approaches achieving better precision and recall than either method alone.
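
As a sketch of the fusion step, here is reciprocal rank fusion over two or more ranked lists of document IDs (say, one from BM25 and one from vector search). The k = 60 constant is the conventional default; the input format is an assumption.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Each document scores sum(1 / (k + rank)) across the lists it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([bm25_ids, vector_ids])
```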

Most vector databases now support this natively. Weaviate’s hybrid search scored higher on both page and paragraph-level retrieval in testing. Pinecone offers it. Even PostgreSQL with pgvector extensions can do it.

The performance difference isn’t subtle. Teams report 30-40% improvement in retrieval quality just by adding keyword search to their semantic pipeline.

Measuring what actually works

The hardest part of building a RAG system isn’t the code - it’s knowing whether it works. RAG evaluation is tricky because you’re measuring two things: retrieval quality and generation quality.

Separate them.

For retrieval: precision and recall. Of the chunks you retrieved, what percentage were actually relevant? Of all relevant chunks, what percentage did you retrieve? Teams using frameworks like RAGAS track these metrics continuously.

Start with a curated test set. Take 50-100 real queries. Manually identify which documents should be retrieved for each. Now you have ground truth to measure against.

Track metrics that matter (a quick computation sketch follows this list):

  • Precision at k (are my top 5 results relevant?)
  • Recall (did I find all relevant documents?)
  • Mean reciprocal rank (where does the first relevant result appear?)
  • NDCG (are more relevant results ranked higher?)
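
A per-query sketch of these metrics with binary relevance, assuming you have ground-truth document IDs from your curated test set; average the results across queries. The names and the binary-gain NDCG are simplifications.

```python
import math

def retrieval_metrics(retrieved: list[str], relevant: set[str], k: int = 5) -> dict[str, float]:
    """Binary-relevance precision@k, recall, reciprocal rank, and NDCG@k for one query."""
    top_k = retrieved[:k]
    hits = [doc in relevant for doc in top_k]
    precision_at_k = sum(hits) / k
    recall = sum(doc in relevant for doc in retrieved) / max(len(relevant), 1)
    # Reciprocal rank: 1 / position of the first relevant result anywhere in the ranking.
    reciprocal_rank = next((1.0 / (i + 1) for i, doc in enumerate(retrieved) if doc in relevant), 0.0)
    # NDCG@k with binary gains.
    dcg = sum(int(hit) / math.log2(i + 2) for i, hit in enumerate(hits))
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    ndcg = dcg / ideal if ideal else 0.0
    return {"precision@k": precision_at_k, "recall": recall, "rr": reciprocal_rank, "ndcg@k": ndcg}
```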

Don’t just measure final responses. A RAG system can give good answers despite bad retrieval if it gets lucky, or give bad answers despite perfect retrieval if generation fails. You need to know which component is breaking.

For generation quality, track faithfulness and answer relevance. Faithfulness: is the answer actually supported by retrieved documents? Answer relevance: does it actually address the question?
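
Frameworks like RAGAS implement both checks; the underlying idea can be sketched with a plain LLM-as-judge call. The prompt wording and model name below are illustrative assumptions, not the RAGAS implementation.

```python
from openai import OpenAI

client = OpenAI()

def judge_faithfulness(answer: str, contexts: list[str]) -> str:
    """Ask a judge model whether every claim in the answer is supported by the retrieved chunks."""
    prompt = (
        "Retrieved context:\n" + "\n---\n".join(contexts)
        + f"\n\nAnswer:\n{answer}\n\n"
        + "Is every factual claim in the answer supported by the context above? "
        + "Reply YES or NO, then explain briefly."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```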

Building for production from day one

Here’s what production-ready architecture looks like.

Delta processing for document updates. Don’t re-embed your entire document collection when one page changes. Build a system similar to git diff that only processes what changed. Saves compute, reduces latency, prevents version drift.
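
One hedged way to implement that: store a content hash per document and re-embed only what changed between runs. The JSON state file and function names are assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path

def documents_to_reembed(docs: dict[str, str], state_file: str = "embed_state.json") -> list[str]:
    """Return the IDs of documents whose content hash changed since the last run."""
    path = Path(state_file)
    previous = json.loads(path.read_text()) if path.exists() else {}
    current = {doc_id: hashlib.sha256(text.encode()).hexdigest() for doc_id, text in docs.items()}
    changed = [doc_id for doc_id, digest in current.items() if previous.get(doc_id) != digest]
    path.write_text(json.dumps(current))
    return changed
```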

Monitoring and alerting. Track retrieval latency, embedding generation time, and database query performance. Set alerts for sudden drops in precision or spikes in retrieval time. Production systems need observability to catch degradation before users complain.
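
A minimal sketch of that instrumentation, using standard-library timing and logging; in practice you would push these numbers to whatever metrics backend you already run, and the alert threshold here is an assumption.

```python
import logging
import time

logger = logging.getLogger("rag.retrieval")
LATENCY_ALERT_SECONDS = 0.5  # illustrative threshold

def timed_retrieve(query: str, retriever) -> list:
    """Wrap retrieval with latency tracking and a simple alert on slow queries."""
    start = time.perf_counter()
    results = retriever(query)
    elapsed = time.perf_counter() - start
    logger.info("retrieval_latency_seconds=%.3f results=%d", elapsed, len(results))
    if elapsed > LATENCY_ALERT_SECONDS:
        logger.warning("slow retrieval (%.3fs) for query: %s", elapsed, query)
    return results
```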

Fallback strategies for missing information. RAG systems break when asked about topics not in the knowledge base. Detect low-confidence retrievals and handle them explicitly rather than hallucinating.
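
One hedged way to do that: treat a low top similarity score as "not in the knowledge base" and answer explicitly instead of generating. The threshold and the retrieve/generate signatures are assumptions you would tune and adapt to your own stack.

```python
MIN_SIMILARITY = 0.35  # illustrative; tune against your own evaluation set

def answer_or_decline(query: str, retrieve, generate) -> str:
    """Decline when the best retrieved chunk falls below the confidence threshold."""
    results = retrieve(query)  # assumed to return [(chunk_text, similarity_score), ...] sorted by score
    if not results or results[0][1] < MIN_SIMILARITY:
        return "I don't have enough information in the knowledge base to answer that."
    context = "\n\n".join(chunk for chunk, _ in results)
    return generate(query, context)
```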

Cost optimization matters at scale. Embedding models vary significantly in cost and latency. OpenAI’s models are expensive but high quality. Open source alternatives like E5 or BGE offer comparable performance at lower cost. Domain-specific models like voyage-finance can outperform general models for specialized content.

Vector database choice impacts both performance and cost. Pinecone handles billions of vectors with consistent sub-50ms latency but costs more. Weaviate gives you flexibility and hybrid search with open source options. Chroma works great for prototypes and smaller deployments. Benchmark performance with your actual data before committing.

Start simple, measure everything, iterate

Building a RAG system that works in production means building a retrieval system that works in production. Get that right and generation becomes almost easy.

Start with semantic chunking that respects document structure. Add hybrid search from day one. Measure retrieval quality separately from generation quality. Build monitoring before you need it.

The teams that succeed with RAG aren’t the ones with the fanciest models. They’re the ones who treated retrieval as the hard problem it is.

About the Author

Amit Kothari is an experienced consultant, advisor, and educator specializing in AI and operations. With 25+ years of experience and as the founder of Tallyfy (raised $3.6m), he helps mid-size companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding.

Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.