Production RAG: The Chunking, Retrieval, and Evaluation Strategies That Actually Work

  • Fixed-size chunking at 512 tokens actively destroys the context your retrieval system needs to work
  • Semantic chunking that respects document structure can improve retrieval accuracy by up to 30%
  • RAG is fundamentally a system design problem, not just a retrieval optimization challenge
  • Bridging the demo-to-production gap requires rethinking chunking, retrieval, and evaluation as interconnected components

The Painful Reality

Your team spent weeks building a RAG prototype. The demo went great. Leadership was impressed. Then you deployed it, and the complaints started rolling in. Users get irrelevant answers. The system confidently cites context that has nothing to do with their question. Important information gets missed entirely, even though you know it's in the knowledge base.

This isn't a bug in your code. It's a fundamental architectural problem that almost every RAG implementation shares. The basic tutorial approach that works beautifully on clean, simple documents falls apart the moment you introduce real-world complexity: nested document structures, questions spanning multiple sections, tables and lists, thousands of documents to search through.

The Root Cause: Context Destruction

The standard RAG tutorial teaches you to split documents into fixed-size chunks of around 512 tokens, embed them, and retrieve the top-k similar chunks. This approach has a fatal flaw: it treats text as a stream of tokens rather than structured information with meaning.

When you cut a document every 512 tokens, paragraphs get severed mid-thought. The second half lands in a different chunk, stripped of the context that made it meaningful. When a user asks a question that touches on the severed idea, your retrieval system might find one half but miss the other. The LLM then generates an answer from incomplete information, producing the hallucinations and missed context that plague production systems.
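
To see the failure mode concretely, here's the tutorial-style splitter reduced to its essentials (whitespace tokens stand in for real tokenizer output):

```python
def fixed_size_chunks(text: str, chunk_size: int = 512) -> list[str]:
    # Cut every chunk_size tokens regardless of what the text is doing
    # at that point.
    tokens = text.split()
    return [
        " ".join(tokens[i : i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]

doc = "The retention policy is defined in section 2. " * 40
chunks = fixed_size_chunks(doc, chunk_size=100)
print(chunks[0][-30:], "|", chunks[1][:30])  # the cut lands mid-sentence
```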

The Solution: Structure-Aware Chunking

Semantic chunking splits on natural boundaries like paragraphs and sections while respecting size limits. You preserve the logical flow of ideas. You add overlap between chunks so concepts spanning boundaries get captured in both. For structured documents like technical manuals or legal contracts, you preserve hierarchy by including section headers and parent context in your chunk metadata.
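
A minimal sketch of that idea in plain Python (assumptions: blank lines separate paragraphs, and whitespace token counts approximate the real tokenizer):

```python
def semantic_chunks(text: str, max_tokens: int = 512, overlap: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks under a size budget; carry the
    last `overlap` paragraphs of each chunk into the next so ideas that
    span a boundary land in both."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in paragraphs:
        n = len(para.split())
        if current and size + n > max_tokens:
            chunks.append("\n\n".join(current))
            current = current[-overlap:]  # the overlap between chunks
            size = sum(len(p.split()) for p in current)
        current.append(para)
        size += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```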

When retrieval happens, the system understands not just what text matched, but where it sits in the document's structure. A chunk about database configuration includes the header "Database Setup" and knows it's part of the "Installation Guide" section. This context helps both retrieval accuracy and answer generation.
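
A sketch of how that metadata can be attached, assuming markdown input with # and ## headings from the parsing stage:

```python
import re

def chunks_with_hierarchy(markdown_text: str) -> list[dict]:
    """Attach the active section path (e.g. "Installation Guide >
    Database Setup") to every chunk."""
    headers = {1: None, 2: None}
    chunks = []
    for block in markdown_text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        m = re.match(r"^(#{1,2})\s+(.+)", block)
        if m:
            level = len(m.group(1))
            headers[level] = m.group(2)
            if level == 1:
                headers[2] = None  # a new chapter resets the subsection
            continue
        path = " > ".join(h for h in (headers[1], headers[2]) if h)
        chunks.append({"text": block, "section_path": path})
    return chunks
```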

Semantic chunking can improve retrieval accuracy by up to 30% compared to fixed-size approaches. That's not a marginal gain from hyperparameter tuning. That's the difference between a system that frustrates users and one that actually helps them.

Tools for Better Chunking

The good news is you don't have to build structure-aware chunking from scratch. A growing ecosystem of tools can help, ranging from open source libraries to commercial services. The choice matters especially when you're dealing with complex file formats like PDFs, PowerPoints, and Word documents where structure isn't just text—it's slides, tables, headers, and visual hierarchy.

Open Source Options

LangChain provides multiple text splitters out of the box, including recursive character splitting, markdown-aware splitting, and HTML-based splitting. It's a good starting point but requires configuration to get semantic chunking right.
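
For example, you can compose the markdown-aware splitter with the recursive splitter so header context survives the size pass (a sketch assuming the langchain-text-splitters package; note that chunk_size counts characters by default, not tokens):

```python
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

markdown_doc = open("guide.md").read()  # hypothetical markdown input

# First split on headers so each piece carries its section context
# as metadata...
header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "section"), ("##", "subsection")]
)
sections = header_splitter.split_text(markdown_doc)

# ...then enforce a size budget within each section.
size_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
chunks = size_splitter.split_documents(sections)
```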

LlamaIndex offers more sophisticated node parsers that can preserve document hierarchy. Its SentenceSplitter and SemanticSplitterNodeParser are specifically designed for context-preserving chunking, and it integrates well with various document loaders.
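
A minimal sketch, assuming the llama-index-core package:

```python
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(
    [Document(text=raw_text)]  # raw_text: your parsed document string
)
# Nodes keep relationships to their neighbors, which retrieval can use
# to pull in surrounding context at query time.
```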

Unstructured is the go-to open source library for parsing complex documents. It handles PDFs, PowerPoints, Word docs, HTML, and more, extracting not just text but structural elements like titles, tables, and list items. This structural awareness makes downstream chunking far more effective.
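
A sketch of the partitioning call (assumes the unstructured package with the extras for your file types; the filename is hypothetical):

```python
from unstructured.partition.auto import partition

elements = partition(filename="manual.pdf")
for el in elements:
    # Elements are typed (Title, NarrativeText, Table, ListItem, ...),
    # so the chunker sees structure instead of one flat string.
    print(el.category, el.text[:60])
```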

Chonkie is a newer lightweight library focused specifically on chunking strategies. It provides semantic chunking, sentence-based chunking, and token-based approaches with a simple API.
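
A sketch of its token-based chunker (assumes the chonkie package; class and attribute names here reflect recent releases and may differ in your version):

```python
from chonkie import TokenChunker

chunker = TokenChunker(chunk_size=512, chunk_overlap=64)
for chunk in chunker.chunk(raw_text):  # raw_text: your document string
    print(chunk.token_count, chunk.text[:50])
```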

Docling from IBM Research excels at understanding document layout, particularly for PDFs with complex structures. It preserves reading order, tables, and hierarchical sections—critical for technical documentation and reports.
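
A sketch (assumes the docling package; the filename is hypothetical):

```python
from docling.document_converter import DocumentConverter

result = DocumentConverter().convert("report.pdf")
# The export keeps reading order, tables, and section hierarchy intact,
# giving the chunking stage real structure to split on.
markdown = result.document.export_to_markdown()
```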

Commercial Solutions

Unstructured.io (the commercial offering) provides hosted document processing with better accuracy on complex layouts, OCR for scanned documents, and higher throughput than the open source version. Worth evaluating if your documents include scanned PDFs or complex visual layouts.

LlamaParse from LlamaIndex focuses specifically on high-fidelity document parsing. It handles tables, charts, and embedded images particularly well, outputting clean markdown that preserves document structure for chunking.
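
A sketch (assumes the llama-parse package and a LLAMA_CLOUD_API_KEY in the environment; the filename is hypothetical):

```python
from llama_parse import LlamaParse

parser = LlamaParse(result_type="markdown")  # tables and headers come back as markdown
docs = parser.load_data("contract.pdf")
print(docs[0].text[:500])
```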

Azure AI Document Intelligence (formerly Form Recognizer) offers pre-built models for invoices, receipts, and contracts, plus custom model training. Strong choice if you're already in the Azure ecosystem and need to process structured business documents.
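
A sketch using the azure-ai-formrecognizer SDK (the newer azure-ai-documentintelligence package exposes a similar client; endpoint, key, and filename here are placeholders):

```python
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)
with open("invoice.pdf", "rb") as f:  # hypothetical input
    poller = client.begin_analyze_document("prebuilt-invoice", document=f)
result = poller.result()
for document in result.documents:
    for name, field in document.fields.items():
        print(name, field.value)
```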

AWS Textract provides similar capabilities in the AWS ecosystem, with good table extraction and form parsing. Integrates naturally with other AWS services for end-to-end pipelines.
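
A sketch calling Textract through boto3 (assumes configured AWS credentials; the bucket and file names are hypothetical). The synchronous API shown here handles single-page documents; multi-page PDFs go through the asynchronous start_document_analysis flow:

```python
import boto3

textract = boto3.client("textract")
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "my-docs", "Name": "form.png"}},
    FeatureTypes=["TABLES", "FORMS"],
)
# Blocks are typed (PAGE, LINE, TABLE, CELL, KEY_VALUE_SET, ...), with
# geometry and parent/child links you can reassemble into structure.
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])
```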

Reducto specializes in turning complex documents into clean, structured data optimized for RAG. It handles edge cases like multi-column layouts and embedded tables that trip up simpler parsers.

Choosing the Right Tool

For plain text and markdown, the open source libraries are usually sufficient. The complexity explodes when you hit real enterprise documents: scanned PDFs, PowerPoints with speaker notes, Excel files with multiple sheets, Word documents with track changes.

Start with Unstructured (open source) for most document types. It handles the widest range of formats and gives you structural metadata to inform your chunking strategy. If you're hitting accuracy issues with complex layouts or scanned documents, evaluate the commercial options—the cost often pays for itself in reduced manual cleanup and better retrieval quality.

The key insight: parsing and chunking are separate concerns. Use a robust parser to extract structure, then apply semantic chunking to the parsed output. Teams that conflate these steps—or skip structure extraction entirely—end up with the context destruction problems we discussed earlier.
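
In code, that separation can be as simple as two stages (a sketch reusing the Docling call and the semantic_chunks function from the earlier sketches; any structure-preserving parser slots in the same way):

```python
from docling.document_converter import DocumentConverter

def parse_then_chunk(path: str, max_tokens: int = 512) -> list[str]:
    # Stage 1: parse for structure.
    result = DocumentConverter().convert(path)
    markdown = result.document.export_to_markdown()  # hierarchy preserved
    # Stage 2: chunk the structured output.
    return semantic_chunks(markdown, max_tokens=max_tokens)
```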

Beyond Chunking: System Design Thinking

Most organizations approach RAG as a retrieval optimization problem. They tune embedding models, experiment with different vector databases, adjust the number of chunks retrieved. These tweaks help at the margins, but they can't fix a system that's feeding broken context to the LLM in the first place.

RAG is fundamentally a system design problem, not just a retrieval problem. The chunking strategy, retrieval approach, reranking logic, and evaluation framework all need to work together. Hybrid retrieval approaches that combine vector search with traditional keyword matching like BM25 consistently outperform either method alone. Vector search finds semantically similar content but misses exact terminology. Keyword search catches precise matches but misses paraphrases. Together, they cover each other's blind spots.
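
One simple way to combine the two result lists is reciprocal rank fusion. The article doesn't prescribe a fusion method, so treat this as one reasonable sketch:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs; k=60 is the conventional
    damping constant from the original RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc7"]   # from the vector index (hypothetical)
keyword_hits = ["doc1", "doc9", "doc3"]  # from BM25 (hypothetical)
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# ['doc1', 'doc3', 'doc9', 'doc7'] -- documents both retrievers found rise
```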

Adding a cross-encoder reranking step on top of hybrid retrieval further improves precision by evaluating query-document pairs more carefully than initial retrieval can; a sketch follows below.

Evaluation metrics matter too. Most teams measure retrieval recall and precision, which tells you whether you're finding relevant chunks. But user satisfaction depends on whether the final answer is correct, complete, and properly grounded.
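
Here's what that reranking pass can look like with the sentence-transformers CrossEncoder class (the package and model name are common choices, not something this article prescribes; run it only on the shortlist that hybrid retrieval returns):

```python
from sentence_transformers import CrossEncoder

# A widely used public MS MARCO checkpoint; any cross-encoder works.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # Score each (query, document) pair jointly. This is slower than
    # bi-encoder similarity but weighs the actual interaction between
    # query and text, which is why it sharpens precision.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```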

Business Impact and Staffing

This has real implications for how you staff and manage RAG projects. You don't just need ML engineers who understand embeddings. You need people who can think about information architecture, who understand how your documents are structured and how your users actually ask questions. The teams that succeed at production RAG treat it as an end-to-end system rather than a collection of independent components to optimize separately.

Preserving document hierarchy and structure is critical for accurate retrieval. A legal document where contract clauses reference earlier definitions needs that hierarchical context preserved. A technical manual where troubleshooting steps reference configuration settings from previous sections needs those connections maintained.

The Path Forward

The RAG hype cycle is entering its accountability phase. Organizations that rushed to deploy basic implementations are discovering that demos and production systems have very different requirements. This creates an opportunity for teams willing to do the harder work of proper system design.

The companies that will win with RAG aren't the ones with the fanciest embedding models or the largest vector databases. They're the ones that understand their documents' structure, respect how information flows within them, and build retrieval systems that preserve context rather than destroy it. The 30% improvement from better chunking is just the starting point. Layer on hybrid retrieval, intelligent reranking, and proper evaluation frameworks, and you can build RAG systems that actually deliver on their promise.

But it starts with acknowledging that the tutorial approach was never meant for production.


Originally reported by Towards AI
