
Why Chunking is Critical
Understand the fundamental role of chunking in determining retrieval relevance and LLM response quality.
Chunking is the process of breaking large documents into smaller, semantically meaningful pieces. In RAG, "Goldilocks" chunking (not too big, not too small) is the difference between an accurate answer and a hallucination.
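To make this concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. The word-based splitting, 500-word chunk size, and 50-word overlap are illustrative assumptions rather than recommendations; production splitters usually count tokens and respect sentence or section boundaries.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks of roughly `chunk_size` words,
    repeating `overlap` words between consecutive chunks.
    Word counts are a rough stand-in for token counts."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

document = "Retrieval-Augmented Generation grounds an LLM in external documents. " * 200
print(len(chunk_text(document)), "chunks")
```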
The Retrieval Dilemma
If your chunks are too large:
- The chunk's embedding becomes "diluted" across several topics, so it weakly matches many unrelated queries.
- You might exceed the LLM's context window.
- The LLM gets overwhelmed with "noise."
If your chunks are too small:
- You lose the surrounding context (e.g., a sentence that says "He did it" without knowing who "He" is); the sketch after this list shows this failure in miniature.
- Retrieval becomes fragmented: the answer may be spread across several chunks, and no single chunk scores highly enough to be retrieved.
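To see the "too small" failure concretely, the toy sketch below splits a passage into sentence-level chunks and retrieves with a naive keyword matcher (a stand-in for embedding search, invented for this example). The winning chunk is useless on its own because it never names who "He" is.

```python
import re

# Toy illustration: a sentence-level chunk wins retrieval but lacks the antecedent.
passage = (
    "Daniel Alvarez signed the merger agreement on Friday. "
    "He did it despite objections from the board."
)

sentence_chunks = re.split(r"(?<=\.)\s+", passage)  # "too small" chunks

def words(s: str) -> set[str]:
    return set(re.findall(r"[a-z]+", s.lower()))

def lexical_overlap(chunk: str, query: str) -> int:
    """Naive lexical retriever: count shared lowercase words."""
    return len(words(chunk) & words(query))

query = "Who acted despite objections from the board?"
best = max(sentence_chunks, key=lambda c: lexical_overlap(c, query))
print(best)  # "He did it despite objections from the board." -- but who is "He"?
```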
The Metric: Context Window vs. Information Density
Modern LLMs have huge context windows (128k+ tokens), but their ability to find a "needle in a haystack" decreases as the haystack grows. Small, precise chunks are still the best practice for accurate grounding.
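Because these limits are expressed in tokens, it helps to measure chunks with the same tokenizer the model family uses. Below is a small sketch assuming OpenAI's tiktoken package and the cl100k_base encoding; other embedding models ship their own tokenizers, so treat the specific encoding as an assumption.

```python
# Measure chunk sizes in tokens rather than characters or words.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

chunks = [
    "A short, precise chunk about a single topic.",
    "A much longer chunk that mixes several topics together " * 40,
]

for chunk in chunks:
    n_tokens = len(enc.encode(chunk))
    print(f"{n_tokens:5d} tokens | {chunk[:40]}...")
```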
Modality-Specific Chunking
Chunking isn't just for text.
- Images: Can be "chunked" by cropping relevant sub-regions (e.g., individual charts in a large infographic).
- Audio: Chunked into 30-90 second semantic segments (e.g., at pauses or speaker turns); see the transcript-segmentation sketch after this list.
- Video: Chunked by scenes or chapters.
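As a sketch of the audio case, the snippet below groups timestamped transcript utterances into roughly 60-second segments. The transcript, its timestamps, and the 60-second target are invented for illustration; real pipelines often segment on silence, speaker turns, or topic shifts instead.

```python
# Group a timestamped transcript into ~60-second "audio chunks".
transcript = [
    (0.0, "Welcome to the quarterly earnings call."),
    (22.5, "Revenue grew twelve percent year over year."),
    (58.0, "Now let's discuss the outlook for next quarter."),
    (95.0, "We expect margins to remain stable."),
    (130.0, "Finally, a few words about the new product line."),
]

def chunk_transcript(lines, target_seconds=60.0):
    """Greedily pack consecutive utterances until the window exceeds target_seconds."""
    segments, current, segment_start = [], [], None
    for timestamp, text in lines:
        if segment_start is None:
            segment_start = timestamp
        if timestamp - segment_start > target_seconds and current:
            segments.append(" ".join(current))
            current, segment_start = [], timestamp
        current.append(text)
    if current:
        segments.append(" ".join(current))
    return segments

for i, seg in enumerate(chunk_transcript(transcript)):
    print(f"segment {i}: {seg}")
```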
Impact on Retrieval Accuracy
In a typical RAG evaluation framework such as RAGAS, chunking strategy alone can account for a 20-30% swing in Faithfulness and Answer Relevancy scores.
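The exact numbers depend on the corpus and the evaluator, but the shape of such an experiment is easy to sketch. The snippet below is not a RAGAS run: the corpus, questions, and lexical-recall scorer are toy stand-ins for LLM-judged metrics, shown only to illustrate how the same evaluation loop can compare two chunking configurations.

```python
# Toy A/B comparison of two chunk sizes over a tiny corpus.
import re

corpus = (
    "The warranty covers manufacturing defects for two years. "
    "Water damage is explicitly excluded from coverage. "
    "Claims must be filed within thirty days of discovering the defect. "
    "Refunds are issued to the original payment method."
)
eval_set = [
    ("Is water damage covered?", "water damage is explicitly excluded"),
    ("How long is the warranty?", "manufacturing defects for two years"),
]

def chunk(text, size):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def tokenize(s):
    return set(re.findall(r"[a-z]+", s.lower()))

def recall_at_1(chunks, question, reference):
    """Retrieve the best chunk lexically, then score how much of the
    reference answer it contains (a crude proxy for faithfulness)."""
    best = max(chunks, key=lambda c: len(tokenize(c) & tokenize(question)))
    return len(tokenize(reference) & tokenize(best)) / len(tokenize(reference))

for size in (4, 20):
    scores = [recall_at_1(chunk(corpus, size), q, ref) for q, ref in eval_set]
    print(f"chunk size {size:3d} words -> mean score {sum(scores) / len(scores):.2f}")
```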
Key Considerations for Chunking
- Model Token Limits: Ensure each chunk fits within the embedding model's maximum input length (often 512 or 8,192 tokens).
- Metadata Association: Each chunk must carry its parent document's metadata (source, page, timestamps) so answers can be cited and filtered; the sketch after this list shows one way to attach it.
- Retrieval Latency: More chunks mean more vectors to embed, store, and search, though modern vector databases handle millions of chunks with little added latency.
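As one way to tie these considerations together, here is a sketch of a chunk record that carries its parent document's metadata and checks a token budget before indexing. The Chunk class, field names, 512-token budget, and tokens-per-word heuristic are all illustrative assumptions, not a specific library's schema.

```python
from dataclasses import dataclass, field

MAX_EMBED_TOKENS = 512          # typical limit for small embedding models (assumption)
APPROX_TOKENS_PER_WORD = 1.3    # rough heuristic; use a real tokenizer in practice

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)  # inherited from the parent document

    def approx_tokens(self) -> int:
        return int(len(self.text.split()) * APPROX_TOKENS_PER_WORD)

def make_chunks(doc_text: str, doc_metadata: dict, size_words: int = 300) -> list[Chunk]:
    """Fixed-size split that copies the parent's metadata onto every chunk
    and rejects chunks that would exceed the embedding token budget."""
    words = doc_text.split()
    chunks = []
    for i in range(0, len(words), size_words):
        chunk = Chunk(" ".join(words[i:i + size_words]), dict(doc_metadata))
        if chunk.approx_tokens() > MAX_EMBED_TOKENS:
            raise ValueError("chunk exceeds the embedding model's token budget")
        chunks.append(chunk)
    return chunks

doc_meta = {"source": "employee_handbook.pdf", "page": 12, "last_updated": "2024-01-15"}
chunks = make_chunks("Remote work requires manager approval. " * 120, doc_meta)
print(len(chunks), chunks[0].metadata["source"])
```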
Exercises
- Explain why a "Single Sentence" chunking strategy might fail for a complex legal contract.
- If you have a 10,000-page book, roughly how many 500-token chunks would you expect to generate? State your assumption about tokens per page.
- Why might you want to store "Overlapping" chunks?