
Chunking Large Documents: Slicing Data
You can't embed a whole book as one vector. Learn strategies for splitting text into meaningful chunks (Recursive, Semantic) for better retrieval.
Chunking Large Documents
If you embed a 100-page PDF as one vector, the meaning gets diluted: the vector represents the average of the whole book. When you search for a detail buried on page 42, your query won't match.
Strategy: Divide and Conquer
You must split the document into Chunks (e.g., 500 characters each).
Recursive Character Splitter
The standard approach (used in LangChain).
- Try to split by Paragraph (`\n\n`).
- If a chunk is still too big, split by Sentence (`. `).
- If it is still too big, split by Word (` `).
This preserves semantic structure.
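To make the recursion concrete, here is a minimal sketch in plain Python. It is illustrative only; `recursive_split` and `SEPARATORS` are names invented for this example, not LangChain's internals.

```python
# A minimal, illustrative sketch of the recursion -- not LangChain's actual code.
SEPARATORS = ["\n\n", ". ", " "]  # paragraph -> sentence -> word

def recursive_split(text: str, chunk_size: int = 500, depth: int = 0) -> list[str]:
    # Base case: the text already fits in one chunk.
    if len(text) <= chunk_size:
        return [text]
    # No separators left: fall back to a hard cut by character count.
    if depth == len(SEPARATORS):
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep = SEPARATORS[depth]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate            # still fits: keep accumulating
        else:
            if current:
                chunks.append(current)     # flush the accumulated chunk
            if len(piece) <= chunk_size:
                current = piece            # start a new chunk with this piece
            else:
                # This piece alone is too big: retry with the next, finer separator.
                chunks.extend(recursive_split(piece, chunk_size, depth + 1))
                current = ""
    if current:
        chunks.append(current)
    return chunks
```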
Overlap
Always include Overlap (e.g., 50 characters); see the code sketch after the example below.
- Chunk 1: "... the quick brown fox jumps"
- Chunk 2: "brown fox jumps over the dog..."
- Why: Ensures that meaning isn't cut in half at the boundary.
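In practice you rarely hand-roll this. A sketch using LangChain's `RecursiveCharacterTextSplitter`, which handles both recursion and overlap (assuming the `langchain-text-splitters` package is installed; the sample text is a toy stand-in for your loaded PDF):

```python
# Sketch, assuming: pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Toy stand-in for text loaded from a large PDF.
long_document_text = ("The quick brown fox jumps over the lazy dog. " * 30 + "\n\n") * 4

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # maximum characters per chunk
    chunk_overlap=50,   # characters shared between neighbouring chunks
    separators=["\n\n", "\n", ". ", " ", ""],  # paragraph -> line -> sentence -> word
)

chunks = splitter.split_text(long_document_text)
for i, chunk in enumerate(chunks[:3]):
    print(f"--- chunk {i} ({len(chunk)} chars) ---")
    print(chunk[:80] + "...")
```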
Summary
- Small chunks = Precise search.
- Large chunks = More context.
- Sweet Spot: 500-1000 tokens with 10% overlap.
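Note that the sweet spot is measured in tokens, not characters. A quick sketch of counting tokens with OpenAI's tiktoken library (an assumption here; any tokenizer matching your embedding model works):

```python
# Sketch, assuming: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by many OpenAI models

def token_len(text: str) -> int:
    """Number of tokens, the unit chunk sizes are usually quoted in."""
    return len(enc.encode(text))

print(token_len("The quick brown fox jumps over the lazy dog."))  # ~10 tokens
```

LangChain can also size chunks by tokens directly via `RecursiveCharacterTextSplitter.from_tiktoken_encoder(...)`.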
In the next lesson, we perform the Query and Retrieval.