Module 5 Lesson 3: Text Splitters and Chunking
Optimizing for Logic. Why we must split long documents into smaller 'Chunks' to fit within LLM context windows.
Text Splitters: The Art of Chunking
You've loaded your 50-page PDF. You can't just send all 50 pages to the model in one prompt: that's far too many tokens, and it would be far too expensive. You need to split the document into "Chunks."
1. Why Chunk?
- Context Window: Models have a limit (e.g., 128k tokens).
- Precision: If you ask about a specific clause in a contract, you only want to retrieve the 2 paragraphs related to that clause, not the entire chapter.
- Cost: Smaller inputs mean smaller bills (see the rough token estimate below).
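To make the numbers concrete, here is a back-of-the-envelope estimate. This is only a sketch: it assumes the tiktoken package is available, and the 3,000-characters-per-page figure and the sample sentence are illustrative assumptions, not measurements from a real PDF.

import tiktoken

# Hypothetical sample text standing in for real contract prose.
sample = "This Agreement is entered into by and between the parties named below. " * 40
encoding = tiktoken.get_encoding("cl100k_base")

tokens_per_char = len(encoding.encode(sample)) / len(sample)
pages, chars_per_page = 50, 3_000          # assumed average page density
estimated_tokens = int(pages * chars_per_page * tokens_per_char)
print(f"Roughly {estimated_tokens:,} tokens for a {pages}-page document")
# Even when that fits inside a 128k window, sending it all on every question
# is slow, expensive, and buries the two paragraphs you actually need.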
2. The RecursiveCharacterTextSplitter
This is the standard, most reliable splitter in LangChain. It tries to split on paragraph breaks first, then on line breaks, then on spaces, and only falls back to cutting mid-word as a last resort, so your chunks rarely break in the middle of a thought.
from langchain.text_splitter import RecursiveCharacterTextSplitter
# (In recent LangChain releases the import is
#  `from langchain_text_splitters import RecursiveCharacterTextSplitter`.)

text = "A very long document..."

# 1. Initialize
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # Max characters per chunk
    chunk_overlap=100,  # Repeat the last 100 characters of the previous chunk
)

# 2. Execute
chunks = splitter.split_text(text)
print(f"Created {len(chunks)} chunks.")
3. The Power of "Overlap"
Why do we repeat 100 characters? Context preservation. Imagine Chunk 1 ends with: "The suspect's name was..." and Chunk 2 starts with: "John Smith." Without overlap, neither chunk knows who John Smith is. With overlap, Chunk 2 also carries the tail of Chunk 1, so at least one chunk contains the complete sentence.
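You can see the overlap for yourself by splitting a short string with deliberately tiny settings and comparing neighbouring chunks. The sizes below are far smaller than you would use in production; they just make the repetition visible at a glance.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "The suspect's name was John Smith. He was last seen near the harbour on Friday evening."

splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=20)
chunks = splitter.split_text(text)

for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i}: {chunk!r}")
# Neighbouring chunks share a stretch of text at the boundary, so a sentence
# that would otherwise be cut in half survives intact in at least one chunk.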
4. Visualizing Chunks
graph TD
Raw[Huge Text: 10,000 chars] --> S[Splitter]
S --> C1[Chunk 1: 1,000 chars]
S --> C2[Chunk 2: 1,000 chars]
S --> C3[...]
C1 -- Overlap --> C2
C2 -- Overlap --> C3
5. Engineering Tip: Token-Based Splitting
Characters (letters) are not the same as tokens (the units models actually read and bill by). If you want precise control over how much of the context window each chunk consumes, use the TokenTextSplitter, which measures chunk_size in tokens rather than characters.
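A minimal sketch of token-based splitting. It assumes the tiktoken package is installed; the encoding name and chunk sizes are illustrative choices, not requirements.

from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # tokenizer used by many recent OpenAI models
    chunk_size=256,               # max tokens (not characters) per chunk
    chunk_overlap=32,             # tokens repeated between neighbouring chunks
)
chunks = splitter.split_text("A very long document...")

Because chunk_size is now counted in tokens, you know exactly how much of the model's context window each chunk will consume.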
Key Takeaways
- Chunking is mandatory for large-scale data ingestion.
- chunk_size determines the amount of information per block.
- chunk_overlap prevents loss of context at the edges.
- Recursive splitters are best because they respect the structure of human language.