Module 5 Lesson 3: Text Splitters and Chunking
Optimizing for Logic. Why we must split long documents into smaller 'Chunks' to fit within LLM context windows.
Text Splitters: The Art of Chunking
You've loaded your 50-page PDF. You can't just send all 50 pages to the model in one prompt: that's far too many tokens, and it would be far too expensive. You need to split the document into "Chunks."
1. Why Chunk?
- Context Window: Models have a limit (e.g., 128k tokens).
- Precision: If you ask about a specific clause in a contract, you only want to retrieve the 2 paragraphs related to that clause, not the entire chapter.
- Cost: Smaller inputs mean smaller bills (see the rough token estimate below).
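To make the numbers concrete, here is a back-of-the-envelope estimate. This is only a sketch: it assumes the tiktoken package is available, and the 3,000-characters-per-page figure and the sample sentence are illustrative assumptions, not measurements from a real PDF.

import tiktoken

# Hypothetical sample text standing in for real contract prose.
sample = "This Agreement is entered into by and between the parties named below. " * 40
encoding = tiktoken.get_encoding("cl100k_base")

tokens_per_char = len(encoding.encode(sample)) / len(sample)
pages, chars_per_page = 50, 3_000          # assumed average page density
estimated_tokens = int(pages * chars_per_page * tokens_per_char)
print(f"Roughly {estimated_tokens:,} tokens for a {pages}-page document")
# Even when that fits inside a 128k window, sending it all on every question
# is slow, expensive, and buries the two paragraphs you actually need.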
2. The RecursiveCharacterTextSplitter
This is the standard, most reliable splitter in LangChain. It tries to split on paragraph breaks first, then on line breaks, then on spaces, and only falls back to cutting mid-word as a last resort, so your chunks rarely break in the middle of a thought.
from langchain.text_splitter import RecursiveCharacterTextSplitter
# (In recent LangChain releases the import is
#  `from langchain_text_splitters import RecursiveCharacterTextSplitter`.)

text = "A very long document..."

# 1. Initialize
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # Max characters per chunk
    chunk_overlap=100,  # Repeat the last 100 characters of the previous chunk
)

# 2. Execute
chunks = splitter.split_text(text)
print(f"Created {len(chunks)} chunks.")
3. The Power of "Overlap"
Why do we repeat 100 characters? Context preservation. Imagine Chunk 1 ends with: "The suspect's name was..." and Chunk 2 starts with: "John Smith." Without overlap, neither chunk knows who John Smith is. With overlap, Chunk 2 also carries the tail of Chunk 1, so at least one chunk contains the complete sentence.
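You can see the overlap for yourself by splitting a short string with deliberately tiny settings and comparing neighbouring chunks. The sizes below are far smaller than you would use in production; they just make the repetition visible at a glance.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "The suspect's name was John Smith. He was last seen near the harbour on Friday evening."

splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=20)
chunks = splitter.split_text(text)

for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i}: {chunk!r}")
# Neighbouring chunks share a stretch of text at the boundary, so a sentence
# that would otherwise be cut in half survives intact in at least one chunk.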
4. Visualizing Chunks
graph TD
Raw[Huge Text: 10,000 chars] --> S[Splitter]
S --> C1[Chunk 1: 1,000 chars]
S --> C2[Chunk 2: 1,000 chars]
S --> C3[...]
C1 -- Overlap --> C2
C2 -- Overlap --> C3
5. Engineering Tip: Token-Based Splitting
Characters (letters) are not the same as tokens (the units models actually read and bill by). If you want precise control over how much of the context window each chunk consumes, use the TokenTextSplitter, which measures chunk_size in tokens rather than characters.
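A minimal sketch of token-based splitting. It assumes the tiktoken package is installed; the encoding name and chunk sizes are illustrative choices, not requirements.

from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # tokenizer used by many recent OpenAI models
    chunk_size=256,               # max tokens (not characters) per chunk
    chunk_overlap=32,             # tokens repeated between neighbouring chunks
)
chunks = splitter.split_text("A very long document...")

Because chunk_size is now counted in tokens, you know exactly how much of the model's context window each chunk will consume.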
Key Takeaways
- Chunking is mandatory for large-scale data ingestion.
- chunk_size determines the amount of information per block.
- chunk_overlap prevents loss of context at the edges.
- Recursive splitters are best because they respect the structure of human language.