
Chunking Tables and Spreadsheets
Master the art of breaking down structured data into searchable chunks for RAG pipelines.
Tables are a nightmare for standard text chunkers. If you split a CSV file at the 500th character, you can end up with half a row in one chunk and the other half in the next; both halves lose their column headers and become unsearchable.
The "Header Injection" Strategy
For structured data, every chunk must contain the column headers to maintain context.
Example
Original CSV:
Date,Product,Revenue
2023-01-01,Widget A,500
2023-01-02,Widget B,750
Poor Chunk:
2023-01-01,Widget A,500 (What is 500? Is it revenue, units sold, or weight?)
Good Chunk:
Header: Date,Product,Revenue | Data: 2023-01-01,Widget A,500
Implementation in Python
import pandas as pd

def chunk_csv(file_path, rows_per_chunk=5):
    """Split a CSV into groups of rows, injecting the column headers into each chunk."""
    df = pd.read_csv(file_path)
    header_str = ",".join(df.columns)
    chunks = []
    for i in range(0, len(df), rows_per_chunk):
        # Take the next slice of rows and serialize it without the header...
        subset = df.iloc[i:i + rows_per_chunk]
        data_str = subset.to_csv(index=False, header=False)
        # ...then prepend the header so the chunk stays self-describing.
        full_chunk = f"Header: {header_str}\nData:\n{data_str}"
        chunks.append(full_chunk)
    return chunks
Chunking Markdown Tables
If your PDF parser converts tables to Markdown, use a specialized "Table Chunker" that splits only on row boundaries and respects the `|---|---|` separator row. Never split a table in half; if a table is too large for a single chunk, convert it to a natural language summary or a "row-by-row" description.
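A minimal sketch of such a table-aware splitter, assuming a GitHub-style Markdown table where the first two lines are the header row and the `|---|---|` separator (the function name `chunk_markdown_table` is mine, not a library API):

```python
def chunk_markdown_table(md_table: str, rows_per_chunk: int = 5) -> list[str]:
    """Split a Markdown table on row boundaries only, repeating the
    header row and the |---|---| separator in every chunk."""
    lines = [ln for ln in md_table.strip().splitlines() if ln.strip()]
    # Assumption: line 1 is the header, line 2 is the separator.
    header, separator, *rows = lines
    chunks = []
    for i in range(0, len(rows), rows_per_chunk):
        body = rows[i:i + rows_per_chunk]
        chunks.append("\n".join([header, separator, *body]))
    return chunks
```

Each chunk is itself a valid Markdown table, so downstream parsers and embedding models see a complete, self-describing unit.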
The Natural Language Conversion Method
Sometimes, tables are best indexed as prose:
"On 2023-01-01, Widget A generated 500 in revenue."
This makes it much easier for the embedding model to match user queries like "How much revenue did we make in January?"
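One way to sketch this row-to-sentence conversion with pandas (the sentence template and the Date/Product/Revenue column names come from the example above; a real table would need its own per-column template):

```python
import pandas as pd

def rows_to_sentences(df: pd.DataFrame) -> list[str]:
    """Convert each row into a prose sentence for embedding.
    Assumes Date, Product, and Revenue columns as in the example."""
    return [
        f"On {row.Date}, {row.Product} generated {row.Revenue} in revenue."
        for row in df.itertuples(index=False)
    ]

df = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-02"],
    "Product": ["Widget A", "Widget B"],
    "Revenue": [500, 750],
})
print(rows_to_sentences(df)[0])
# On 2023-01-01, Widget A generated 500 in revenue.
```

Because each sentence carries its own context, a query like "How much revenue did we make in January?" can match a single sentence without needing the rest of the table.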
Exercises
- Take a spreadsheet with 20 columns.
- If you chunk it by row, will the embedding be able to find a value in Column 18?
- How would you handle a "Merged Cell" in a spreadsheet during chunking?