
Chunking Tables and Spreadsheets
Master the art of breaking down structured data into searchable chunks for RAG pipelines.
Tables are a nightmare for standard text chunkers. If you split a CSV file at the 500th character, you can end up with half a row in one chunk and the other half in the next; both halves lose their column headers and become unsearchable.
The "Header Injection" Strategy
For structured data, every chunk must contain the column headers to maintain context.
Example
Original CSV:
Date,Product,Revenue
2023-01-01,Widget A,500
2023-01-02,Widget B,750
Poor Chunk:
2023-01-01,Widget A,500 (What is 500? Is it revenue, units sold, or weight?)
Good Chunk:
Header: Date,Product,Revenue | Data: 2023-01-01,Widget A,500
Implementation in Python
import pandas as pd

def chunk_csv(file_path, rows_per_chunk=5):
    """Split a CSV into groups of rows, injecting the column headers into each chunk."""
    df = pd.read_csv(file_path)
    header_str = ",".join(df.columns)
    chunks = []
    for i in range(0, len(df), rows_per_chunk):
        # Take the next slice of rows and serialize it without the header...
        subset = df.iloc[i:i + rows_per_chunk]
        data_str = subset.to_csv(index=False, header=False)
        # ...then prepend the header so the chunk stays self-describing.
        full_chunk = f"Header: {header_str}\nData:\n{data_str}"
        chunks.append(full_chunk)
    return chunks
Chunking Markdown Tables
If your PDF parser converts tables to Markdown, use a specialized "Table Chunker" that splits only on row boundaries and respects the `|---|---|` separator row. Never split a table in half; if a table is too large for a single chunk, convert it to a natural language summary or a "row-by-row" description.
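A minimal sketch of such a table-aware splitter, assuming a GitHub-style Markdown table where the first two lines are the header row and the `|---|---|` separator (the function name `chunk_markdown_table` is mine, not a library API):

```python
def chunk_markdown_table(md_table: str, rows_per_chunk: int = 5) -> list[str]:
    """Split a Markdown table on row boundaries only, repeating the
    header row and the |---|---| separator in every chunk."""
    lines = [ln for ln in md_table.strip().splitlines() if ln.strip()]
    # Assumption: line 1 is the header, line 2 is the separator.
    header, separator, *rows = lines
    chunks = []
    for i in range(0, len(rows), rows_per_chunk):
        body = rows[i:i + rows_per_chunk]
        chunks.append("\n".join([header, separator, *body]))
    return chunks
```

Each chunk is itself a valid Markdown table, so downstream parsers and embedding models see a complete, self-describing unit.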
The Natural Language Conversion Method
Sometimes, tables are best indexed as prose:
"On 2023-01-01, Widget A generated 500 in revenue."
This makes it much easier for the embedding model to match user queries like "How much revenue did we make in January?"
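One way to sketch this row-to-sentence conversion with pandas (the sentence template and the Date/Product/Revenue column names come from the example above; a real table would need its own per-column template):

```python
import pandas as pd

def rows_to_sentences(df: pd.DataFrame) -> list[str]:
    """Convert each row into a prose sentence for embedding.
    Assumes Date, Product, and Revenue columns as in the example."""
    return [
        f"On {row.Date}, {row.Product} generated {row.Revenue} in revenue."
        for row in df.itertuples(index=False)
    ]

df = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-02"],
    "Product": ["Widget A", "Widget B"],
    "Revenue": [500, 750],
})
print(rows_to_sentences(df)[0])
# On 2023-01-01, Widget A generated 500 in revenue.
```

Because each sentence carries its own context, a query like "How much revenue did we make in January?" can match a single sentence without needing the rest of the table.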
Exercises
- Take a spreadsheet with 20 columns.
- If you chunk it by row, will the embedding be able to find a value in Column 18?
- How would you handle a "Merged Cell" in a spreadsheet during chunking?