
Data Manipulation and Preprocessing for LLMs
Beat 'Garbage In, Garbage Out': learn how to clean raw text, handle problematic encodings, and structure data for optimized RAG and fine-tuning pipelines.
The most common reason for a "dumb" LLM response is not a bad model—it's bad data. If you feed a model 10,000 words of messy HTML boilerplate, it will struggle to find the core message. As an LLM Engineer, your job is to be a Text Surgeon.
In this lesson, we will cover the essential Python techniques for cleaning, structuring, and prepping data for ingestion into LLM context windows and vector databases.
1. The "Garbage In, Garbage Out" (GIGO) Rule
LLMs have a finite context window. If 40% of your prompt is filled with `\n\n`, `<div>` tags, or `[Author: John Doe, 2021]`, you are wasting money and reducing the model's accuracy.
Common Data Pollutants:
- Whitespace: Excessive newlines and tabs.
- Metadata: Ad blocks, nav bars, and footer text from web scrapes.
- Encodings: Non-UTF8 characters that can cause the tokenizer to fail.
- Duplicates: Repeating the same information 5 times in a prompt confuses the model's attention mechanism.
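Two of these pollutants, bad encodings and duplicates, can be handled with the standard library alone. A minimal sketch (the helper names are illustrative, not from any specific library):

```python
import unicodedata

def fix_encoding(raw_bytes: bytes) -> str:
    # Decode defensively: replace undecodable bytes instead of crashing the tokenizer
    text = raw_bytes.decode("utf-8", errors="replace")
    # Normalize Unicode so visually identical characters share one code point
    return unicodedata.normalize("NFKC", text)

def dedupe_lines(text: str) -> str:
    # Drop exact repeated lines while preserving the original order
    seen = set()
    unique = []
    for line in text.splitlines():
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return "\n".join(unique)

print(dedupe_lines("Price: $10\nPrice: $10\nShipping: free"))
```

This only catches exact duplicates; near-duplicates (the same fact phrased two ways) need fuzzier techniques such as hashing normalized sentences.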
2. Text Cleaning Strategies with Python
Level 1: Basic Regex Cleaning
Regex (Regular Expressions) is the LLM Engineer's best friend for stripping unwanted patterns.
```python
import re

def clean_scraped_text(text: str) -> str:
    # 1. Remove HTML tags (if any)
    text = re.sub(r'<[^>]+>', '', text)
    # 2. Replace runs of newlines with a single newline
    text = re.sub(r'\n+', '\n', text)
    # 3. Strip leading/trailing whitespace
    return text.strip()

raw_mess = "<div>Hello World</div> \n\n\n Welcome to AI."
print(f"Cleaned: '{clean_scraped_text(raw_mess)}'")
```
3. Working with Unstructured Data: Markdown is King
LLMs are trained heavily on code and documentation. Consequently, they are much better at reasoning over Markdown than they are over plain text or raw HTML.
The Standard Industry Pipeline:
- Fetch Raw Data (HTML/PDF/JSON).
- Convert to Markdown.
- Chunk the Markdown.
- Send to Vector DB.
```mermaid
graph LR
    A[Raw Data] --> B[Parser: BeautifulSoup/PyPDF]
    B --> C[Structure: Markdown]
    C --> D[Chunker]
    D --> E[Vector DB]
```
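Step 3 of the pipeline, chunking, can be as simple as splitting on heading boundaries so each chunk keeps its own title. A minimal sketch, assuming the input is already clean Markdown:

```python
import re

def chunk_markdown(md: str) -> list[str]:
    # Split just before each heading line (#, ##, ...) so every
    # chunk starts with the heading that describes it
    sections = re.split(r'(?m)^(?=#{1,6} )', md)
    return [s.strip() for s in sections if s.strip()]

doc = "# Intro\nWelcome.\n\n## Setup\nInstall it.\n\n## Usage\nRun it."
for chunk in chunk_markdown(doc):
    print(repr(chunk))
```

Production chunkers also enforce a maximum token length and overlap adjacent chunks, but heading-aware splitting is the foundation.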
Why Markdown?
- Hierarchy: Headings (`#`, `##`) tell the model what's important.
- Lists: Bullet points help the model separate distinct facts.
- Emphasis: Bold text can be used to highlight key entities.
4. Normalizing Tabular Data for LLMs
If you have a CSV or JSON of customer data, how do you give it to the LLM?
- Bad: Send raw JSON: `{"id": 1, "name": "Bob", "spend": 500}`.
- Better: Convert to natural language: "Customer Bob (ID: 1) has spent 500 dollars."
Using Pandas for AI Prep
Pandas is essential for transforming large datasets before they hit the LLM.
```python
import pandas as pd

# Load sample data
df = pd.DataFrame([
    {"name": "Alice", "city": "NY", "last_purchase": "Laptop"},
    {"name": "Bob", "city": "SF", "last_purchase": "Phone"}
])

# Professional normalization: convert each row to a human-readable string for the LLM
df['narrative'] = df.apply(
    lambda x: f"Customer {x['name']} is from {x['city']} and recently bought a {x['last_purchase']}.",
    axis=1
)
knowledge_base = "\n".join(df['narrative'].tolist())
print(knowledge_base)
```
5. Metadata Enrichment
When preprocessing data for a RAG system, don't just store the text. Store the Contextual Metadata.
Why store metadata?
- Filtering: "Show me documents only from 2024."
- Source Attribution: "According to the User Manual (Page 45)..."
- Permissioning: "User A is not allowed to see 'Financial Records'."
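In practice this means storing each chunk as text plus a metadata dictionary. A minimal sketch (the field names here are illustrative, not a vector-DB standard):

```python
def enrich_chunk(text: str, source: str, page: int,
                 year: int, access_group: str) -> dict:
    # Bundle the raw text with the context needed for filtering,
    # source attribution, and permission checks at query time
    return {
        "text": text,
        "metadata": {
            "source": source,             # "According to the User Manual (Page 45)..."
            "page": page,
            "year": year,                 # "Show me documents only from 2024."
            "access_group": access_group, # hide 'Financial Records' from User A
        },
    }

chunk = enrich_chunk("Reset the device by...", "User Manual", 45, 2024, "support")
print(chunk["metadata"]["source"])
```

Most vector databases accept a payload like this directly and let you filter on the metadata fields at query time.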
Summary
- Clean your text: Strip whitespace and junk characters before tokenizing.
- Use Markdown: It is the native language of LLM reasoning.
- Normalize Tabular Data: Convert rows/columns into sentences.
- Enrich with Metadata: This makes your RAG system "Searchable" and "Auditable."
In the next lesson, we will look at Best Practices for Scalable Python, teaching you how to organize this cleaning and processing code so it doesn't become a "spaghetti" mess as your project grows.
Exercise: The Document Cleaner
You are working with a messy OCR (Optical Character Recognition) snippet from a scanned receipt:
" TOTA L : $45 . 0 0 \n Date : 1 2 / 0 1 / 2 5 \n \n"
Task:
- Write a Python function that uses regex to fix the spacing in "TOTA L" and "1 2 / 0 1".
- Convert the snippet into a clean dictionary: `{"total": 45.0, "date": "12/01/25"}`.
Tip: Use re.sub(r'\s+', '', text) to remove ALL internal spaces if necessary, but be careful not to merge separate words!
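If you get stuck, here is one possible approach (a sketch that assumes the snippet follows the exact format shown above; the function name is illustrative):

```python
import re

def parse_receipt(snippet: str) -> dict:
    # Collapse ALL whitespace: "TOTA L" -> "TOTAL", "1 2 / 0 1" -> "12/01"
    compact = re.sub(r'\s+', '', snippet)
    total = re.search(r'TOTAL:\$([\d.]+)', compact)
    date = re.search(r'Date:([\d/]+)', compact)
    return {
        "total": float(total.group(1)) if total else None,
        "date": date.group(1) if date else None,
    }

raw = "  TOTA L : $45 . 0 0 \n Date : 1 2 / 0 1 / 2 5 \n \n"
print(parse_receipt(raw))  # {'total': 45.0, 'date': '12/01/25'}
```

Note that collapsing all whitespace is safe here only because this snippet has no multi-word fields; on free-form text it would merge separate words, which is exactly what the tip warns about.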