
Input Sanitization: Pre-Token Cleaning
Learn how to strip noise from user inputs before they hit the LLM. Master the techniques for cleaning HTML, Markdown, and redundant whitespace.
User input is often messy. A user might copy-paste an entire website into your chat, including 5,000 tokens of cookie-consent banners, navigation menus, and social-media footers. If you send that raw data to an LLM, you are paying for pure noise.
Input Sanitization is the practice of cleaning the input before it gets tokenized.
In this lesson, we learn how to use BeautifulSoup, regular expressions, and Markdown stripping to reduce input noise by up to 90%.
1. The HTML "Extraction" Tax
If you are running a RAG system on web data, HTML is your enemy.
- Raw HTML: 2,000 tokens (tags, scripts, styles).
- Cleaned Text: 200 tokens (The actual content).
The "Clean" Hierarchy:
- Tier 1: Strip
<script>and<style>tags. (Mandatory). - Tier 2: Convert to Markdown. (Removes
<div>and<span>bloat). - Tier 3: Identify and remove "Boilerplate" (Headers/Footers).
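For Tiers 1 and 2, here is a minimal sketch. It assumes the markdownify package is installed (pip install markdownify); any HTML-to-Markdown converter works the same way.

from bs4 import BeautifulSoup
from markdownify import markdownify as md  # assumed dependency

def html_to_markdown(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    # Tier 1: remove script and style tags entirely
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Tier 2: convert the remaining HTML to Markdown, which drops
    # <div>/<span> wrappers but keeps headings, links, and emphasis
    return md(str(soup))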
2. Removing "Whitespace Bloat"
LLMs treat spaces and newlines as tokens.
"Hello World" uses more tokens than "Hello World".
Across 1 million requests, Extra Newlines can cost you thousands of dollars.
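You can verify this yourself. A minimal sketch, assuming the tiktoken package and the cl100k_base encoding (exact counts vary by model):

import tiktoken  # assumed dependency

enc = tiktoken.get_encoding("cl100k_base")

clean = "Hello World"
bloated = "Hello   \n\n\n   World"  # same words, padded with extra spaces and newlines

print(len(enc.encode(clean)))    # 2 tokens
print(len(enc.encode(bloated)))  # several more tokens, all of them noise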
3. Implementation: The Content Sanitizer (Python)
Python Code: Stripping the Noise
from bs4 import BeautifulSoup
import re

def sanitize_user_input(raw_html):
    soup = BeautifulSoup(raw_html, "html.parser")

    # 1. Remove non-content tags
    for s in soup(['script', 'style', 'nav', 'footer', 'header']):
        s.decompose()

    text = soup.get_text()

    # 2. Collapse whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # 3. Truncate repeating characters (e.g. users holding down a key)
    text = re.sub(r'(.)\1{10,}', r'\1', text)

    return text
Token Saving: For a standard news article, this script can reduce the token count from roughly 4,000 to 800.
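A quick, illustrative check of the function on a pasted snippet (the HTML string is made up):

raw = "<html><body><nav>Home | About</nav><p>Breaking   news:   markets   rally.</p><footer>Cookie notice</footer></body></html>"
print(sanitize_user_input(raw))
# -> "Breaking news: markets rally."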
4. Normalizing Special Characters
Special characters (Smart Quotes, Emojis, Unicode variations) often tokenize into weird, multi-token sequences.
A smart quote (“) might be 2 tokens, while a standard quote (") is 1 token.
By normalizing your input to standard ASCII/UTF-8 before sending it to the LLM, you save 5-10% on your total token bill and improve the model's accuracy (it sees less noise).
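A minimal sketch of that normalization, using Python's built-in unicodedata module for Unicode variants plus a small map for the smart punctuation that NFKC does not fold to ASCII (the exact savings depend on your tokenizer):

import unicodedata

# Smart punctuation that NFKC normalization leaves untouched
PUNCT_MAP = {
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes / apostrophes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
}

def normalize_text(text: str) -> str:
    # Fold full-width characters, ligatures, etc. into standard forms
    text = unicodedata.normalize("NFKC", text)
    for fancy, plain in PUNCT_MAP.items():
        text = text.replace(fancy, plain)
    return text

print(normalize_text("\u201cHello\u201d \u2013 it\u2019s fine"))
# -> "Hello" - it's fine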
5. Summary and Key Takeaways
- Delete HTML entirely: Use BeautifulSoup to extract text first.
- Collapse Whitespace: Replace all multi-space/newline sequences with single ones.
- Boilerplate Removal: Use heuristics or libraries (like trafilatura) to find the main content and ignore the rest (see the sketch after this list).
- Normalize Punctuation: Standardize quotes and symbols to minimize multi-token patterns.
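For the boilerplate bullet, a minimal sketch assuming the trafilatura package is installed (the URL is a placeholder):

import trafilatura  # assumed dependency

# Download the page and extract only the main article body,
# dropping navigation, ads, and comment widgets
downloaded = trafilatura.fetch_url("https://example.com/some-article")
main_text = trafilatura.extract(downloaded)  # returns None if extraction fails
print(main_text)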
In the next lesson, Handling Recursive Attacks, we look at how to stop an agent from "talking to itself" until your bank account is empty.
Exercise: The Webpage Squeeze
- Pick a news article URL.
- Step 1: Use requests.get() to fetch the raw HTML. Count the tokens.
- Step 2: Use BeautifulSoup to get the body text. Count the tokens.
- Step 3: Use the re.sub(r'\s+', ' ', text) trick. Count the tokens.
- Compare the numbers.
- Most students find Step 3 is 80% smaller than Step 1.
- Calculate: If you crawl 1,000 pages a day, how much did you save by running that 5-line Python cleaning script?
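A minimal sketch of the whole exercise, assuming the requests, beautifulsoup4, and tiktoken packages and a placeholder URL:

import re
import requests
import tiktoken
from bs4 import BeautifulSoup

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    # Treat special-token strings as plain text so raw HTML never raises
    return len(enc.encode(text, disallowed_special=()))

# Step 1: raw HTML
raw_html = requests.get("https://example.com/news-article").text
print("Step 1 (raw HTML):", count_tokens(raw_html))

# Step 2: extracted body text
soup = BeautifulSoup(raw_html, "html.parser")
body_text = soup.body.get_text() if soup.body else soup.get_text()
print("Step 2 (body text):", count_tokens(body_text))

# Step 3: collapsed whitespace
clean_text = re.sub(r'\s+', ' ', body_text).strip()
print("Step 3 (cleaned):", count_tokens(clean_text))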