
Input Sanitization: Pre-Token Cleaning
Learn how to strip noise from user inputs before they hit the LLM. Master the techniques for cleaning HTML, Markdown, and redundant whitespace.
User input is often messy. A user might copy-paste an entire website into your chat, including 5,000 tokens of cookie-consent banners, navigation menus, and social-media footers. If you send that raw data to an LLM, you are paying for pure noise.
Input Sanitization is the practice of cleaning the input before it gets tokenized.
In this lesson, we learn how to use BeautifulSoup, regular expressions, and Markdown stripping to reduce input noise by up to 90%.
1. The HTML "Extraction" Tax
If you are running a RAG system on web data, HTML is your enemy.
- Raw HTML: 2,000 tokens (tags, scripts, styles).
- Cleaned Text: 200 tokens (The actual content).
The "Clean" Hierarchy:
- Tier 1: Strip
<script>and<style>tags. (Mandatory). - Tier 2: Convert to Markdown. (Removes
<div>and<span>bloat). - Tier 3: Identify and remove "Boilerplate" (Headers/Footers).
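For Tiers 1 and 2, here is a minimal sketch. It assumes the markdownify package is installed (pip install markdownify); any HTML-to-Markdown converter works the same way.

from bs4 import BeautifulSoup
from markdownify import markdownify as md  # assumed dependency

def html_to_markdown(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    # Tier 1: remove script and style tags entirely
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Tier 2: convert the remaining HTML to Markdown, which drops
    # <div>/<span> wrappers but keeps headings, links, and emphasis
    return md(str(soup))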
2. Removing "Whitespace Bloat"
LLMs treat spaces and newlines as tokens.
"Hello World" uses more tokens than "Hello World".
Across 1 million requests, Extra Newlines can cost you thousands of dollars.
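You can verify this yourself. A minimal sketch, assuming the tiktoken package and the cl100k_base encoding (exact counts vary by model):

import tiktoken  # assumed dependency

enc = tiktoken.get_encoding("cl100k_base")

clean = "Hello World"
bloated = "Hello   \n\n\n   World"  # same words, padded with extra spaces and newlines

print(len(enc.encode(clean)))    # 2 tokens
print(len(enc.encode(bloated)))  # several more tokens, all of them noise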
3. Implementation: The Content Sanitizer (Python)
Python Code: Stripping the Noise
from bs4 import BeautifulSoup
import re

def sanitize_user_input(raw_html):
    soup = BeautifulSoup(raw_html, "html.parser")

    # 1. Remove non-content tags
    for s in soup(['script', 'style', 'nav', 'footer', 'header']):
        s.decompose()

    text = soup.get_text()

    # 2. Collapse whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # 3. Truncate repeating characters (e.g. users holding down a key)
    text = re.sub(r'(.)\1{10,}', r'\1', text)

    return text
Token Saving: For a standard news article, this script can reduce the token count from roughly 4,000 to 800.
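A quick, illustrative check of the function on a pasted snippet (the HTML string is made up):

raw = "<html><body><nav>Home | About</nav><p>Breaking   news:   markets   rally.</p><footer>Cookie notice</footer></body></html>"
print(sanitize_user_input(raw))
# -> "Breaking news: markets rally."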
4. Normalizing Special Characters
Special characters (Smart Quotes, Emojis, Unicode variations) often tokenize into weird, multi-token sequences.
A smart quote (“) might be 2 tokens, while a standard quote (") is 1 token.
By normalizing your input to standard ASCII/UTF-8 before sending it to the LLM, you save 5-10% on your total token bill and improve the model's accuracy (it sees less noise).
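A minimal sketch of that normalization, using Python's built-in unicodedata module for Unicode variants plus a small map for the smart punctuation that NFKC does not fold to ASCII (the exact savings depend on your tokenizer):

import unicodedata

# Smart punctuation that NFKC normalization leaves untouched
PUNCT_MAP = {
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes / apostrophes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
}

def normalize_text(text: str) -> str:
    # Fold full-width characters, ligatures, etc. into standard forms
    text = unicodedata.normalize("NFKC", text)
    for fancy, plain in PUNCT_MAP.items():
        text = text.replace(fancy, plain)
    return text

print(normalize_text("\u201cHello\u201d \u2013 it\u2019s fine"))
# -> "Hello" - it's fine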
5. Summary and Key Takeaways
- Delete HTML entirely: Use BeautifulSoup to extract text first.
- Collapse Whitespace: Replace all multi-space/newline sequences with single ones.
- Boilerplate Removal: Use heuristics or libraries (like trafilatura) to find the main content and ignore the rest (see the sketch after this list).
- Normalize Punctuation: Standardize quotes and symbols to minimize multi-token patterns.
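For the boilerplate bullet, a minimal sketch assuming the trafilatura package is installed (the URL is a placeholder):

import trafilatura  # assumed dependency

# Download the page and extract only the main article body,
# dropping navigation, ads, and comment widgets
downloaded = trafilatura.fetch_url("https://example.com/some-article")
main_text = trafilatura.extract(downloaded)  # returns None if extraction fails
print(main_text)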
In the next lesson, Handling Recursive Attacks, we look at how to stop an agent from "talking to itself" until your bank account is empty.
Exercise: The Webpage Squeeze
- Pick a news article URL.
- Step 1: Use requests.get() to fetch the raw HTML. Count the tokens.
- Step 2: Use BeautifulSoup to get the body text. Count the tokens.
- Step 3: Use the re.sub(r'\s+', ' ', text) trick. Count the tokens.
- Compare the numbers.
- Most students find Step 3 is 80% smaller than Step 1.
- Calculate: If you crawl 1,000 pages a day, how much did you save by running that 5-line Python cleaning script?
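A minimal sketch of the whole exercise, assuming the requests, beautifulsoup4, and tiktoken packages and a placeholder URL:

import re
import requests
import tiktoken
from bs4 import BeautifulSoup

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    # Treat special-token strings as plain text so raw HTML never raises
    return len(enc.encode(text, disallowed_special=()))

# Step 1: raw HTML
raw_html = requests.get("https://example.com/news-article").text
print("Step 1 (raw HTML):", count_tokens(raw_html))

# Step 2: extracted body text
soup = BeautifulSoup(raw_html, "html.parser")
body_text = soup.body.get_text() if soup.body else soup.get_text()
print("Step 2 (body text):", count_tokens(body_text))

# Step 3: collapsed whitespace
clean_text = re.sub(r'\s+', ' ', body_text).strip()
print("Step 3 (cleaned):", count_tokens(clean_text))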