The Polish: Data Cleansing, Normalization, and Indexing

High-quality retrieval starts with high-quality data. Master the art of stripping noise, normalizing text, and preparing indices for optimal RAG performance.

Garbage In, Garbage Out

In the previous lesson, we built the "pipes" (ETL). In this lesson, we look at the "filter." If you feed a Foundation Model raw HTML, documents with footer boilerplate on every page, or inconsistent character encodings, the model's performance will degrade significantly.

For the AWS Certified Generative AI Developer – Professional exam, you must demonstrate competence in preparing data specifically for Vector Indexing. This process is known as Cleansing and Normalization.


1. Why Cleansing Matters for Vectors

When we index data for AI, we convert text into Embeddings (numbers that represent meaning).

  • If a document contains "Confidential - Do Not Share" on the bottom of every page, that noise gets embedded.
  • When a user asks a question, the AI might retrieve the "Do Not Share" footer instead of the actual content because it appears so frequently.

The Objective:

Strip everything that doesn't contribute to the Semantic Meaning of the content.


2. Common Cleansing Techniques

As a Professional Developer, you should automate these steps inside an AWS Glue job or Lambda function:

Task | Description | Tool/Method
Boilerplate Removal | Removing headers, footers, and page numbers. | Regular Expressions (Regex) or layout-aware parsers
Noise Reduction | Stripping HTML tags, CSS, or JavaScript from web scrapes. | Beautiful Soup (Python) or AWS Glue
Deduplication | Removing identical or near-identical paragraphs. | MinHash or simple hash-based checks
Language Detection | Filtering out documents that aren't in the target language. | Amazon Comprehend
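
The Deduplication row is the easiest one to get wrong at scale. Below is a minimal hash-based sketch (the deduplicate_paragraphs helper and the assumption that text is already split into paragraphs are illustrative); catching near-duplicates rather than exact ones would require MinHash, e.g. via the datasketch library.

import hashlib

def deduplicate_paragraphs(paragraphs):
    # Keep only the first occurrence of each paragraph, keyed by a hash
    # of its normalized form (lowercased, whitespace collapsed).
    seen = set()
    unique = []
    for para in paragraphs:
        normalized = " ".join(para.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(para)
    return unique

# The repeated footer boilerplate survives only once:
paragraphs = [
    "Q3 revenue grew 12% year over year.",
    "Company Confidential - Internal Use Only",
    "Company  confidential - internal use only",
]
print(deduplicate_paragraphs(paragraphs))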

3. Normalization: Consistency is Key

Normalization ensures that the AI doesn't see "Apple," "apple," and "APPLE" as completely different concepts when the index or tokenizer is case-sensitive.

  • Character Encoding: Always convert to UTF-8. Mis-encoded characters (e.g., the "�" replacement character left behind by a bad conversion) can cause embedding models to fail.
  • Date Formatting: Convert "Jan 1st, 2024" and "01/01/24" into a standard format (e.g., ISO 8601) so the AI can reason about time accurately.
  • Case Folding: In some search scenarios, lowercasing everything helps, though modern LLMs and embedding models are generally robust to casing differences.
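
Date formatting can be automated with the standard library alone. A minimal sketch follows; the list of accepted input formats is an assumption and should be extended to match whatever actually appears in your corpus.

import re
from datetime import datetime

def normalize_date(raw):
    # Strip ordinal suffixes ("1st" -> "1") so strptime can parse them.
    cleaned = re.sub(r"(\d)(st|nd|rd|th)\b", r"\1", raw.strip())
    # Try a few expected input formats and emit ISO 8601 on success.
    for fmt in ("%b %d, %Y", "%B %d, %Y", "%m/%d/%y", "%Y-%m-%d"):
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return raw  # leave unparseable dates untouched rather than guessing

print(normalize_date("Jan 1st, 2024"))  # 2024-01-01
print(normalize_date("01/01/24"))       # 2024-01-01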

4. The Indexing Process: From Text to Searchable Knowledge

In a traditional database (SQL), you index by "ID" or "Name." In GenAI, we index by Vector.

graph TD
    A[Cleaned Markdown Text] --> B[Chunking Engine]
    B --> C[Embedding Model: e.g. Titan v2]
    C --> D[Vector Index: OpenSearch Serverless]
    D --> E[Metadata Store: CreatedDate, SourceURL]
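
To make the embedding step of that flow concrete, here is a minimal boto3 sketch. The Titan v2 model ID ("amazon.titan-embed-text-v2:0") and the "embedding" response field reflect the public Titan Text Embeddings v2 interface; verify both for your account and region before relying on them.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def embed_chunk(chunk_text):
    # One call per chunk: send the text, get back a vector of floats
    # that the vector index (e.g. OpenSearch Serverless) can store.
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": chunk_text}),
        contentType="application/json",
        accept="application/json",
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]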

Pro-Tip: Metadata Injection

During the indexing phase, you should attach Metadata to every chunk.

  • source_file_id: So the user can click a link to the original PDF.
  • access_level: To ensure "HR Data" isn't retrieved for a "Sales" user.
  • last_updated: To allow the search engine to prioritize fresh results.
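
As a sketch, a single indexed chunk could be assembled like the dictionary below. The three metadata keys come from the list above; "chunk_text" and "embedding" are illustrative field names, and the actual schema depends on your vector store.

from datetime import datetime, timezone

def build_chunk_document(chunk_text, vector, source_file_id, access_level):
    # One document per chunk: the vector drives similarity search,
    # while the metadata fields drive filtering and source citations.
    return {
        "chunk_text": chunk_text,
        "embedding": vector,
        "source_file_id": source_file_id,
        "access_level": access_level,
        "last_updated": datetime.now(timezone.utc).isoformat(),
    }

# The resulting document is then upserted into the vector index, e.g. with
# the opensearch-py client's index() call, or handled for you by a
# Bedrock Knowledge Base sync.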

5. Amazon Bedrock Knowledge Bases "Sync"

AWS makes this easier with the Knowledge Base feature. When you "Sync" a data source, Bedrock automatically:

  1. Crawls the S3 bucket.
  2. Extracts text using a managed service.
  3. Chunks the text according to your settings (e.g., fixed-size with overlap, hierarchical, or semantic).
  4. Calls the Embedding model.
  5. Upserts (Updates/Inserts) the vectors into your vector store (e.g., Pinecone or OpenSearch).
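
You can also trigger the same sync programmatically, for example from the job that finishes your cleansing step. A minimal boto3 sketch, with placeholder IDs:

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Kick off an ingestion job (the API name for a "Sync") for one data source.
# Both IDs are placeholders; look them up in the Bedrock console or via the API.
response = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="KB_ID_PLACEHOLDER",
    dataSourceId="DS_ID_PLACEHOLDER",
    description="Nightly sync after the cleansing job completes",
)
print(response["ingestionJob"]["status"])  # e.g. STARTING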

6. Code Example: Normalizing Text in Python

import re
import unicodedata

def clean_and_normalize(text):
    # 1. Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    
    # 2. Normalize Unicode: decompose accented characters (NFKD) and drop
    #    the resulting non-ASCII combining marks, e.g. "Nécessité" -> "Necessite"
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    
    # 3. Strip excessive whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    # 4. Remove known boilerplate (Generic example)
    text = text.replace("Company Confidential - Internal Use Only", "")
    
    return text

# Example
raw_input = "<html><body>  Nécessité de l'AI...   </body></html>"
print(f"Cleaned: '{clean_and_normalize(raw_input)}'")

Knowledge Check: Test Your Cleansing Knowledge

A developer is building a RAG system for a legal firm. The firm's documents contain sensitive 'Privileged' watermarks on every page. Why is it important to remove these watermarks during the data cleansing phase before indexing?


Summary

You've cleaned, normalized, and understood how to index. But where do these vectors actually live? In the next lesson, we will deep dive into Vector Stores and Embeddings, focusing on Amazon OpenSearch Service and how to choose the right embedding engine.


Next Lesson: The Memory of AI: Vector Stores and Embeddings
