
The Polish: Data Cleansing, Normalization, and Indexing
High-quality retrieval starts with high-quality data. Master the art of stripping noise, normalizing text, and preparing indices for optimal RAG performance.
Garbage In, Garbage Out
In the previous lesson, we built the "pipes" (ETL). In this lesson, we look at the "filter." If you feed a Foundation Model raw HTML, documents with footer boilerplate on every page, or inconsistent character encodings, the model's performance will degrade significantly.
For the AWS Certified Generative AI Developer – Professional exam, you must demonstrate competence in preparing data specifically for Vector Indexing. This process is known as Cleaning and Normalization.
1. Why Cleansing Matters for Vectors
When we index data for AI, we convert text into Embeddings (numbers that represent meaning).
- If a document contains "Confidential - Do Not Share" at the bottom of every page, that noise gets embedded.
- When a user asks a question, the AI might retrieve the "Do Not Share" footer instead of the actual content because it appears so frequently.
The Objective:
Strip everything that doesn't contribute to the Semantic Meaning of the content.
2. Common Cleansing Techniques
As a Professional Developer, you should automate these steps inside an AWS Glue job or Lambda function (a minimal deduplication sketch follows the table):
| Task | Description | Tool/Method |
|---|---|---|
| Boilerplate Removal | Removing headers, footers, and page numbers. | Regular Expressions (Regex) or Layout-aware parsers. |
| Noise Reduction | Stripping HTML tags, CSS, or JavaScript from web scrapes. | Beautiful Soup (Python) or AWS Glue. |
| Deduplication | Removing identical or near-identical paragraphs. | MinHash or simple Hash-based checks. |
| Language Detection | Filtering out documents that aren't in the target language. | Amazon Comprehend. |
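For the deduplication row, a minimal hash-based sketch needs nothing beyond the Python standard library. The whitespace collapsing and the SHA-256 digest here are illustrative choices, not a prescribed implementation; catching near-identical (rather than exact-duplicate) paragraphs would require MinHash or a similar technique:

```python
import hashlib

def dedupe_paragraphs(paragraphs):
    """Drop exact-duplicate paragraphs, e.g. a footer repeated on every page."""
    seen_hashes = set()
    unique = []
    for para in paragraphs:
        # Collapse whitespace so trivially different copies hash identically
        key = hashlib.sha256(" ".join(para.split()).encode("utf-8")).hexdigest()
        if key not in seen_hashes:
            seen_hashes.add(key)
            unique.append(para)
    return unique

paras = ["Q3 revenue grew 12%.", "Confidential - Do Not Share", "Confidential  -  Do Not Share"]
print(dedupe_paragraphs(paras))  # The repeated footer survives only once
```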
3. Normalization: Consistency is Key
Normalization ensures that the AI doesn't see "Apple," "apple," and "APPLE" as completely different concepts in a case-sensitive context.
- Character Encoding: Always convert to UTF-8. Misencoded characters (like "�") can cause embedding models to fail.
- Date Formatting: Convert "Jan 1st, 2024" and "01/01/24" into a standard format (e.g., ISO 8601) so the AI can reason about time accurately (see the sketch after this list).
- Case Folding: In some search scenarios, lowercasing everything helps, though modern LLMs are often "case-smart."
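To make the date formatting concrete, here is a minimal sketch assuming the third-party python-dateutil package is available (the helper name `to_iso_date` is our own, not a library function):

```python
from dateutil import parser  # third-party: pip install python-dateutil

def to_iso_date(raw_date: str) -> str:
    """Parse a free-form date string and return it in ISO 8601 form."""
    return parser.parse(raw_date).date().isoformat()

print(to_iso_date("Jan 1st, 2024"))  # 2024-01-01
print(to_iso_date("01/01/24"))       # 2024-01-01
```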
4. The Indexing Process: From Text to Searchable Knowledge
In a traditional database (SQL), you index by "ID" or "Name." In GenAI, we index by Vector.
```mermaid
graph TD
    A["Cleaned Markdown Text"] --> B["Chunking Engine"]
    B --> C["Embedding Model: e.g. Titan v2"]
    C --> D["Vector Index: OpenSearch Serverless"]
    D --> E["Metadata Store: CreatedDate, SourceURL"]
```
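The "Chunking Engine" stage above can be as simple as a fixed-size splitter with overlap. A toy sketch follows; the 500-character size and 50-character overlap are arbitrary assumptions, not recommended values:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split cleaned text into fixed-size chunks; the overlap means text
    straddling a chunk boundary appears in both neighboring chunks."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "word " * 300              # ~1500 characters of cleaned text
chunks = chunk_text(sample)
print(len(chunks), len(chunks[0]))  # 4 chunks, the first one 500 characters
```

Production chunkers typically count tokens rather than characters, but the overlap principle is identical.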
Pro-Tip: Metadata Injection
During the indexing phase, you should attach Metadata to every chunk (a sketch of such a chunk document follows this list):
- source_file_id: So the user can click a link to the original PDF.
- access_level: To ensure "HR Data" isn't retrieved for a "Sales" user.
- last_updated: To allow the search engine to prioritize fresh results.
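As a minimal sketch, a single indexed chunk with that metadata attached might look like the dictionary below. The field names and the SHA-256 id scheme are assumptions for illustration, not a Bedrock or OpenSearch schema:

```python
import hashlib

def build_chunk_document(chunk_text, embedding, source_file_id, access_level, last_updated):
    """Bundle a chunk's vector with the metadata used for linking, filtering, and freshness."""
    return {
        # Deterministic id: re-syncing the same chunk upserts rather than duplicates
        "id": hashlib.sha256(chunk_text.encode("utf-8")).hexdigest(),
        "vector": embedding,  # e.g. the floats returned by the embedding model
        "text": chunk_text,
        "metadata": {
            "source_file_id": source_file_id,  # link back to the original PDF
            "access_level": access_level,      # filter at query time ("hr" vs "sales")
            "last_updated": last_updated,      # lets ranking prioritize fresh results
        },
    }

doc = build_chunk_document("Refund policy: 30 days...", [0.12, -0.03],
                           "s3://docs/policy.pdf", "sales", "2024-01-01")
```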
5. Amazon Bedrock Knowledge Bases "Sync"
AWS makes this easier with the Knowledge Base feature. When you "Sync" a data source (a sync can also be triggered from code, as sketched after this list), Bedrock automatically:
- Crawls the S3 bucket.
- Extracts text using a managed service.
- Chunks the text according to your settings (fixed-size with overlap, hierarchical, or semantic).
- Calls the Embedding model.
- Upserts (Updates/Inserts) the vectors into your vector store (e.g., Pinecone or OpenSearch).
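A minimal sketch of triggering that sync with boto3's bedrock-agent client; the knowledge base and data source IDs are placeholders you would replace with your own:

```python
import boto3

KNOWLEDGE_BASE_ID = "YOUR_KB_ID"  # placeholder
DATA_SOURCE_ID = "YOUR_DS_ID"     # placeholder

client = boto3.client("bedrock-agent")

# Kicks off the crawl -> extract -> chunk -> embed -> upsert pipeline described above
response = client.start_ingestion_job(
    knowledgeBaseId=KNOWLEDGE_BASE_ID,
    dataSourceId=DATA_SOURCE_ID,
)
print(response["ingestionJob"]["status"])  # e.g. "STARTING"
```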
6. Code Example: Normalizing Text in Python
```python
import re
import unicodedata

def clean_and_normalize(text):
    # 1. Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # 2. Remove known boilerplate before collapsing whitespace (generic example)
    text = text.replace("Company Confidential - Internal Use Only", "")
    # 3. Normalize Unicode: NFKD splits accented characters apart, and the
    #    ASCII encode/decode round-trip drops the accents
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    # 4. Collapse excessive whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Example
raw_text = "<html><body> Nécessité de l'AI... </body></html>"
print(f"Cleaned: '{clean_and_normalize(raw_text)}'")
# Cleaned: 'Necessite de l'AI...'
```
Knowledge Check: Test Your Cleansing Knowledge
A developer is building a RAG system for a legal firm. The firm's documents contain sensitive 'Privileged' watermarks on every page. Why is it important to remove these watermarks during the data cleansing phase before indexing?
Summary
You've cleaned, normalized, and understood how to index. But where do these vectors actually live? In the next lesson, we will take a deep dive into Vector Stores and Embeddings, focusing on Amazon OpenSearch Service and how to choose the right embedding engine.
Next Lesson: The Memory of AI: Vector Stores and Embeddings