Module 7 Lesson 2: Chunking and Embeddings

Slicing the data: understanding how Bedrock breaks documents into 'Chunks' and turns them into 'Vectors' of meaning.

Preparing the Bits: Chunking & Embeddings

You cannot feed a 500-page PDF to an embedding model in one go, and a single giant record is useless for retrieval. You must first break the document down into Chunks.

1. What is Chunking?

Chunking is the process of splitting a long document into smaller, searchable pieces.

  • Fixed-size Chunking: (e.g., Every 500 tokens). Simple but can cut a sentence in half.
  • Hierarchical Chunking: Respects headers and paragraphs. Better for context.
  • Overlap: We usually overlap chunks (e.g., 20% overlap) so that the end of one chunk flows into the start of the next, ensuring no context is lost at the split points (see the sketch after this list).
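
To make the list above concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is an illustration of the idea, not Bedrock's internal implementation, and it splits on whitespace words as a stand-in for real tokenizer tokens.

# Fixed-size chunking with overlap (illustrative sketch, not Bedrock's code).
def chunk_text(text, chunk_size=500, overlap_pct=20):
    words = text.split()  # stand-in for real tokenization
    step = chunk_size - chunk_size * overlap_pct // 100  # 500 - 100 = 400
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = "word " * 1200  # pretend 1,200-token document
print(len(chunk_text(doc)))  # 3 chunks; adjacent chunks share 100 words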

2. Managed Embeddings

Once we have chunks, we turn them into Vectors (lists of numbers).

  • In Bedrock, you select an Embedding Model (like amazon.titan-embed-text-v1).
  • Bedrock automatically calls this model for every chunk and saves the result in your Vector Database (e.g., OpenSearch Serverless); the sketch below shows what one such call looks like.
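
For intuition, here is a sketch of the embedding call Bedrock makes for each chunk, written against the Bedrock runtime API. The region and credential setup are assumptions for the example; in a managed Knowledge Base, Bedrock performs these calls for you during ingestion.

# One embedding call per chunk (what Bedrock runs behind the scenes).
# Assumes boto3 credentials and model access for amazon.titan-embed-text-v1.
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(chunk: str) -> list[float]:
    response = client.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": chunk}),
    )
    return json.loads(response["body"].read())["embedding"]

vector = embed("Dogs are pets")
print(len(vector))  # Titan Embeddings v1 returns a 1536-dimensional vector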

3. Visualizing the Math Space

graph LR
    C1[Chunk 1: 'Dogs are pets'] --> E[Embedding Model]
    E --> V1[[0.1, -0.9, 0.4]]
    
    C2[Chunk 2: 'Puppies are cute'] --> E
    E --> V2[[0.11, -0.89, 0.42]]
    
    V1 -.-|"Close together: Similar Meaning"| V2
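
The dotted line in the diagram is what a similarity metric makes precise. Here is a small cosine-similarity computation on the diagram's example vectors; a score near 1.0 means the vectors point in almost the same direction, i.e., the chunks carry similar meaning.

# Cosine similarity between the two vectors from the diagram above.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [0.1, -0.9, 0.4]     # "Dogs are pets"
v2 = [0.11, -0.89, 0.42]  # "Puppies are cute"
print(round(cosine_similarity(v1, v2), 4))  # 0.9997: nearly identical direction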

4. Why Chunking Strategy Matters

If your chunks are too small, the AI misses the "Big Picture." If your chunks are too large, the search results are noisy and expensive to process.

  • Standard Pick: 300-500 tokens with 20% overlap is the "Sweet Spot" for most documents (see the configuration sketch below).
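
As a sketch of how that pick translates into configuration: when you create a Knowledge Base data source through the Bedrock Agent API, the chunking strategy lives in the vectorIngestionConfiguration parameter. The knowledge base ID and bucket ARN below are placeholders, and the schema may evolve, so check the current boto3/Bedrock documentation.

# Applying the "Standard Pick" (500 tokens, 20% overlap) to a Knowledge
# Base data source. IDs and ARN are placeholders for illustration.
import boto3

agent = boto3.client("bedrock-agent", region_name="us-east-1")

agent.create_data_source(
    knowledgeBaseId="KB12345678",  # placeholder
    name="company-docs",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::my-docs-bucket"},
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 500,         # chunk size in tokens
                "overlapPercentage": 20,  # 20% overlap between chunks
            },
        }
    },
)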

Summary

  • Chunking creates manageable pieces of data for retrieval.
  • Overlap ensures context is preserved across split points.
  • Embeddings translate text meaning into mathematical coordinates.
  • Managed Pipelines handle the "Sync" between S3 and your Vector DB.
