How Text Embeddings Work: From Characters to Context

Understand the pipeline of text vectorization. Learn about tokenization, word embeddings vs. sentence embeddings, and the role of Transformers in creating context-aware vectors.

How Text Embeddings Work

In the last lesson, we defined embeddings as coordinates in space. But how does a raw string of text—like "The weather is nice today"—actually become a vector?

It's not a single jump. It's a pipeline. Understanding this pipeline is crucial because the choices made at each step (how you split text, which model you use) directly impact the quality of your search results.


1. Step 1: Tokenization (The Atomic Unit)

Computers don't read words; they read numbers. The process of breaking text into smaller units is called Tokenization.

Word-based vs. Sub-word Tokenization

Historically, we split by whitespace. But languages are messy (think of "pre-processing" or "unhappy").

Modern models use Sub-word Tokenization (like Byte Pair Encoding or WordPiece). This allows the model to handle words it has never seen before by breaking them into familiar parts.

  • Input: "Apple"
  • Tokens: ["Ap", "ple"] (mapped to IDs such as [122, 456]; the exact split and IDs depend on the tokenizer)

The Token-ID Mapping

Each token is mapped to a unique integer ID. This ID refers to a row in the model's "Vocabulary Matrix."
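
Here is a minimal sketch of sub-word tokenization using the Hugging Face transformers library. The model name "bert-base-uncased" is just an example checkpoint with a WordPiece tokenizer; other models will split the same text differently.

from transformers import AutoTokenizer

# Load an example WordPiece tokenizer (any checkpoint would do)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("Pre-processing makes unhappiness manageable")
ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)  # sub-word pieces; word continuations are prefixed with "##"
print(ids)     # the integer IDs that index the model's vocabulary matrix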


2. Step 2: The Embedding Lookup (Initial State)

Imagine the model's vocabulary has 50,000 tokens. For each token, there is a static vector.

  • Token 122 (Ap) -> [0.1, -0.4, 0.2, ...]
  • Token 456 (ple) -> [0.3, 0.8, -0.1, ...]

At this stage, these are Static Embeddings. They are context-free: "bank" the financial institution and "bank" the edge of a river start with the exact same vector. This is the problem Transformers solve.
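
To make the lookup concrete, here is a toy sketch: a randomly initialised vocabulary matrix and a row lookup by token ID. The matrix values and IDs are made up for illustration; in a real model the rows are learned during training.

import numpy as np

# Hypothetical vocabulary matrix: 50,000 tokens x 384 dimensions (random values here)
rng = np.random.default_rng(0)
vocab_matrix = rng.normal(size=(50_000, 384))

token_ids = [122, 456]                    # the "Ap" and "ple" IDs from the example above
static_vectors = vocab_matrix[token_ids]  # one row per token ID

print(static_vectors.shape)  # (2, 384): the same rows come back regardless of context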


3. Step 3: The Transformer (Adding Context)

This is the "AI" part. The tokens are fed into a Transformer architecture. The Transformer uses a mechanism called Self-Attention.

Self-attention allows tokens to "look" at each other. In the sentence "The bank was closed for the holiday," the word "bank" attends to "closed" and "holiday" and updates its vector toward the financial-institution sense.

In the sentence "The bank was muddy from the rain," the word "bank" attends to "muddy" and "rain" and updates its vector toward the geographic sense.

graph TD
    subgraph Attention_Mechanism
    A[The] --- B[bank]
    B --- |looks at| C[muddy]
    B --- |looks at| D[rain]
    end
    B -- refined vector --> E[Contextualized Embedding]

By the end of this step, we have a separate vector for every single token in our sentence.
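
To see the core idea in code, here is a stripped-down self-attention sketch in NumPy. It omits the learned query/key/value projections, multiple heads, and stacked layers of a real Transformer; it only shows how each token vector becomes a weighted mix of every token in the sentence.

import numpy as np

def toy_self_attention(X):
    # X has one row per token. Real attention uses learned Q/K/V projections;
    # here tokens are scored directly against each other for simplicity.
    scores = X @ X.T / np.sqrt(X.shape[1])                  # how strongly each token "looks at" the others
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax per token
    return weights @ X                                      # each token becomes a context-weighted mix

tokens = np.random.rand(6, 8)               # 6 tokens, 8 dimensions (made-up values)
contextualized = toy_self_attention(tokens)
print(contextualized.shape)                 # (6, 8): one context-aware vector per token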


4. Step 4: Pooling (Squashing into One Vector)

Vector databases usually store one vector per "item" (a paragraph, a document, or a sentence). But we just generated 10 separate vectors for a 10-token sentence. How do we turn 10 into 1?

This process is called Pooling.

Common Pooling Strategies:

  1. CLS Pooling: Taking the vector of the very first token (usually a special [CLS] token meant to represent the whole sentence).
  2. Mean Pooling: Taking the average of all token vectors. This is the most common method for search because every token contributes equally to the final vector.
  3. Max Pooling: Taking the highest value from each dimension across all tokens.
# Conceptual Pooling Logic (Mean Pooling)
import numpy as np

token_vectors = np.random.rand(5, 384)        # 5 token vectors, 384 dimensions each (made-up values)
sentence_vector = token_vectors.mean(axis=0)  # Mean Pooling: average across tokens -> one 384-d vector

The resulting Sentence Vector is what you eventually send to Pinecone or Chroma.
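
Libraries like sentence-transformers (used in the implementation below) handle pooling for you, but here is roughly what mean pooling looks like by hand with the lower-level transformers library. This is a sketch assuming the all-MiniLM-L6-v2 checkpoint; padding tokens are masked out of the average.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

encoded = tokenizer(["The bank was muddy from the rain."], padding=True, return_tensors="pt")
with torch.no_grad():
    token_vectors = model(**encoded).last_hidden_state       # (batch, tokens, hidden_size)

mask = encoded["attention_mask"].unsqueeze(-1).float()                 # ignore padding positions
sentence_vector = (token_vectors * mask).sum(dim=1) / mask.sum(dim=1)  # Mean Pooling
print(sentence_vector.shape)                                           # one vector for the whole sentence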


5. Chunking: Dealing with Context Limits

Models have a Context Window. For example, a model might only accept 512 tokens at a time. If you have a 10,000-word PDF, you cannot embed it as a single vector: anything beyond the window is truncated, so most of the detail is simply ignored.

Solution: You must split your text into Chunks.

Chunking Strategies for Vector DBs:

  • Fixed Size: Split every 500 characters. (Fast, but might cut a sentence in half).
  • Recursive Character: Split at paragraphs, then sentences, then words to keep meaning together.
  • Semantic Chunking: Use an AI model to find where the "topic" changes and split there.

The Overlap Trick

When chunking, we usually include an Overlap (e.g., 50 characters). This ensures that if important context is at the end of Chunk A, it's also at the beginning of Chunk B.
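
A minimal sketch of fixed-size chunking with overlap (the character counts mirror the numbers above; production pipelines usually count tokens rather than characters):

def chunk_text(text, chunk_size=500, overlap=50):
    """Fixed-size character chunking with overlap. Illustrative only."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

long_document = "Vector databases store embeddings. " * 200   # stand-in for a long PDF's text
chunks = chunk_text(long_document)
print(f"{len(chunks)} chunks; each chunk starts with the last 50 characters of the previous one")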


6. Python Implementation: Encoding Text with Sentence-Transformers

Let's see the full pipeline in action using the sentence-transformers library, which handles tokenization, the Transformer forward pass, and pooling for you in a single call.

from sentence_transformers import SentenceTransformer

# 1. Load a pre-trained model
# 'all-MiniLM-L6-v2' is a great balance of speed and quality
model = SentenceTransformer('all-MiniLM-L6-v2')

# 2. Define our texts
sentences = [
    "The financial bank is open until 5pm.",
    "The river bank is a great spot for fishing.",
    "The data is stored in a vector database."
]

# 3. Encode (This runs Tokenization -> Transformer -> Pooling)
embeddings = model.encode(sentences)

# 4. Inspect the output
print(f"Number of embeddings: {len(embeddings)}")
print(f"Dimensions per embedding: {len(embeddings[0])}")

# Let's check similarity between the two 'bank' sentences
from sklearn.metrics.pairwise import cosine_similarity

sim = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
print(f"\nSimilarity between financial bank and river bank: {sim:.4f}")
# Note: It will be lower than you expect, because the Transformer 
# understood the different contexts!

7. Choosing the Right Model

Not all text embeddings are created equal. You must choose based on your use case:

  • Fast, Local Search: Small open-source models (e.g., MiniLM)
  • High-Precision RAG: Large proprietary models (e.g., OpenAI text-embedding-3-large)
  • Multilingual Search: Multilingual models (e.g., paraphrase-multilingual)
  • Search within Code: Code-specific models (e.g., unixcoder)
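
With sentence-transformers, switching categories is often just a matter of changing the model name. As a sketch, a commonly used multilingual checkpoint (one option among many) embeds texts in different languages into the same vector space:

from sentence_transformers import SentenceTransformer

# A multilingual checkpoint: texts in different languages share one vector space
multilingual_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

embeddings = multilingual_model.encode([
    "Where is the train station?",
    "Où est la gare ?"
])
print(f"{len(embeddings)} embeddings, {len(embeddings[0])} dimensions each")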

The MTEB Benchmark

To choose the best model, always check the MTEB (Massive Text Embedding Benchmark) leaderboard on HuggingFace. It ranks models based on their performance in retrieval, clustering, and classification.


Summary and Key Takeaways

The path from a string to a vector is a sophisticated engineering pipeline.

  1. Tokenization breaks text into sub-word tokens mapped to integer IDs.
  2. Transformers use attention to add "context" to those chunks.
  3. Pooling squashes multiple token vectors into one "Sentence Vector."
  4. Chunking is required to manage documents longer than the model's context window.

In the next lesson, we will move beyond text and explore Image and Multimodal Embeddings, learning how we can search for a "cute dog" and find both the word and the picture.


Exercise: Tokenization Analysis

Use a free online tokenizer (like the OpenAI Tokenizer at platform.openai.com/tokenizer).

  1. Input a complex technical sentence.
  2. Input a sentence in a different language.
  3. Observe how the "Token IDs" change.
  4. Try a "nonsense" word (e.g., "AntigravityAI-inator"). Notice how it breaks it into multiple tokens.

Understanding how your text is "sliced" is the first step in debugging why a search might be failing.
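
If you prefer to run the same experiment locally, OpenAI's tiktoken library exposes the tokenizer directly. A quick sketch; "cl100k_base" is the encoding used by several recent OpenAI models.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["The weather is nice today", "AntigravityAI-inator"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {len(ids)} tokens: {pieces}")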
