
Text Embeddings
Master the fundamentals of text-to-vector transformation, model selection, and vector space theory.
Embeddings are the core of RAG. They convert human language into numerical vectors (arrays of floating-point numbers) so that pieces of text with similar meanings end up close to each other in high-dimensional space.
How They Work
An embedding model takes a string of text and outputs a vector (e.g., 1536 numbers for OpenAI's text-embedding-3-small).
```python
# Conceptual example
vector = model.embed("What is RAG?")
# Output: [0.012, -0.045, 0.231, ...]
```
Key Properties
- Semantic Density: Unlike keyword search (which matches exact terms), embeddings capture the "idea" of a sentence, so "car" and "automobile" land close together.
- Cosine Similarity: The primary way we measure "closeness" between two vectors; it compares the angle between them rather than their magnitude (see the sketch after this list).
- Fixed Dimension: Every output from a given model has the same number of dimensions, regardless of input length.
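A minimal sketch of cosine similarity using NumPy. The toy 3-dimensional vectors are illustrative only; real embeddings have hundreds or thousands of dimensions:

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([0.1, 0.8, 0.3], [0.2, 0.7, 0.4]))  # close to 1.0
```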
Choosing a Text Embedding Model
| Model | Provider | Dims | Key Strength |
|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | Cost & Efficiency |
| titan-embed-text-v2 | AWS | 1024 | Cloud Integrated |
| bge-small-en-v1.5 | Open Source | 384 | Speed (Local) |
| voyage-2 | Voyage AI | 1024 | Retrieval Accuracy |
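For the open-source row, here is a hedged sketch of embedding text locally. It assumes the sentence-transformers package is installed; "BAAI/bge-small-en-v1.5" is the model ID published on Hugging Face:

```python
from sentence_transformers import SentenceTransformer

# Downloads the model weights on first run; everything runs locally afterwards
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

vectors = model.encode(["What is RAG?", "Retrieval-Augmented Generation explained"])
print(vectors.shape)  # (2, 384) -- matches the 384 dims in the table above
```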
The MTEB Benchmark
If you are looking for the "best" model, refer to the Massive Text Embedding Benchmark (MTEB) leaderboard on Hugging Face. It ranks models based on their performance across retrieval, clustering, and classification tasks.
Practical Implementation (OpenAI)
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    # Newlines can degrade embedding quality, so flatten them to spaces
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding
```
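A quick usage sketch, reusing the cosine_similarity helper defined earlier (assumes a valid API key is set):

```python
v1 = get_embedding("What is RAG?")
v2 = get_embedding("Explain Retrieval-Augmented Generation")

print(len(v1))                    # 1536 for text-embedding-3-small
print(cosine_similarity(v1, v2))  # semantically similar, so expect a high score
```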
Exercises
- Compare the word "Apple" (the fruit) and "Apple" (the company) in vector space using two different sentences (a starter sketch follows this list).
- What happens to the embedding if you change a single word to its synonym?
- Why is it important to use the same model for both ingestion and querying?
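A starter sketch for the first exercise, reusing get_embedding and cosine_similarity from above. The three sentences are only examples; try your own:

```python
fruit = get_embedding("I ate a crisp apple with my lunch.")
company = get_embedding("Apple announced a new iPhone today.")
tech = get_embedding("Microsoft released a new laptop this week.")

# If the model captures context, the company sentence should sit closer
# to the tech sentence than to the fruit sentence.
print(cosine_similarity(company, tech))
print(cosine_similarity(company, fruit))
```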