Text Embeddings

Master the fundamentals of text-to-vector transformation, model selection, and vector space theory.

Embeddings are the core of RAG. They convert human language into numerical vectors (arrays of numbers) such that texts with similar meanings are positioned close to each other in high-dimensional space.

How They Work

An embedding model takes a string of text and outputs a vector (e.g., 1536 numbers for OpenAI's text-embedding-3-small).

# Conceptual example ("model" stands in for any embedding client)
vector = model.embed("What is RAG?")
# Output: a list of floats, e.g. [0.012, -0.045, 0.231, ...]
# (1536 values for text-embedding-3-small)

Key Properties

  1. Semantic Density: Unlike keyword search (which matches exact characters), embeddings capture the "idea" of a sentence, so paraphrases land near each other.
  2. Cosine Similarity: The standard way to measure how "close" two vectors are (see the sketch after this list).
  3. Fixed Dimension: Every output from a given model has the same number of dimensions, regardless of input length.
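
Cosine similarity measures the angle between two vectors: values near 1.0 mean the vectors point in the same direction (similar meaning), values near 0 mean they are unrelated. A minimal sketch using NumPy; the three-dimensional vectors here are toy values for illustration, not real embeddings:

import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" for illustration only
print(cosine_similarity([0.1, 0.9, 0.0], [0.2, 0.8, 0.1]))  # ~0.98 (similar)
print(cosine_similarity([0.1, 0.9, 0.0], [0.9, 0.0, 0.1]))  # ~0.11 (dissimilar)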

Choosing a Text Embedding Model

| Model                  | Provider    | Dims | Key Strength       |
|------------------------|-------------|------|--------------------|
| text-embedding-3-small | OpenAI      | 1536 | Cost & Efficiency  |
| titan-embed-text-v2    | AWS         | 1024 | Cloud Integrated   |
| bge-small-en-v1.5      | Open Source | 384  | Speed (Local)      |
| voyage-2               | Voyage AI   | 1024 | Retrieval Accuracy |

The MTEB Benchmark

If you are looking for the "best" model, refer to the Massive Text Embedding Benchmark (MTEB) leaderboard on Hugging Face. It ranks models based on their performance across retrieval, clustering, and classification tasks.
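
Many of the open-source entries on the leaderboard, including bge-small-en-v1.5 from the table above, can be run locally. A minimal sketch using the sentence-transformers library (assuming it is installed via pip install sentence-transformers; BAAI/bge-small-en-v1.5 is the model's Hugging Face identifier):

from sentence_transformers import SentenceTransformer

# Downloads the model weights from Hugging Face on first run
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

embedding = model.encode("What is RAG?")
print(len(embedding))  # 384 dimensions for this model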

Practical Implementation (OpenAI)

from openai import OpenAI

client = OpenAI()  # Reads the OPENAI_API_KEY environment variable

def get_embedding(text, model="text-embedding-3-small"):
    # Replace newlines, which can degrade embedding quality for some models
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding
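
To check the helper end to end, a quick usage example. It assumes OPENAI_API_KEY is set in your environment and reuses the cosine_similarity function defined earlier; the sentences are illustrative:

v1 = get_embedding("How does retrieval-augmented generation work?")
v2 = get_embedding("Explain RAG to me.")
v3 = get_embedding("Best pizza toppings for summer.")

print(cosine_similarity(v1, v2))  # Higher: both sentences are about RAG
print(cosine_similarity(v1, v3))  # Lower: unrelated topic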

Exercises

  1. Compare the word "Apple" (the fruit) and "Apple" (the company) in vector space using two different sentences (a starter sketch follows this list).
  2. What happens to the embedding if you change a single word to its synonym?
  3. Why is it important to use the same model for both ingestion and querying?
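
As a starting point for Exercise 1, a sketch that embeds the ambiguous word inside two disambiguating sentences, reusing get_embedding and cosine_similarity from above (the example sentences are illustrative):

fruit = get_embedding("I ate a crisp apple with my lunch.")
company = get_embedding("Apple announced a new iPhone this morning.")
word_alone = get_embedding("Apple")

# The bare word should land somewhere between the two contexts
print(cosine_similarity(word_alone, fruit))
print(cosine_similarity(word_alone, company))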
