
Similarity Metrics: The Math of "Closeness"
We have reached the final lesson of Module 2: Embeddings Fundamentals. You know how to create vectors and how to choose their size. Now, we must answer the most critical question: How do we measure if two vectors are "similar"?
In a vector database, similarity is not a yes/no check; it is a numeric distance calculation. Three metrics dominate production AI systems: Cosine Similarity, Dot Product, and Euclidean Distance (L2).
Choosing the wrong metric for your model can result in completely nonsensical search results.
1. Cosine Similarity: The Industry Standard
Cosine similarity measures the angle between two vectors. It ignores the "length" (magnitude) of the vectors and focuses only on their direction.
Why it's the "Go To" for Text
Imagine a short sentence: "I like cats."
Now imagine a long paragraph that repeats: "I like cats, I like cats, I like cats..."
A keyword search would see these as very different because one has far more "cat" tokens. But semantically, they mean the same thing. Because Cosine Similarity looks only at the angle, it sees both vectors pointing in the same "cat love" direction and scores them at (or very near) 1.0; a toy numerical sketch after the list below verifies this.
graph TD
    subgraph Space
        A((Vector A))
        B((Vector B))
        C((Vector C))
        A ---|Small angle| B
        A ---|Large angle| C
        style A fill:#f9f,stroke:#333
        style B fill:#ccf,stroke:#333
    end
- Range: -1.0 to 1.0 (for most text embeddings, scores land between 0 and 1).
- 1.0: Perfect match (same direction).
- 0.0: Orthogonal (no relationship).
- -1.0: Opposite directions (rare with real-world embeddings).
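Here is that toy sketch, a minimal example with made-up vectors (not real embeddings): one vector stands in for the short sentence, and a tripled copy stands in for the repetitive paragraph.
import numpy as np

# Toy stand-ins: "I like cats" vs. the same meaning repeated three times.
# The longer text points the same way; it just has more magnitude.
short_text = np.array([1.0, 2.0, 0.5])
long_text = 3 * short_text

cosine = np.dot(short_text, long_text) / (
    np.linalg.norm(short_text) * np.linalg.norm(long_text)
)
print(f"Cosine similarity: {cosine:.4f}")  # 1.0000 -- magnitude is ignored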
2. Dot Product: The Raw Choice
The Dot Product is mathematically simpler: it multiplies the corresponding components of two vectors and sums the results. Unlike Cosine, it cares about magnitude.
If "Document A" has a larger (higher-magnitude) embedding than "Document B," it will produce a higher Dot Product score against the same query, even when both point in exactly the same direction.
When to use Dot Product:
- Normalized Vectors: If your vectors are already normalized (length = 1.0), the Dot Product is exactly identical to Cosine Similarity. Many popular models, including OpenAI's embedding models, return unit-length vectors, and Sentence Transformers models can normalize on request (see the sketch after this list).
- Maximum Performance: The Dot Product is the cheapest of the three to compute, with no square roots or division. Many high-scale vector databases normalize vectors at write time precisely so they can run the raw Dot Product internally.
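The equivalence is easy to verify. A minimal sketch with arbitrary toy vectors, normalized by hand:
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 3.0, 1.0])

# Scale both vectors to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_units = np.dot(a_unit, b_unit)

print(f"Cosine similarity:        {cosine:.6f}")
print(f"Dot product (normalized): {dot_of_units:.6f}")  # same value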
3. Euclidean Distance (L2): The Geographic Choice
Euclidean distance is the "straight line" distance between two points in space. It's essentially the Pythagorean theorem applied to 1536 dimensions.
Unlike Cosine (which looks only at direction), Euclidean distance treats every vector as a precise location in space.
When to use Euclidean Distance:
- Physical Measurements: If your vectors represent real-world coordinates, sensor data, or image pixel positions.
- Dense Clusters: If you want to group items into specific "buckets" (clustering), Euclidean is often more effective than Cosine (see the toy example after the formula below).
# Euclidean (L2) distance in N dimensions
distance = np.sqrt(np.sum((a - b) ** 2))  # equivalent to np.linalg.norm(a - b)
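Here is that toy clustering example: a nearest-centroid assignment, the core step that k-means repeats. The points and centroids are made up for illustration.
import numpy as np

# Toy data: three 2D points, two cluster centers
points = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0]])
centroids = np.array([[1.0, 1.0], [8.0, 8.0]])

# Pairwise Euclidean distances via broadcasting: shape (n_points, n_centroids)
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)

# Each point joins the cluster whose center is physically closest
assignments = dists.argmin(axis=1)
print(assignments)  # [0 0 1]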
4. Comparison Table: Which one should you pick?
| Metric | Focuses On | Best For | Typical Range |
|---|---|---|---|
| Cosine | Direction | NLP, Text RAG, Semantic Search | -1 to 1 (usually 0 to 1) |
| Dot Product | Direction & Magnitude | Normalized vectors, Recommenders | -inf to +inf |
| Euclidean | Distance | Clustering, Computer Vision, Geospatial | 0 to +inf |
5. The Golden Rule of Similarity
The metric you use in the database MUST match the metric the model was trained with.
If the researchers who built the BGE-M3 model used Cosine Similarity to train it, but you configure your Chroma database to use Euclidean Distance, your search results will be unreliable.
- OpenAI: Recommends Cosine Similarity.
- Pinecone: Default is Cosine, but supports all three.
- HuggingFace: Usually Cosine for text models.
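As a concrete sketch of the rule in practice: in Chroma, the distance function is chosen when the collection is created, via the `hnsw:space` metadata key (the collection name here is made up).
import chromadb

client = chromadb.Client()

# Pick the metric up front: "cosine", "ip" (inner/dot product), or "l2".
# It cannot be changed later without rebuilding the collection.
collection = client.create_collection(
    name="docs",
    metadata={"hnsw:space": "cosine"},
)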
6. Python Implementation: Calculating Metrics from Scratch
Let's look at the raw math in Python using numpy.
import numpy as np

# Two 3D vectors
v1 = np.array([1, 2, 3])
v2 = np.array([2, 4, 6])  # v2 is just v1 doubled in length

def cosine_sim(a, b):
    # dot(a, b) / (norm(a) * norm(b))
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dot_product(a, b):
    return np.dot(a, b)

def euclidean_dist(a, b):
    return np.linalg.norm(a - b)

print(f"Vector 1: {v1}")
print(f"Vector 2: {v2}\n")

print(f"Cosine Similarity: {cosine_sim(v1, v2):.4f}")
# Result: 1.0000 (they point in exactly the same direction)

print(f"Dot Product: {dot_product(v1, v2):.4f}")
# Result: 28.0000 (large because magnitude counts)

print(f"Euclidean Distance: {euclidean_dist(v1, v2):.4f}")
# Result: 3.7417 (they sit physically far apart)
7. Performance at Scale: SIMD and Quantization
When your vector database has 100 million vectors, it does not crunch the numbers one at a time. It uses SIMD (Single Instruction, Multiple Data).
Modern CPUs can perform 16 or more 32-bit float multiplications in a single instruction, and this is a large part of why we can search millions of vectors in milliseconds.
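You can feel the difference even from Python. numpy's `dot` dispatches to vectorized native code, while a hand-rolled loop pays per-element interpreter overhead. A rough sketch (exact timings will vary by machine):
import time
import numpy as np

rng = np.random.default_rng(42)
a = rng.random(1536, dtype=np.float32)
b = rng.random(1536, dtype=np.float32)

# Hand-rolled loop: one multiply-add per Python iteration
start = time.perf_counter()
for _ in range(1_000):
    total = sum(x * y for x, y in zip(a, b))
loop_seconds = time.perf_counter() - start

# np.dot: delegates to vectorized (SIMD-friendly) machine code
start = time.perf_counter()
for _ in range(1_000):
    total = np.dot(a, b)
fast_seconds = time.perf_counter() - start

print(f"Python loop: {loop_seconds:.3f}s")
print(f"np.dot:      {fast_seconds:.3f}s")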
However, even with SIMD, 100 million 1536D vectors is a lot of data. Vector databases therefore often use Quantization (which we will revisit in Module 3) to convert 32-bit floats into 8-bit integers or even 1-bit binary codes. This can speed up the similarity calculation by roughly 4x to 30x, typically at the cost of only a small (on the order of 1-2%) drop in accuracy.
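Here is a minimal sketch of the int8 idea: scalar quantization with a single per-vector scale factor. Real systems are more sophisticated, but the memory arithmetic is the same.
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=1536).astype(np.float32)

# Map each float to an int8 using one shared scale factor
scale = np.abs(v).max() / 127.0
v_int8 = np.round(v / scale).astype(np.int8)

# Reconstruct and measure how much information was lost
v_restored = v_int8.astype(np.float32) * scale
rel_error = np.linalg.norm(v - v_restored) / np.linalg.norm(v)

print(f"Memory: {v.nbytes} bytes -> {v_int8.nbytes} bytes (4x smaller)")
print(f"Relative reconstruction error: {rel_error:.4%}")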
Summary and Key Takeaways
Similarity metrics are the "Compass" of the vector database.
- Cosine Similarity is the industry standard for NLP and RAG because it focuses on meaning, not length.
- Dot Product is fastest and best for normalized vectors.
- Euclidean Distance is best for raw spatial data and clustering.
- Consistency is mandatory: Always use the metric your model's creators recommend.
Module 2 Wrap-up
You have now mastered the Embeddings Fundamentals. You understand what vectors are, how they are made, how big they should be, and how to measure them.
In Module 3: Vector Search Concepts, we will move from the "Math" to the "Engineering." We will learn how vector databases actually index this data for lightning-fast retrieval using algorithms like HNSW and IVF.
Exercise: Comparing Metrics
Take a piece of paper and draw a 2D grid.
- Point A: (2, 2)
- Point B: (4, 4)
- Point C: (2, 0)
- Which two points have the highest Cosine Similarity? (Which ones point in the same direction from the origin 0,0?)
- Which two points have the smallest Euclidean Distance? (Which ones are physically closest?)
- If Point A is "I like coffee" and Point B is "I really, really like coffee," which metric would see them as more similar?
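Once you have worked through it on paper, you can check your answers with a few lines of numpy:
import numpy as np

A, B, C = np.array([2, 2]), np.array([4, 4]), np.array([2, 0])

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare every pair on both metrics
for name, (p, q) in {"A-B": (A, B), "A-C": (A, C), "B-C": (B, C)}.items():
    print(f"{name}: cosine={cosine(p, q):.3f}, euclidean={np.linalg.norm(p - q):.3f}")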
Congratulations! You have finished Module 2. See you in Module 3.