
Similarity Search Basics
Deep dive into vector distance metrics: Cosine Similarity, Euclidean Distance, and Inner Product.
Similarity search is the process of finding the "closest" vectors in high-dimensional space. To do this, we need a mathematical definition of "closeness."
The Three Main Distance Metrics
1. Cosine Similarity (Recommended for RAG)
Measures the angle between two vectors. It ignores the "length" (magnitude) of the vectors and looks only at their direction.
- Range: -1 to 1 (1 means the same direction, 0 means orthogonal, -1 means opposite directions).
- Pro: Great for text where document length can vary significantly.
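As a minimal sketch of the idea above, the snippet below computes cosine similarity in plain Python and checks that scaling a vector does not change the score. The function name and the toy vectors are illustrative assumptions, not values from a real embedding model.

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between p and q: dot(p, q) / (|p| * |q|)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

# Toy vectors (made up for illustration).
short_doc = [1.0, 2.0, 3.0]
long_doc = [10.0, 20.0, 30.0]   # same direction, 10x the magnitude
opposite = [-1.0, -2.0, -3.0]   # exactly the opposite direction

same_direction = cosine_similarity(short_doc, long_doc)      # close to 1.0
opposite_direction = cosine_similarity(short_doc, opposite)  # close to -1.0
print(same_direction, opposite_direction)
```

The first score stays at (approximately) 1 even though one vector is ten times longer, which is the property that makes this metric forgiving of varying document length.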
2. Euclidean Distance (L2)
Measures the straight-line distance between two points.
- Equation: $\sqrt{\sum (p_i - q_i)^2}$
- Con: Sensitive to the magnitude of the vectors.
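To illustrate the magnitude sensitivity, here is a small sketch of the Euclidean formula in plain Python. The vectors are hypothetical; the point is that two vectors pointing in exactly the same direction can still be far apart under L2.

```python
import math

def euclidean_distance(p, q):
    """Straight-line (L2) distance: sqrt(sum((p_i - q_i)^2))."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Toy vectors (made up for illustration).
a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, but twice the length

same_point = euclidean_distance(a, a)      # 0.0: identical points
same_direction = euclidean_distance(a, b)  # sqrt(3^2 + 4^2) = 5.0
print(same_point, same_direction)
```

Under cosine similarity, a and b would score as a perfect match; under L2 they are 5 units apart purely because of the difference in length.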
3. Inner Product (Dot Product)
Calculates the sum of the products of corresponding components.
- Equation: $\sum p_i q_i$
- Note: If your vectors are normalized (length = 1), Inner Product is mathematically equivalent to Cosine Similarity.
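The equivalence for normalized vectors can be checked numerically. The sketch below uses hypothetical helper names (dot, normalize, cosine_similarity) and toy vectors; it compares the raw inner product, the cosine similarity, and the inner product after normalizing both vectors to length 1.

```python
import math

def dot(p, q):
    """Inner product: sum of p_i * q_i."""
    return sum(a * b for a, b in zip(p, q))

def normalize(v):
    """Scale v so that its length is 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(p, q):
    return dot(p, q) / (math.sqrt(dot(p, p)) * math.sqrt(dot(q, q)))

# Toy vectors (made up for illustration); both have length 5.
p = [3.0, 4.0]
q = [4.0, 3.0]

raw_ip = dot(p, q)                               # 24.0, depends on magnitude
cos = cosine_similarity(p, q)                    # 24 / 25 = 0.96
normalized_ip = dot(normalize(p), normalize(q))  # matches cos up to rounding
print(raw_ip, cos, normalized_ip)
```

On unnormalized vectors the raw inner product (24.0 here) is not bounded to the -1 to 1 range, so it only ranks results like cosine similarity once the vectors have been normalized.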
Precision vs. Recall
In search, you encounter a trade-off:
- Precision: How many of the results were actually relevant?
- Recall: Did we find all the relevant items that exist?
Vector searches are often Approximate (ANN, Approximate Nearest Neighbor): the index may miss a few of the true nearest neighbors, so we give up a small amount of recall in exchange for a large increase in speed.
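To make the two definitions concrete, here is a minimal sketch that computes both numbers for one query. The document IDs and the relevance labels are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant results returned / all results returned
    recall    = relevant results returned / all relevant items that exist
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical search output and ground-truth labels.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = ["doc1", "doc3", "doc7", "doc8", "doc9"]

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 2 of 4 returned are relevant; 2 of 5 relevant were found
```

Returning more results usually raises recall but tends to lower precision, which is the trade-off described above.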
The Search Process in RAG
- User Query: "Where is the budget report?"
- Embed Query: Query → [0.12, -0.4, ...]
- Compare: Compare the query vector against all document vectors in the DB.
- Rank: Sort by highest Cosine Similarity.
- Top-K: Return the top k results (usually 3-10).
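The steps above can be sketched as a brute-force (exact) search in plain Python. This is an illustration under stated assumptions, not how a vector database implements it: the file names and three-dimensional vectors are made up, and query_vec stands in for the output of a real embedding model.

```python
import math

def cosine_similarity(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def top_k(query_vec, doc_vecs, k=3):
    """Compare the query to every document, rank by similarity, keep k."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return scored[:k]

# Hypothetical pre-computed document embeddings (real ones have hundreds of dimensions).
doc_vecs = {
    "budget_report.pdf": [0.9, 0.1, 0.0],
    "holiday_schedule.docx": [0.0, 0.2, 0.9],
    "q3_finance_summary.xlsx": [0.8, 0.3, 0.1],
    "lunch_menu.txt": [0.1, 0.9, 0.2],
}

# Stand-in for the embedded query "Where is the budget report?".
query_vec = [1.0, 0.2, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Comparing against every stored vector is exact but scales linearly with the size of the collection, which is why production systems typically switch to an approximate index.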
Why Naive Search isn't enough
Vector search is great at finding "ideas," but it sometimes misses specific details (like a part number AB-123). This is why we move toward Advanced Retrieval.
Exercises
- Calculate the Cosine Similarity between [1, 0] and [0, 1] manually.
- Why would a vector search for "Apple" possibly return a result about "Bananas" instead of "Computers"?
- Look up the space parameter in Chroma. Which metric does it use by default?