Inspecting Embeddings and Similarity Scores

Techniques for debugging the mathematical heart of your RAG system by analyzing vector distances and index quality.

If the wrong documents are bubbling to the top, the similarity scores behind the ranking are often misleading. Inspecting those scores is the first step in tuning your vector DB.

The Similarity Distribution

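In a healthy RAG system there is a clear gap between the scores of relevant and irrelevant documents. As a rough illustration using cosine similarity (the exact numbers vary by embedding model):

  • Relevant documents should have scores of roughly 0.85 - 0.98.
  • Irrelevant documents should have scores of roughly 0.30 - 0.60.

If all your documents (relevant or not) score in a narrow band such as 0.75 - 0.82, your embedding space is "too crowded": the scores no longer separate relevant documents from irrelevant ones.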
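One way to check for crowding is to score a handful of documents you have already labeled by hand and look at the gap between the two groups. The sketch below is a minimal illustration in plain Python; the function names (`cosine_similarity`, `separation_report`) and the toy 3-dimensional vectors are assumptions made for this example, not part of any library.

```python
import math
from statistics import mean

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def separation_report(query, relevant, irrelevant):
    """Compare scores of known-relevant vs known-irrelevant vectors."""
    rel = [cosine_similarity(query, v) for v in relevant]
    irr = [cosine_similarity(query, v) for v in irrelevant]
    return {
        "relevant_mean": mean(rel),
        "irrelevant_mean": mean(irr),
        # Gap between the worst relevant and the best irrelevant score.
        # A gap near zero (or negative) suggests a crowded space.
        "gap": min(rel) - max(irr),
    }

# Toy vectors, purely illustrative; real embeddings have far more dimensions.
query = [1.0, 0.0, 0.0]
relevant = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
irrelevant = [[0.1, 0.9, 0.2], [0.0, 0.3, 0.9]]

report = separation_report(query, relevant, irrelevant)
print(report)
```

With real data, replace the toy vectors with embeddings of documents you have judged by hand. A small or negative `gap` is the numeric signature of the crowded band described above.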

Debugging "False Positives"

A false positive is an irrelevant doc with a high similarity score. Common causes:

  1. Model Bias: The model might over-value certain keywords (e.g. any document with the word "AI" gets a high score regardless of context).
  2. Normalization Errors: Dot Product search only behaves like cosine similarity when the vectors are unit length. If you forgot to normalize your vectors, longer vectors score higher regardless of how well their direction matches the query.
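The normalization problem is easy to reproduce by hand. The following sketch uses made-up 2-dimensional vectors (`on_topic` and `off_topic` are hypothetical names) to show how an unnormalized, longer vector can outrank a better-aligned one under raw dot product, and how normalizing restores the expected order.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

query = [1.0, 0.0]
on_topic = [0.9, 0.1]    # points almost the same way as the query
off_topic = [5.0, 5.0]   # 45 degrees away from the query, but much longer

# Raw dot product rewards vector length, not just direction.
raw_on = dot(query, on_topic)
raw_off = dot(query, off_topic)

# After normalization, the dot product equals cosine similarity.
unit_on = dot(normalize(query), normalize(on_topic))
unit_off = dot(normalize(query), normalize(off_topic))

print(f"raw dot:  on_topic={raw_on:.3f}  off_topic={raw_off:.3f}")
print(f"unit dot: on_topic={unit_on:.3f}  off_topic={unit_off:.3f}")
```

If you see this pattern in your own index, checking the norms of a sample of stored vectors is a quick way to confirm whether they were normalized before insertion.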

Visualizing Vectors (Dimensionality Reduction)

To "see" your embeddings, project them from their native dimensionality (for example, 1024 dimensions) down to 2 or 3.

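  • t-SNE / UMAP: Nonlinear algorithms that try to keep similar documents close together in a 2D or 3D plot. Distances between far-apart clusters in these plots are not reliable, so read them for grouping rather than for exact distance.
  • Tooling: Use the TensorBoard Embedding Projector, or export the vectors from your vector DB and plot them with a library of your choice.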
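As a quick first look before reaching for t-SNE or UMAP, a linear projection is often enough to spot obvious clusters. The sketch below swaps in PCA (computed with NumPy's SVD) as a simpler stand-in, not t-SNE or UMAP themselves. The `pca_2d` helper and the synthetic "embeddings" are assumptions for illustration; in practice you would pass in vectors exported from your own index.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their top two principal components."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:2].T

rng = np.random.default_rng(0)
# Synthetic stand-in for real embeddings: two clusters in 64 dimensions.
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(20, 64))
cluster_b = rng.normal(loc=1.0, scale=0.1, size=(20, 64))

points = pca_2d(np.vstack([cluster_a, cluster_b]))
print(points.shape)  # one (x, y) pair per input vector
```

The resulting `points` array can be handed to any scatter-plot tool. Coloring each point by document source or topic makes it easier to see whether unrelated documents overlap.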

Implementation: Inspecting Chroma Scores

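# Assumes `collection` is an existing, populated Chroma collection.
results = collection.query(
    query_texts=["Search query"],
    n_results=10,
    include=["distances", "documents", "metadatas"],
)

# Results are nested per query; index 0 is the first (and only) query here.
for doc_id, distance in zip(results["ids"][0], results["distances"][0]):
    print(f"Doc: {doc_id} | Distance: {distance:.4f}")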

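(Note: Chroma returns a distance, not a similarity. Lower distance = higher similarity. The scale of the distance depends on the metric the collection was created with, so check that setting before comparing numbers against a similarity range.)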
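To compare distances against similarity ranges, it can help to convert them. The helper below is a sketch under stated assumptions: that the "l2" space reports squared Euclidean distance, that "cosine" reports 1 minus cosine similarity, that "ip" reports 1 minus the dot product, and that embeddings are unit length for the "l2" and "ip" cases. The function name is hypothetical, and the metric definitions should be checked against the documentation for your Chroma version.

```python
def distance_to_cosine_similarity(distance, space="l2"):
    """Convert a distance into cosine similarity.

    Assumes unit-length embeddings for the "l2" and "ip" spaces.
    """
    if space == "cosine":
        # Assumed definition: distance = 1 - cosine similarity.
        return 1.0 - distance
    if space == "ip":
        # Assumed definition: distance = 1 - dot product.
        return 1.0 - distance
    if space == "l2":
        # For unit vectors, squared L2 distance = 2 - 2 * cosine similarity.
        return 1.0 - distance / 2.0
    raise ValueError(f"unknown space: {space!r}")

for d in (0.0, 0.5, 1.0, 2.0):
    print(d, distance_to_cosine_similarity(d, space="l2"))
```

If your vectors are not normalized, the "l2" and "ip" conversions do not hold, and the raw distances are better compared only against each other.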

Exercises

  1. Print the "Distances" for 10 different queries in your RAG app.
  2. Is there a consistent "Distance Threshold" (e.g., 0.5) that separates good results from bad?
  3. What happens to the distance if you search for gibberish like asdfjkl;?
