
Caching Strategies: Semantic and Exact Matches
Learn how to bypass your vector database entirely using caching. Master Exact Match, Semantic Cache, and Result Deduplication.
The fastest vector query is the one you don't have to make. If two users ask "What is the capital of France?", we don't need to re-embed the string and re-search the database. We should retrieve it from a cache.
In this lesson, we learn about Exact and Semantic caching for vector systems.
1. Exact Match Cache (Redis/Memcached)
This is the simplest form of caching. We store the result of a query keyed by a hash of the input string.
- Pros: No embedding or LLM cost on a hit, extremely fast (1-2 ms).
- Cons: Only works if the user types the exact same characters. "What is AI?" and "What is AI" (missing punctuation) would be two different cache entries.
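The idea above can be sketched as a small in-process cache. This is an illustrative stand-in (a Python dict instead of Redis/Memcached); the class and method names are hypothetical, but the hashing behavior, and its sensitivity to punctuation, is exactly the trade-off described:

```python
import hashlib

class ExactMatchCache:
    """Minimal in-process stand-in for a Redis/Memcached exact-match cache."""

    def __init__(self):
        self._store = {}

    def _key(self, query: str) -> str:
        # Key by a hash of the raw input string; any character
        # difference ("What is AI?" vs "What is AI") yields a new key.
        return hashlib.sha256(query.encode("utf-8")).hexdigest()

    def get(self, query: str):
        return self._store.get(self._key(query))

    def set(self, query: str, result: str):
        self._store[self._key(query)] = result

cache = ExactMatchCache()
cache.set("What is AI?", "Artificial intelligence is ...")
print(cache.get("What is AI?"))   # hit: identical string
print(cache.get("What is AI"))    # miss: punctuation differs -> None
```

In production the dict would be replaced by Redis `GET`/`SET` calls, but the keying logic is the same.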
2. Semantic Cache: The "Fuzzy" Bypass
A Semantic Cache uses a vector search on a very small, fast local database (such as an in-memory Chroma instance) to check whether we've answered a similar question recently.
- The Logic:
- User asks: "How do I reset my password?"
- Search Cache: Find similarity to past queries.
- If similarity > 0.98, return the cached answer.
- Only if similarity is low, proceed to the main database and LLM.
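The threshold logic above can be sketched in a few lines. This is a toy version: the 3-dimensional "embeddings" stand in for real model output (in practice you would encode queries with a sentence-transformer), and `semantic_lookup` is a hypothetical helper name, but the compare-against-threshold flow matches the steps listed:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def semantic_lookup(query_vec, cache_entries, threshold=0.98):
    """Return the cached answer for the most similar past query,
    or None if nothing clears the similarity threshold."""
    best_score, best_answer = 0.0, None
    for past_vec, answer in cache_entries:
        score = cosine_similarity(query_vec, past_vec)
        if score > best_score:
            best_score, best_answer = score, answer
    return best_answer if best_score >= threshold else None

# Toy embeddings standing in for real model output.
cache_entries = [
    ([0.9, 0.1, 0.0], "Go to Settings > Security > Reset Password."),
]
print(semantic_lookup([0.89, 0.11, 0.01], cache_entries))  # near-duplicate: hit
print(semantic_lookup([0.0, 0.2, 0.9], cache_entries))     # unrelated: miss -> None
```

On a miss (`None`), the caller falls through to the main vector database and the LLM, then stores the new answer in the cache.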
3. Implementation: Building a Semantic Cache (Python)
Using GPTCache or a custom Redis-based logic:
```python
import hashlib

import redis
from sentence_transformers import SentenceTransformer

# Setup
r = redis.Redis(host='localhost', port=6379, db=0)
model = SentenceTransformer('all-MiniLM-L6-v2')

def get_cached_result(query_text):
    # 1. Check the exact cache first (keyed by a hash of the raw query)
    key = hashlib.sha256(query_text.encode("utf-8")).hexdigest()
    exact_hit = r.get(f"exact:{key}")
    if exact_hit:
        return exact_hit

    # 2. Check the semantic cache (pseudo-code):
    #    embedding = model.encode(query_text)
    #    search a separate 'cache' collection in your vector DB;
    #    if the top hit clears the similarity threshold, return its answer
    return None
```
4. Cache Eviction: Staying Relevant
Vector data changes. If you updated your documentation today, yesterday's cache is probably wrong.
- Time-Based (TTL): Invalidate cache entries every 24 hours.
- Event-Based: When you run a `delete` or `update` on the Vector DB, clear the corresponding cache entries.
- Least Recently Used (LRU): Keep only the most common 1,000 queries.
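The first two strategies can be combined in a small sketch. This is an in-process illustration with hypothetical names (`TTLCache`, `invalidate`); a real deployment would typically lean on Redis's built-in key expiry (`EXPIRE`/`SETEX`) instead. A fake clock is injected so the expiry is deterministic:

```python
import time

class TTLCache:
    """Minimal time-based (TTL) cache: entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # lazily evict the stale entry
            return None
        return value

    def invalidate(self, key):
        # Event-based eviction: call this when the source document is
        # updated or deleted in the Vector DB.
        self._store.pop(key, None)

# Simulated clock so the expiry is deterministic.
now = [0.0]
cache = TTLCache(ttl_seconds=10, clock=lambda: now[0])
cache.set("q1", "cached answer")
print(cache.get("q1"))   # fresh: hit
now[0] = 11.0            # advance past the TTL
print(cache.get("q1"))   # expired: None
```

The same `invalidate` hook is what an event-based pipeline would call from its document-update handler.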
5. Summary and Key Takeaways
- Exact Cache: Best for high-traffic, repetitive apps.
- Semantic Cache: Best for flexible agents and chatbots.
- Reduced Cost: Caching can save you up to 80% on embedding and database costs.
- Invalidation: Always have a strategy to clear the cache when the source data changes.
In the next lesson, we’ll look at the big picture: Cost-Performance Trade-offs.