
Caching Strategies: Semantic and Exact Matches
Learn how to bypass your vector database entirely using caching. Master Exact Match, Semantic Cache, and Result Deduplication.
The fastest vector query is the one you don't have to make. If two users ask "What is the capital of France?", we don't need to re-embed the string and re-search the database. We should retrieve it from a cache.
In this lesson, we learn about Exact and Semantic caching for vector systems.
1. Exact Match Cache (Redis/Memcached)
This is the simplest form of caching. We store the result of a query keyed by a hash of the input string.
- Pros: No embedding or LLM cost on a hit, extremely fast (1-2 ms).
- Cons: Only works if the user types the exact same characters. "What is AI?" and "What is AI" (missing punctuation) would be two different cache entries.
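The idea above can be sketched as a small in-process cache. This is an illustrative stand-in (a Python dict instead of Redis/Memcached); the class and method names are hypothetical, but the hashing behavior, and its sensitivity to punctuation, is exactly the trade-off described:

```python
import hashlib

class ExactMatchCache:
    """Minimal in-process stand-in for a Redis/Memcached exact-match cache."""

    def __init__(self):
        self._store = {}

    def _key(self, query: str) -> str:
        # Key by a hash of the raw input string; any character
        # difference ("What is AI?" vs "What is AI") yields a new key.
        return hashlib.sha256(query.encode("utf-8")).hexdigest()

    def get(self, query: str):
        return self._store.get(self._key(query))

    def set(self, query: str, result: str):
        self._store[self._key(query)] = result

cache = ExactMatchCache()
cache.set("What is AI?", "Artificial intelligence is ...")
print(cache.get("What is AI?"))   # hit: identical string
print(cache.get("What is AI"))    # miss: punctuation differs -> None
```

In production the dict would be replaced by Redis `GET`/`SET` calls, but the keying logic is the same.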
2. Semantic Cache: The "Fuzzy" Bypass
A Semantic Cache uses a vector search on a very small, fast local database (such as an in-memory Chroma instance) to check whether we've answered a similar question recently.
- The Logic:
- User asks: "How do I reset my password?"
- Search Cache: Find similarity to past queries.
- If similarity > 0.98, return the cached answer.
- Only if similarity is low, proceed to the main database and LLM.
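The threshold logic above can be sketched in a few lines. This is a toy version: the 3-dimensional "embeddings" stand in for real model output (in practice you would encode queries with a sentence-transformer), and `semantic_lookup` is a hypothetical helper name, but the compare-against-threshold flow matches the steps listed:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def semantic_lookup(query_vec, cache_entries, threshold=0.98):
    """Return the cached answer for the most similar past query,
    or None if nothing clears the similarity threshold."""
    best_score, best_answer = 0.0, None
    for past_vec, answer in cache_entries:
        score = cosine_similarity(query_vec, past_vec)
        if score > best_score:
            best_score, best_answer = score, answer
    return best_answer if best_score >= threshold else None

# Toy embeddings standing in for real model output.
cache_entries = [
    ([0.9, 0.1, 0.0], "Go to Settings > Security > Reset Password."),
]
print(semantic_lookup([0.89, 0.11, 0.01], cache_entries))  # near-duplicate: hit
print(semantic_lookup([0.0, 0.2, 0.9], cache_entries))     # unrelated: miss -> None
```

On a miss (`None`), the caller falls through to the main vector database and the LLM, then stores the new answer in the cache.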
3. Implementation: Building a Semantic Cache (Python)
Using GPTCache or a custom Redis-based logic:
```python
import hashlib

import redis
from sentence_transformers import SentenceTransformer

# Setup
r = redis.Redis(host='localhost', port=6379, db=0)
model = SentenceTransformer('all-MiniLM-L6-v2')

def get_cached_result(query_text):
    # 1. Check the exact cache first (keyed by a hash of the raw query)
    key = hashlib.sha256(query_text.encode("utf-8")).hexdigest()
    exact_hit = r.get(f"exact:{key}")
    if exact_hit:
        return exact_hit

    # 2. Check the semantic cache (pseudo-code):
    #    embedding = model.encode(query_text)
    #    search a separate 'cache' collection in your vector DB;
    #    if the top hit clears the similarity threshold, return its answer
    return None
```
4. Cache Eviction: Staying Relevant
Vector data changes. If you updated your documentation today, yesterday's cache is probably wrong.
- Time-Based (TTL): Invalidate cache entries every 24 hours.
- Event-Based: When you run a `delete` or `update` on the Vector DB, clear the corresponding cache entries.
- Least Recently Used (LRU): Keep only the most common 1,000 queries.
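The first two strategies can be combined in a small sketch. This is an in-process illustration with hypothetical names (`TTLCache`, `invalidate`); a real deployment would typically lean on Redis's built-in key expiry (`EXPIRE`/`SETEX`) instead. A fake clock is injected so the expiry is deterministic:

```python
import time

class TTLCache:
    """Minimal time-based (TTL) cache: entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # lazily evict the stale entry
            return None
        return value

    def invalidate(self, key):
        # Event-based eviction: call this when the source document is
        # updated or deleted in the Vector DB.
        self._store.pop(key, None)

# Simulated clock so the expiry is deterministic.
now = [0.0]
cache = TTLCache(ttl_seconds=10, clock=lambda: now[0])
cache.set("q1", "cached answer")
print(cache.get("q1"))   # fresh: hit
now[0] = 11.0            # advance past the TTL
print(cache.get("q1"))   # expired: None
```

The same `invalidate` hook is what an event-based pipeline would call from its document-update handler.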
5. Summary and Key Takeaways
- Exact Cache: Best for high-traffic, repetitive apps.
- Semantic Cache: Best for flexible agents and chatbots.
- Reduced Cost: Caching can save you up to 80% on embedding and database costs.
- Invalidation: Always have a strategy to clear the cache when the source data changes.
In the next lesson, we’ll look at the big picture: Cost-Performance Trade-offs.