
Caching Strategies in RAG
Implement Multi-Level Caching to avoid redundant calculations and reduce RAG costs.
"The fastest query is the one you never had to run." Caching is the key to scaling RAG to thousands of concurrent users.
Layer 1: Prompt Caching (Model Layer)
As discussed in Module 16, this caches the "System Prompt" and "Context Documents" at the LLM provider level.
- Use Case: When 100 users are all asking questions about the same Policy_Manual.pdf (a minimal sketch of provider-level caching follows below).
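A minimal sketch of what this can look like with Anthropic's prompt-caching parameter; the model name, policy_manual_text, and user_question are illustrative assumptions, not part of the original example.

# Sketch: provider-level prompt caching via the Anthropic Messages API.
# policy_manual_text and user_question are assumed to exist elsewhere in your app.
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",               # illustrative model choice
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": policy_manual_text,               # large shared context (e.g. the Policy_Manual.pdf text)
            "cache_control": {"type": "ephemeral"},   # ask the provider to cache this prefix
        }
    ],
    messages=[{"role": "user", "content": user_question}],
)
print(response.content[0].text)

All 100 users share the same cached system prefix, so repeated requests pay the discounted cache-read rate for the manual and only the short user question is billed at the full input rate.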
Layer 2: Semantic Cache (Query Layer)
If User A asks "How do I reset my password?" and User B asks "Steps for password reset?", their queries are semantically identical.
- Implementation: Store the Query Embedding → Answer pair in a database (like Redis).
- Logic: If a new query embedding is > 0.98 similar to a cached query, return the cached answer immediately.
# Conceptual Semantic Cache (get_embedding and redis_db are placeholder helpers)
query_vec = get_embedding(query)                          # embed the incoming query
cached_hit = redis_db.search(query_vec, threshold=0.98)   # nearest cached query above the threshold
if cached_hit:
    return cached_hit.answer                              # hit: skip retrieval and generation entirely
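For a more concrete picture, here is a self-contained in-memory version of the same idea; embed and answer_with_rag are hypothetical stand-ins for your embedding model and full RAG pipeline.

# Runnable sketch: in-memory semantic cache with a cosine-similarity threshold.
import numpy as np

CACHE = []           # list of (query_embedding, answer) pairs
THRESHOLD = 0.98     # similarity above which two queries count as "the same"

def embed(text: str) -> np.ndarray:
    # Hypothetical stand-in: replace with a real embedding model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.normal(size=384)
    return vec / np.linalg.norm(vec)      # unit-length, so dot product == cosine similarity

def answer_with_rag(query: str) -> str:
    # Hypothetical stand-in for the full retrieve-and-generate pipeline.
    return f"(generated answer for: {query})"

def cached_answer(query: str) -> str:
    q_vec = embed(query)
    for vec, answer in CACHE:
        if float(np.dot(q_vec, vec)) >= THRESHOLD:   # semantically close enough: reuse the answer
            return answer
    answer = answer_with_rag(query)                   # miss: run the expensive pipeline once
    CACHE.append((q_vec, answer))
    return answer

In production you would back CACHE with a vector-capable store such as Redis and attach a TTL, but the hit/miss flow is the same.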
Layer 3: Embedding Cache
Generating embeddings for the same text over and over is a waste of money.
- Implementation: Cache Hash(string) → Vector (a sketch follows below).
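A minimal sketch of a hash-keyed embedding cache, assuming a hypothetical compute_embedding function that calls your embedding provider:

# Sketch: Hash(string) → Vector cache, so identical text is only embedded once.
# compute_embedding() is a hypothetical call to your embedding provider.
import hashlib

_embedding_cache: dict[str, list[float]] = {}

def get_embedding_cached(text: str) -> list[float]:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()    # stable content hash
    if key not in _embedding_cache:
        _embedding_cache[key] = compute_embedding(text)       # the paid API call runs only on a miss
    return _embedding_cache[key]

This matters most at ingestion time (re-indexing the same documents) and at query time (popular queries repeating verbatim).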
Layer 4: Fragment Caching
In a multimodal system, cache the OCR output or the audio transcript. These are expensive compute operations that should only happen once.
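The same content-hash pattern works here; run_ocr below is a hypothetical stand-in for whatever OCR or transcription engine you use.

# Sketch: cache OCR / transcript output keyed by the file's content hash.
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("fragment_cache")
CACHE_DIR.mkdir(exist_ok=True)

def get_ocr_text(file_path: str) -> str:
    digest = hashlib.sha256(Path(file_path).read_bytes()).hexdigest()
    cache_file = CACHE_DIR / f"{digest}.json"
    if cache_file.exists():                  # already extracted once: reuse the result
        return json.loads(cache_file.read_text())["text"]
    text = run_ocr(file_path)                # hypothetical, expensive extraction step
    cache_file.write_text(json.dumps({"text": text}))
    return text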
Cache Invalidation (The Hard Part)
When you update a document in your database, the invalidation must cascade through every layer (a sketch of the cascade follows the list):
- You must update the vector.
- You must clear the semantic cache for any queries that might have used that document.
- You must invalidate the fragment cache (cached OCR output or transcripts) for that document.
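Tying the layers together, one possible shape of the cascade; vector_store, semantic_cache, fragment_cache, and chunk_text are hypothetical interfaces, while get_ocr_text and get_embedding_cached are the helpers sketched above.

# Sketch: invalidation cascade when a source document changes.
def invalidate_document(doc_id: str, new_file_path: str) -> None:
    # 1. Drop the stale per-document caches.
    fragment_cache.delete(doc_id)                     # OCR / transcript output
    vector_store.delete(filter={"doc_id": doc_id})    # old chunk vectors

    # 2. Purge cached answers that were generated from this document.
    #    Requires each semantic-cache entry to record its source doc_ids.
    semantic_cache.delete_where(source_doc_id=doc_id)

    # 3. Re-ingest the new version: extract, chunk, embed, index.
    text = get_ocr_text(new_file_path)
    for chunk in chunk_text(text):                    # chunk_text is a hypothetical helper
        vector_store.upsert(doc_id=doc_id, chunk=chunk,
                            vector=get_embedding_cached(chunk))

The step most teams miss is (2): unless semantic-cache entries are tagged with the documents they were answered from, there is no reliable way to purge only the affected queries.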
Exercises
- What is the risk of a "Semantic Cache" returning an answer to a query that has a slightly different nuance?
- How long should a RAG cache last? 1 hour? 1 day?
- Design a cache-invalidation strategy for a news-ticker RAG.