
Caching Strategies in RAG
Implement Multi-Level Caching to avoid redundant calculations and reduce RAG costs.
"The fastest query is the one you never had to run." Caching is the key to scaling RAG to thousands of concurrent users.
Layer 1: Prompt Caching (Model Layer)
As discussed in Module 16, this caches the "System Prompt" and "Context Documents" at the LLM provider level.
- Use Case: When 100 users are all asking questions about the same Policy_Manual.pdf (a minimal sketch of provider-level caching follows below).
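A minimal sketch of what this can look like with Anthropic's prompt-caching parameter; the model name, policy_manual_text, and user_question are illustrative assumptions, not part of the original example.

# Sketch: provider-level prompt caching via the Anthropic Messages API.
# policy_manual_text and user_question are assumed to exist elsewhere in your app.
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",               # illustrative model choice
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": policy_manual_text,               # large shared context (e.g. the Policy_Manual.pdf text)
            "cache_control": {"type": "ephemeral"},   # ask the provider to cache this prefix
        }
    ],
    messages=[{"role": "user", "content": user_question}],
)
print(response.content[0].text)

All 100 users share the same cached system prefix, so repeated requests pay the discounted cache-read rate for the manual and only the short user question is billed at the full input rate.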
Layer 2: Semantic Cache (Query Layer)
If User A asks "How do I reset my password?" and User B asks "Steps for password reset?", their queries are semantically identical.
- Implementation: Store the Query Embedding → Answer pair in a database (like Redis).
- Logic: If a new query embedding is > 0.98 similar to a cached query, return the cached answer immediately.
# Conceptual Semantic Cache (get_embedding and redis_db are placeholder helpers)
query_vec = get_embedding(query)                          # embed the incoming query
cached_hit = redis_db.search(query_vec, threshold=0.98)   # nearest cached query above the threshold
if cached_hit:
    return cached_hit.answer                              # hit: skip retrieval and generation entirely
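For a more concrete picture, here is a self-contained in-memory version of the same idea; embed and answer_with_rag are hypothetical stand-ins for your embedding model and full RAG pipeline.

# Runnable sketch: in-memory semantic cache with a cosine-similarity threshold.
import numpy as np

CACHE = []           # list of (query_embedding, answer) pairs
THRESHOLD = 0.98     # similarity above which two queries count as "the same"

def embed(text: str) -> np.ndarray:
    # Hypothetical stand-in: replace with a real embedding model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.normal(size=384)
    return vec / np.linalg.norm(vec)      # unit-length, so dot product == cosine similarity

def answer_with_rag(query: str) -> str:
    # Hypothetical stand-in for the full retrieve-and-generate pipeline.
    return f"(generated answer for: {query})"

def cached_answer(query: str) -> str:
    q_vec = embed(query)
    for vec, answer in CACHE:
        if float(np.dot(q_vec, vec)) >= THRESHOLD:   # semantically close enough: reuse the answer
            return answer
    answer = answer_with_rag(query)                   # miss: run the expensive pipeline once
    CACHE.append((q_vec, answer))
    return answer

In production you would back CACHE with a vector-capable store such as Redis and attach a TTL, but the hit/miss flow is the same.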
Layer 3: Embedding Cache
Generating embeddings for the same text over and over is a waste of money.
- Implementation: Cache Hash(string) → Vector (a sketch follows below).
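A minimal sketch of a hash-keyed embedding cache, assuming a hypothetical compute_embedding function that calls your embedding provider:

# Sketch: Hash(string) → Vector cache, so identical text is only embedded once.
# compute_embedding() is a hypothetical call to your embedding provider.
import hashlib

_embedding_cache: dict[str, list[float]] = {}

def get_embedding_cached(text: str) -> list[float]:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()    # stable content hash
    if key not in _embedding_cache:
        _embedding_cache[key] = compute_embedding(text)       # the paid API call runs only on a miss
    return _embedding_cache[key]

This matters most at ingestion time (re-indexing the same documents) and at query time (popular queries repeating verbatim).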
Layer 4: Fragment Caching
In a multimodal system, cache the OCR output or the audio transcript. These are expensive compute operations that should only happen once.
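The same content-hash pattern works here; run_ocr below is a hypothetical stand-in for whatever OCR or transcription engine you use.

# Sketch: cache OCR / transcript output keyed by the file's content hash.
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("fragment_cache")
CACHE_DIR.mkdir(exist_ok=True)

def get_ocr_text(file_path: str) -> str:
    digest = hashlib.sha256(Path(file_path).read_bytes()).hexdigest()
    cache_file = CACHE_DIR / f"{digest}.json"
    if cache_file.exists():                  # already extracted once: reuse the result
        return json.loads(cache_file.read_text())["text"]
    text = run_ocr(file_path)                # hypothetical, expensive extraction step
    cache_file.write_text(json.dumps({"text": text}))
    return text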
Cache Invalidation (The Hard Part)
When you update a document in your database, the invalidation must cascade through every layer (a sketch of the cascade follows the list):
- You must update the vector.
- You must clear the semantic cache for any queries that might have used that document.
- You must invalidate the fragment cache (cached OCR output or transcripts) for that document.
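Tying the layers together, one possible shape of the cascade; vector_store, semantic_cache, fragment_cache, and chunk_text are hypothetical interfaces, while get_ocr_text and get_embedding_cached are the helpers sketched above.

# Sketch: invalidation cascade when a source document changes.
def invalidate_document(doc_id: str, new_file_path: str) -> None:
    # 1. Drop the stale per-document caches.
    fragment_cache.delete(doc_id)                     # OCR / transcript output
    vector_store.delete(filter={"doc_id": doc_id})    # old chunk vectors

    # 2. Purge cached answers that were generated from this document.
    #    Requires each semantic-cache entry to record its source doc_ids.
    semantic_cache.delete_where(source_doc_id=doc_id)

    # 3. Re-ingest the new version: extract, chunk, embed, index.
    text = get_ocr_text(new_file_path)
    for chunk in chunk_text(text):                    # chunk_text is a hypothetical helper
        vector_store.upsert(doc_id=doc_id, chunk=chunk,
                            vector=get_embedding_cached(chunk))

The step most teams miss is (2): unless semantic-cache entries are tagged with the documents they were answered from, there is no reliable way to purge only the affected queries.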
Exercises
- What is the risk of a "Semantic Cache" returning an answer to a query that has a slightly different nuance?
- How long should a RAG cache last? 1 hour? 1 day?
- Design a cache-invalidation strategy for a news-ticker RAG.