Module 15 Lesson 2: Semantic Caching
Save time, save money: use GPTCache to avoid calling the expensive LLM for identical (or semantically similar) queries.
Semantic Caching: Reusing Intelligence
In traditional web development, we use a cache (like Redis) to store the result of a database query. In AI development, we use Semantic Caching to store the result of an LLM query.
1. Traditional vs. Semantic
- Traditional Cache: keys on the exact string. If the user asks "Hi" and then asks "Hi " (with a trailing space), those are different keys, so the cache misses.
- Semantic Cache: keys on meaning. If the user asks "What's the weather?" and then "Tell me the weather," the cache recognizes that the intent is the same and returns the saved result instantly (see the sketch after this list).
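To make the difference concrete, here is a minimal sketch (not any particular library): a traditional cache keys on the raw string, while a semantic cache stores embedding vectors and matches by cosine similarity. The embed() function is a hypothetical stand-in for a real embedding model, and the 0.92 threshold is only an illustrative default.

```python
import numpy as np

exact_cache = {}       # traditional: exact string -> answer
semantic_cache = []    # semantic: list of (embedding vector, answer)

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for a real sentence-embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.standard_normal(384)
    return vec / np.linalg.norm(vec)

def exact_lookup(query: str):
    # "Hi" and "Hi " are different keys, so trivial variations miss.
    return exact_cache.get(query)

def semantic_lookup(query: str, threshold: float = 0.92):
    # Compare by cosine similarity (dot product of unit vectors), not by key.
    q = embed(query)
    for vec, answer in semantic_cache:
        if float(np.dot(q, vec)) >= threshold:
            return answer        # same intent -> hit
    return None                  # nothing stored is close enough -> miss
```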
2. Why it matters for Agents
- Cost: A cache hit costs $0.00 in tokens; only the first occurrence of a question pays for the LLM call.
- Speed: A cache lookup returns in roughly 10 ms, while an LLM call typically takes 2,000 ms or more.
- Consistency: You can ensure that common questions always receive the "Approved" company answer.
3. Visualizing the Cache Gate
```mermaid
graph LR
    User[Human Query] --> Emb[Embedding Model]
    Emb --> Search[Semantic Cache Search]
    Search -- Hit --> Result[Return Saved Answer]
    Search -- Miss --> Brain[LLM Brain]
    Brain --> Save[Save to Cache]
    Save --> Result
```
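The same gate expressed in code, reusing the embed, semantic_lookup, and semantic_cache names from the sketch above; call_llm is a hypothetical wrapper for whatever model client you use.

```python
def call_llm(query: str) -> str:
    """Hypothetical wrapper around your model provider (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError

def answer(query: str, threshold: float = 0.92) -> str:
    cached = semantic_lookup(query, threshold)       # Embedding Model + Cache Search
    if cached is not None:
        return cached                                # Hit: return the saved answer
    response = call_llm(query)                       # Miss: pay for the LLM call
    semantic_cache.append((embed(query), response))  # Save to Cache for next time
    return response
```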
4. Setting the "Similarity Threshold"
How "Similar" should two questions be before you use the cache?
- 0.99: Must be almost exactly the same words. (Very Safe).
- 0.85: Can be different words but same intent. (Risk of serving the wrong answer).
For customer support, 0.90 - 0.95 is usually the sweet spot.
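A tiny illustration of how the threshold decides hit versus miss. The 0.91 score is an assumed cosine similarity for the weather paraphrases above, not a measured value; real scores depend entirely on your embedding model.

```python
# Hypothetical similarity score for "What's the weather?" vs. "Tell me the weather".
score = 0.91

for threshold in (0.99, 0.95, 0.90, 0.85):
    verdict = "HIT (serve cached answer)" if score >= threshold else "MISS (call the LLM)"
    print(f"threshold={threshold:.2f} -> {verdict}")
```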
5. Tools for the Job
- GPTCache: A popular Python library that integrates with LangChain and handles the database lookups and similarity math for you (see the quick-start sketch after this list).
- Redis (Vector Search): Redis now supports vector search, making it a strong choice for high-traffic agent systems where the cache lookup itself must be fast.
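Below is a hedged quick-start in the spirit of GPTCache's documented similar-search example: it wires an ONNX embedding model, a SQLite + FAISS data manager, and a distance-based similarity evaluator into the cache, then routes OpenAI calls through GPTCache's adapter. Module paths and parameter names can shift between GPTCache versions, so treat this as a sketch rather than the definitive setup.

```python
from gptcache import cache, Config
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Embedding model + storage: SQLite for scalar data, FAISS for the vectors.
onnx = Onnx()
data_manager = get_data_manager(
    CacheBase("sqlite"),
    VectorBase("faiss", dimension=onnx.dimension),
)

# Initialize the global cache. The similarity_threshold knob is assumed here to
# reflect the 0.90-0.95 guidance above; check the GPTCache docs for how it
# interacts with your chosen evaluator.
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    config=Config(similarity_threshold=0.9),
)
cache.set_openai_key()

# Requests go through GPTCache's OpenAI adapter; a semantically similar prompt
# returns the cached answer instead of hitting the API.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Tell me the weather"}],
)
```

LangChain can also plug GPTCache in as its LLM cache backend; the exact hook has moved between LangChain versions, so check the current docs for the integration point.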
Key Takeaways
- Semantic Caching identifies identical intents, not just identical words.
- It is one of the most effective levers for reducing LLM costs in production.
- It significantly improves user experience by providing instant answers.
- Careful threshold management is required to prevent "Wrong Answer" hits.