Module 15 Lesson 2: Semantic Caching
Save time, save money: use GPTCache to avoid calling the expensive LLM for identical (or semantically similar) queries.
Semantic Caching: Reusing Intelligence
In traditional web development, we use a cache (like Redis) to store the result of a database query. In AI development, we use Semantic Caching to store the result of an LLM query.
1. Traditional vs. Semantic
- Traditional Cache: keys on the exact string. If the user asks "Hi" and then asks "Hi " (with a trailing space), those are different keys, so the cache misses.
- Semantic Cache: keys on meaning. If the user asks "What's the weather?" and then "Tell me the weather," the cache recognizes that the intent is the same and returns the saved result instantly (see the sketch after this list).
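To make the difference concrete, here is a minimal sketch (not any particular library): a traditional cache keys on the raw string, while a semantic cache stores embedding vectors and matches by cosine similarity. The embed() function is a hypothetical stand-in for a real embedding model, and the 0.92 threshold is only an illustrative default.

```python
import numpy as np

exact_cache = {}       # traditional: exact string -> answer
semantic_cache = []    # semantic: list of (embedding vector, answer)

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for a real sentence-embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.standard_normal(384)
    return vec / np.linalg.norm(vec)

def exact_lookup(query: str):
    # "Hi" and "Hi " are different keys, so trivial variations miss.
    return exact_cache.get(query)

def semantic_lookup(query: str, threshold: float = 0.92):
    # Compare by cosine similarity (dot product of unit vectors), not by key.
    q = embed(query)
    for vec, answer in semantic_cache:
        if float(np.dot(q, vec)) >= threshold:
            return answer        # same intent -> hit
    return None                  # nothing stored is close enough -> miss
```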
2. Why it matters for Agents
- Cost: A cache hit costs $0.00 in tokens; only the first occurrence of a question pays for the LLM call.
- Speed: A cache lookup returns in roughly 10 ms, while an LLM call typically takes 2,000 ms or more.
- Consistency: You can ensure that common questions always receive the "Approved" company answer.
3. Visualizing the Cache Gate
```mermaid
graph LR
    User[Human Query] --> Emb[Embedding Model]
    Emb --> Search[Semantic Cache Search]
    Search -- Hit --> Result[Return Saved Answer]
    Search -- Miss --> Brain[LLM Brain]
    Brain --> Save[Save to Cache]
    Save --> Result
```
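The same gate expressed in code, reusing the embed, semantic_lookup, and semantic_cache names from the sketch above; call_llm is a hypothetical wrapper for whatever model client you use.

```python
def call_llm(query: str) -> str:
    """Hypothetical wrapper around your model provider (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError

def answer(query: str, threshold: float = 0.92) -> str:
    cached = semantic_lookup(query, threshold)       # Embedding Model + Cache Search
    if cached is not None:
        return cached                                # Hit: return the saved answer
    response = call_llm(query)                       # Miss: pay for the LLM call
    semantic_cache.append((embed(query), response))  # Save to Cache for next time
    return response
```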
4. Setting the "Similarity Threshold"
How "Similar" should two questions be before you use the cache?
- 0.99: Must be almost exactly the same words. (Very Safe).
- 0.85: Can be different words but same intent. (Risk of serving the wrong answer).
For customer support, 0.90 - 0.95 is usually the sweet spot.
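A tiny illustration of how the threshold decides hit versus miss. The 0.91 score is an assumed cosine similarity for the weather paraphrases above, not a measured value; real scores depend entirely on your embedding model.

```python
# Hypothetical similarity score for "What's the weather?" vs. "Tell me the weather".
score = 0.91

for threshold in (0.99, 0.95, 0.90, 0.85):
    verdict = "HIT (serve cached answer)" if score >= threshold else "MISS (call the LLM)"
    print(f"threshold={threshold:.2f} -> {verdict}")
```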
5. Tools for the Job
- GPTCache: A popular Python library that integrates with LangChain and handles the database lookups and similarity math for you (see the quick-start sketch after this list).
- Redis (Vector Search): Redis now supports vector search, making it a strong choice for high-traffic agent systems where the cache lookup itself must be fast.
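Below is a hedged quick-start in the spirit of GPTCache's documented similar-search example: it wires an ONNX embedding model, a SQLite + FAISS data manager, and a distance-based similarity evaluator into the cache, then routes OpenAI calls through GPTCache's adapter. Module paths and parameter names can shift between GPTCache versions, so treat this as a sketch rather than the definitive setup.

```python
from gptcache import cache, Config
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Embedding model + storage: SQLite for scalar data, FAISS for the vectors.
onnx = Onnx()
data_manager = get_data_manager(
    CacheBase("sqlite"),
    VectorBase("faiss", dimension=onnx.dimension),
)

# Initialize the global cache. The similarity_threshold knob is assumed here to
# reflect the 0.90-0.95 guidance above; check the GPTCache docs for how it
# interacts with your chosen evaluator.
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    config=Config(similarity_threshold=0.9),
)
cache.set_openai_key()

# Requests go through GPTCache's OpenAI adapter; a semantically similar prompt
# returns the cached answer instead of hitting the API.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Tell me the weather"}],
)
```

LangChain can also plug GPTCache in as its LLM cache backend; the exact hook has moved between LangChain versions, so check the current docs for the integration point.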
Key Takeaways
- Semantic Caching identifies identical intents, not just identical words.
- It is one of the most effective levers for reducing LLM costs in production.
- It significantly improves user experience by providing instant answers.
- Careful threshold management is required to prevent "Wrong Answer" hits.