Module 14 Lesson 2: Caching and Retries

Stability and speed: how to use caching to avoid paying for redundant queries, and retries to recover from common network errors.

Caching and Retries: Production Hardening

In development, a small error or a slow response is a bug. In production, it's lost revenue. To make your LangChain app robust, you must implement caching (to avoid paying for the same answer twice) and retries (to recover from transient network failures).

1. LLM Caching

If 1,000 users all ask "What is the capital of France?", why pay OpenAI 1,000 times? Use a cache to store the answer.

from langchain.globals import set_llm_cache
from langchain_community.cache import SQLiteCache
from langchain_openai import ChatOpenAI

# Every request/response pair is persisted to a local SQLite database
set_llm_cache(SQLiteCache(database_path=".langchain.db"))

model = ChatOpenAI(model="gpt-4o-mini")  # any chat model works here

# The first call is slow and billed; the identical second call is instant and free
model.invoke("Hi")
model.invoke("Hi")
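
SQLite persists the cache across restarts. For unit tests or short-lived scripts, LangChain also ships an in-memory cache that never touches disk; a minimal sketch:

from langchain.globals import set_llm_cache
from langchain_core.caches import InMemoryCache

# The cache lives only for the lifetime of the process; nothing is written to disk
set_llm_cache(InMemoryCache())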

2. Handling Retries

Network errors happen: a provider might time out or return a 502. Instead of crashing the whole app, your system should wait briefly and try again, backing off a little longer after each failed attempt.

# Wrap the model with retry logic: up to 3 attempts, with exponential backoff and jitter
model_with_retries = model.with_retry(
    stop_after_attempt=3,
    wait_exponential_jitter=True,
)
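
By default, with_retry retries on any exception. You can narrow that so only transient failures trigger a retry while genuine bugs surface immediately; a sketch, assuming the openai SDK's RateLimitError is the transient error you care about:

from openai import RateLimitError

# Retry only on rate-limit errors; any other exception propagates immediately
model_with_retries = model.with_retry(
    retry_if_exception_type=(RateLimitError,),
    stop_after_attempt=3,
    wait_exponential_jitter=True,
)

model_with_retries.invoke("Hi")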

3. Visualizing the Resilience Loop

graph TD
    User[Query] --> Cache{In Cache?}
    Cache -->|Yes| Out[Return Instant Result]
    Cache -->|No| Model[Call LLM]
    Model -->|Success| Save[Save to Cache]
    Model -->|Fail| Retry{Try #2?}
    Retry -->|Yes| Model
    Retry -->|No| Fail[Return Error]
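
The same loop is easy to express in plain Python. A minimal sketch (resilient_query, cache, and call_llm are illustrative placeholders, not LangChain APIs):

def resilient_query(prompt, cache, call_llm, max_attempts=3):
    # 1. Cache hit: return instantly, at zero API cost
    if prompt in cache:
        return cache[prompt]
    # 2. Cache miss: call the LLM, retrying on failure
    for attempt in range(1, max_attempts + 1):
        try:
            answer = call_llm(prompt)
            cache[prompt] = answer  # 3. Save the success for next time
            return answer
        except Exception:
            if attempt == max_attempts:
                raise  # out of retries: surface the error to the caller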

4. Why Use Semantic Caching?

Standard caching requires an exact string match, so near-duplicate questions are billed as if they were new. Semantic caching (built on vector stores) matches on meaning instead; see the sketch below.

  • "Hi" ≠ "Hi!" to an exact-match cache, so both calls hit the API.
  • "What's the weather?" ≈ "Tell me the weather." to a semantic cache, which returns the same answer for both, saving even more money.

5. Engineering Tip: When NOT to cache

Never cache responses that are dynamic or sensitive. If a model serves that kind of data, opt it out of caching, as shown after this list.

  • Bad: caching "What is the current Bitcoin price?" (it will be wrong within minutes).
  • Bad: caching User A's private credit card details, which a later lookup could leak to someone else.
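
LangChain lets a single model opt out of the global cache via its cache parameter; a sketch:

from langchain_openai import ChatOpenAI

# cache=False makes this model ignore the global cache, so every call is live
live_model = ChatOpenAI(model="gpt-4o-mini", cache=False)

live_model.invoke("What is the current Bitcoin price?")  # always fresh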

Key Takeaways

  • Caching reduces costs and response times for redundant queries.
  • Retries handle temporary API outages and network glitches.
  • Exponential backoff (waiting longer for each retry) prevents overloading the server.
  • Semantic Caching is the next-level optimization for modern AI apps.
