Architectural Design for Caching-First Apps: Thinking in Blocks

Re-envision your AI backend for the caching era. Learn how to structure prompts as immutable layers, manage dynamic state, and build 'Caching-Native' applications.

The introduction of Prompt Caching requires a fundamental shift in how we build AI backends. In the old world, we built "Prompts." In the new world, we build "Layered Blocks."

To maximize the 90% discount on cached input tokens, you cannot simply dump strings into an API. You must architect your system so that the "Static" parts of your instructions are always separated from the "Dynamic" parts of your query.

In this lesson, you will learn the "Layered Block" architecture, how to handle "Dynamic Injections," and how to synchronize your cache strategy between the backend (FastAPI) and the frontend (React).


1. The Layered Block Architecture

Think of your prompt as a Stack of Bricks.

| Layer      | Type            | Change Rate | Caching Strategy      |
|------------|-----------------|-------------|-----------------------|
| Foundation | System Rules    | Monthly     | Always Cache          |
| Knowledge  | RAG Docs / PDFs | Weekly      | Cache (Session-based) |
| Context    | Last 5 messages | Hourly      | Cache (Ephemeral)     |
| Execution  | The User Query  | Instant     | Never Cache           |
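
Here is a minimal sketch of how these layers can map onto an API call, assuming the Anthropic Python SDK; the model ID and layer contents are illustrative placeholders:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder layer contents -- in practice these come from your prompt store.
FOUNDATION = "You are a contract-review assistant. Follow the house rules below."
KNOWLEDGE = "<full text of the uploaded PDF>"

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model ID
    max_tokens=1024,
    system=[
        # Foundation layer: changes ~monthly, so it sits first.
        {"type": "text", "text": FOUNDATION},
        # Knowledge layer: the cache breakpoint goes on the LAST static
        # block, caching everything up to and including it.
        {"type": "text", "text": KNOWLEDGE,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[
        # Execution layer: the user query never joins the cached prefix.
        {"role": "user", "content": "Summarize clause 4.2."},
    ],
)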

2. Rule #1: The Prefix Invariant

If you change a single character at the beginning of your prompt, the cache for the entire rest of the prompt is invalidated.

Bad Architecture: [User Name] + [System Prompt] + [Large Data] (Every new user breaks the cache for the system prompt).

Good Architecture: [System Prompt] + [Large Data] + [User Name] (The foundation remains cached, and only the tail changes).
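
In code, the difference is purely one of concatenation order. A minimal sketch (variable contents are illustrative):

system_prompt = "You are a support agent for Acme Corp."  # static
large_data = "<10,000-token knowledge file>"              # static per session
user_name = "Alice"                                       # dynamic per request

# BAD: the per-user greeting occupies position 0, so every new user
# produces a different prefix and the system prompt is never reused.
bad_prompt = f"User: {user_name}\n{system_prompt}\n{large_data}"

# GOOD: the static blocks form a byte-identical prefix for every user;
# only the tail after the cached region changes.
good_prompt = f"{system_prompt}\n{large_data}\nUser: {user_name}"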


3. Handling Dynamic Data (The Injection Point)

Sometimes you must have dynamic data (like a date or a user ID) inside your instructions.

The Solution: Move all dynamic metadata to the End of the Input, as the flow below illustrates:

graph TD
    S[System Instructions: Cached] --> D[RAG Data: Cached]
    D --> U[User Specifics: NEW]
    U --> Q[The Question: NEW]
    
    style S fill:#69f
    style D fill:#69f
    style U fill:#f66
    style Q fill:#f66
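
In practice, this means appending metadata such as the date or user ID after the cached prefix rather than interpolating it into the instructions. A sketch, with illustrative prompt text:

from datetime import date

SYSTEM = "You are a scheduling assistant. Resolve relative dates yourself."
RAG_DATA = "<retrieved calendar documentation>"

def build_messages(user_id: str, question: str) -> dict:
    # The cached prefix is byte-identical across all users and all days.
    static_prefix = f"{SYSTEM}\n\n{RAG_DATA}"
    # Dynamic metadata lives in the tail, after the cached region.
    dynamic_tail = f"Today is {date.today().isoformat()}. User ID: {user_id}."
    return {
        "system": static_prefix,
        "user": f"{dynamic_tail}\n\nQuestion: {question}",
    }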

4. Implementation: The Immutable Dispatcher (FastAPI)

In a caching-native app, you create "Immutable Classes" for your prompt fragments. This prevents accidental variability (e.g., adding a space or a newline that breaks the cache).

Python Code: The Immutable Prompt Layer

from pydantic import BaseModel, ConfigDict

class PromptLayer(BaseModel):
    # frozen=True makes instances truly immutable: no code path can
    # accidentally mutate a layer's content and silently break the cache.
    model_config = ConfigDict(frozen=True)

    content: str
    is_cached: bool = True

class CachingDispatcher:
    def assemble(self, layers: list[PromptLayer], user_query: str) -> dict:
        # Cached layers always come FIRST (False sorts before True), and
        # the sort is stable, so their relative order is preserved. The
        # assembly logic never modifies the layer contents themselves.
        sorted_layers = sorted(layers, key=lambda layer: not layer.is_cached)

        system_block = "\n".join(layer.content for layer in sorted_layers)

        return {
            "system": system_block,
            "user": user_query,
        }
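
Usage looks like this (layer contents are placeholders):

layers = [
    PromptLayer(content="You are a legal assistant.", is_cached=True),
    PromptLayer(content="<10,000-token contract text>", is_cached=True),
    PromptLayer(content="User tier: premium", is_cached=False),
]

dispatcher = CachingDispatcher()
payload = dispatcher.assemble(layers, "Is clause 4.2 enforceable?")
# payload["system"] begins with the two cached layers, byte-identical on
# every call; only the uncached tail and payload["user"] vary.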

5. Caching and Multi-Agent Orchestration (LangGraph)

In LangGraph, a single run involves many steps, and each step usually re-sends the same "Plan" or "Description of Tools."

Strategy: Cache the Tool Definitions at the top of the system prompt for every node in the graph. Even as the "Agent State" changes, the description of how to use a calculator or a database remains cached, saving thousands of tokens across the agent's lifetime.
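
A minimal sketch of the idea, with plain functions standing in for graph nodes (the LangGraph wiring itself is omitted, and the tool list is illustrative):

# Defined once at import time: byte-identical for every node, every turn.
TOOL_DEFINITIONS = """Available tools:
- calculator(expression): evaluate an arithmetic expression.
- query_db(sql): run a read-only SQL query."""

def build_node_prompt(node_instructions: str, agent_state: str) -> str:
    # Shared cached prefix first, then per-node and per-turn material.
    return f"{TOOL_DEFINITIONS}\n\n{node_instructions}\n\nState:\n{agent_state}"

planner_prompt = build_node_prompt("You are the planner. Break the task into steps.", "step=1")
executor_prompt = build_node_prompt("You are the executor. Run the current step.", "step=2")
# Both prompts share the TOOL_DEFINITIONS prefix, so every node's model
# call can hit the same cache entry for those tokens.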


6. The "Cache-Aware" UI (React)

You can inform the user that their current session is "Optimized."

const CacheStatusIndicator = ({ hitRate }) => {
  return (
    <div className="flex items-center gap-2 text-xs">
      {/* Dot turns green once the measured cache hit rate passes 80% */}
      <div className={`w-2 h-2 rounded-full ${hitRate > 0.8 ? 'bg-green-500' : 'bg-yellow-500'}`} />
      <span className="text-slate-400">
        {hitRate > 0.8 ? 'Efficiency Active (90% Savings)' : 'Initializing Context...'}
      </span>
    </div>
  );
};

This builds user trust: they know that asking long, complex follow-up questions is now more efficient than starting a new chat every time.


7. Summary and Key Takeaways

  1. Think in Layers: Segregate your prompt into Static, Semi-Static, and Dynamic blocks.
  2. Order Matters: Cached blocks must ALWAYS come before dynamic blocks.
  3. Immutability: Don't let your code "tinker" with the strings of your system prompt.
  4. Agentic Consistency: Reuse tool definitions and global rules across all agents to maximize shared hits.

Exercise: The Architect's Refactor

  1. Take a prompt that currently looks like this: "Today is {DATE}. You are helping {USER_NAME}. Here is the file: {FILE_CONTENT}. Question: {QUERY}"
  2. Refactor it for maximum caching efficiency.
  • (Hint: Move {DATE} and {USER_NAME} to the bottom. Keep {FILE_CONTENT} and the System Identity at the top).
  • If the file is 10,000 tokens, what is the Cost Difference per 10 questions before and after the refactor? (A sketch for checking your answer follows below.)
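
To check your answer, here is a sketch of the arithmetic under assumed rates: $3 per million input tokens, a 25% premium to write the cache, and a 90% discount to read it:

PRICE_PER_TOKEN = 3 / 1_000_000  # assumed base input price: $3 / MTok
CACHE_WRITE = 1.25               # assumed 25% premium on cache writes
CACHE_READ = 0.10                # assumed 90% discount on cache reads

file_tokens = 10_000
questions = 10

# Before: the 10,000-token file is re-sent at full price every question.
before = questions * file_tokens * PRICE_PER_TOKEN

# After: one cache write, then nine cache reads of the same prefix.
after = (file_tokens * CACHE_WRITE
         + (questions - 1) * file_tokens * CACHE_READ) * PRICE_PER_TOKEN

print(f"before=${before:.2f}, after=${after:.2f}")
# before=$0.30, after=$0.06 -- roughly a 79% saving on these ten calls.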

Congratulations on completing Module 5! You are now a master of the Caching-First AI architecture.
