
Context Management: Sliding Windows vs. Summary Windows
Learn the two primary strategies for managing long conversations. Master the art of 'Context Truncation' and 'Semantic Compression' to keep your agent's memory lean and focused.
When a conversation or an agent reasoning loop goes on for a long time, the context grows linearly. If left unchecked, you will eventually hit "The Wall" (Module 1.3) or go broke (Module 1.4).
To prevent this, you must implement a Memory Strategy.
In this lesson, we compare the two industry-standard strategies for memory management: Sliding Windows and Summary Windows. We will walk through the technical implementation of each and learn when to choose a "Verbatim Record" over a "Semantic Essence."
1. The Sliding Window (Fixed Memory)
A sliding window maintains only the most recent $N$ messages or $T$ tokens. Older messages are simply "forgotten" (deleted from the prompt).
- Pros: Zero computational overhead. Perfect recency performance: the most recent turns are always present verbatim.
- Cons: Total amnesia regarding the beginning of the chat. The model "forgets" its own introduction.
- Best For: Customer support bots where only the current problem matters.
```mermaid
graph LR
    subgraph "Full History"
        M1[Msg 1]
        M2[Msg 2]
        M3[Msg 3]
        M4[Msg 4]
    end
    M2 & M3 & M4 --> SW[Sliding Window: Last 3]
    M1 -.->|DISCARDED| TRASH
```
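To make this concrete, here is a minimal token-based sliding window. It assumes messages are plain strings and uses the same `tiktoken` encoder as the hybrid manager later in this lesson; treat it as a sketch, not a production implementation.

```python
import tiktoken

def sliding_window(history: list[str], max_tokens: int) -> list[str]:
    """Keep only the most recent messages that fit within max_tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    kept, total = [], 0
    for msg in reversed(history):       # walk backwards from the newest message
        cost = len(enc.encode(msg))
        if total + cost > max_tokens:
            break                       # everything older is discarded
        kept.append(msg)
        total += cost
    return list(reversed(kept))         # restore chronological order
```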
2. The Summary Window (Semantic Memory)
A summary window takes the older messages and uses a cheap model (like Claude Haiku or Llama 3 8B) to compress them into a few bullet points.
- Pros: Preserves important context across thousands of turns.
- Cons: Adds latency (requires an extra LLM call). Risk of "Summarization Loss."
- Best For: Long-running creative projects, legal analysis, and long-term personal assistants.
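A minimal sketch of the compression step. `call_cheap_model` is a hypothetical placeholder for whatever inexpensive LLM wrapper you use; the stub below just stands in for a real API call.

```python
def call_cheap_model(prompt: str) -> str:
    # Stub: replace with a real call to an inexpensive model (e.g., Claude Haiku).
    return "- [bullet-point summary of the provided turns]"

def compress_history(history: list[str], keep_recent: int = 4) -> tuple[str, list[str]]:
    """Summarize older turns with a cheap model; keep the newest verbatim."""
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = call_cheap_model(
        "Compress these turns into a few bullet points:\n" + "\n".join(old)
    )
    return summary, recent
```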
3. Implementation: The Hybrid Memory (Python)
Senior AI engineers rarely use just one. They use a Hybrid Memory Manager.
Python Code: The Token-Aware Hybrid Manager
```python
import tiktoken

class HybridMemory:
    def __init__(self, limit=4000):
        self.limit = limit  # hard token cap before consolidation triggers
        self.tokenizer = tiktoken.get_encoding("cl100k_base")
        self.history = []   # recent messages, kept verbatim
        self.summary = ""   # compressed record of the distant past

    def add_message(self, msg):
        self.history.append(msg)
        total_tokens = sum(len(self.tokenizer.encode(m)) for m in self.history)
        if total_tokens > self.limit:
            self.consolidate()

    def consolidate(self):
        """
        Take the older half of history, summarize it,
        and keep the newer half verbatim.
        """
        midpoint = len(self.history) // 2
        to_summarize = self.history[:midpoint]
        self.history = self.history[midpoint:]
        # In production: self.summary += call_cheap_model(to_summarize)
        self.summary += "\n[SUMMARY of previous turns: ...]"
        print("Condensed memory to save tokens.")
```
4. Comparing the Token ROI
| Feature | Sliding Window | Summary Window |
|---|---|---|
| Token Cost | Fixed (Low) | Fixed + Summary (Low) |
| Compute Overhead | Zero | Extra LLM call every N turns |
| UX Feel | "Fast but Forgetful" | "Slower but Smart" |
Architectural Tip: If your summary call costs $0.0001 but saves $0.01 per subsequent query over the next 10 turns, you spend $0.0001 to save $0.10, a roughly 1,000x return. Summarization is almost always a financial win in long conversations.
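The back-of-the-envelope math from the tip above, spelled out:

```python
summary_cost = 0.0001              # one cheap summarization call
savings = 0.01 * 10                # $0.01 saved per query across 10 turns
roi = (savings - summary_cost) / summary_cost
print(f"ROI: {roi:.0f}x")          # ROI: 999x, i.e. roughly 1,000x
```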
5. Memory Management in Multi-Agent Systems (LangGraph)
In LangGraph, you can use a "Checkpointer" to save the state.
The Caching-First Agent:
- Every 10 steps, the agent calls a "Reflection Node."
- The Reflection Node takes the raw `History` and writes a `Status Update` to the `State`.
- The raw `History` is then truncated.
- Future agents only see the `Status Update`, keeping the context window thin and efficient (see the sketch below).
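A framework-agnostic sketch of that Reflection Node (this is not LangGraph's exact API; `call_cheap_model` is the same hypothetical stub from Section 2):

```python
def reflection_node(state: dict) -> dict:
    """Compress the raw history into a short status update, then truncate it."""
    status_update = call_cheap_model(
        "Write a status update covering all work completed so far:\n"
        + "\n".join(state["history"])
    )
    # Downstream nodes read the status update instead of the full history.
    return {"history": [], "status_update": status_update}
```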
6. Summary and Key Takeaways
- Sliding Windows: Good for speed and "Now-relevant" tasks.
- Summary Windows: Good for continuity and complex, long-term state.
- Hybrid is Best: Summarize the distant past, keep the recent past verbatim.
- Token Caps: Always have a hard limit (e.g., 4k tokens) before triggering memory cleanup.
In the next lesson, Selection and Pruning Strategies, we learn the algorithms for deciding which specific sentences are worth keeping and which are trash.
Exercise: The Memory Budgeter
- You have a conversation with 50 messages. Each message is 200 tokens.
- Total Tokens: 10,000.
- Plan a memory strategy that keeps the total token count per request below 2,000.
- How many verbatim messages can you keep?
- How many words should your summary be to fit the remaining budget?
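To sanity-check your plan, here is a tiny budget checker. The 200-token message size and 2,000-token cap come from the exercise; the 0.75 words-per-token ratio is a rough heuristic for English text.

```python
MSG_TOKENS = 200   # per-message size from the exercise
BUDGET = 2000      # target cap per request

def plan_fits(n_verbatim: int, summary_tokens: int) -> bool:
    """Check whether n verbatim messages plus a summary fit the budget."""
    return n_verbatim * MSG_TOKENS + summary_tokens <= BUDGET

# Example plan: 7 verbatim messages (1,400 tokens) leaves 600 tokens for the summary.
print(plan_fits(7, 600))           # True
print(int(600 * 0.75), "words")    # ~450 words of summary, give or take
```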