
Context Management: Sliding Windows vs. Summary Windows
Learn the two primary strategies for managing long conversations. Master the art of 'Context Truncation' and 'Semantic Compression' to keep your agent's memory lean and focused.
When a conversation or an agent reasoning loop goes on for a long time, the context grows linearly. If left unchecked, you will eventually hit "The Wall" (Module 1.3) or go broke (Module 1.4).
To prevent this, you must implement a Memory Strategy.
In this lesson, we compare the two industry-standard strategies for memory management: Sliding Windows and Summary Windows. We will walk through the technical implementation of each and learn when to choose a "Verbatim Record" over a "Semantic Essence."
1. The Sliding Window (Fixed Memory)
A sliding window maintains only the most recent $N$ messages or $T$ tokens. Older messages are simply "forgotten" (deleted from the prompt).
- Pros: Zero computational overhead. Perfect recency performance: the most recent turns are always present verbatim.
- Cons: Total amnesia regarding the beginning of the chat. The model "forgets" its own introduction.
- Best For: Customer support bots where only the current problem matters.
```mermaid
graph LR
    subgraph "Full History"
        M1[Msg 1]
        M2[Msg 2]
        M3[Msg 3]
        M4[Msg 4]
    end
    M2 & M3 & M4 --> SW[Sliding Window: Last 3]
    M1 -.->|DISCARDED| TRASH
```
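To make this concrete, here is a minimal token-based sliding window. It assumes messages are plain strings and uses the same `tiktoken` encoder as the hybrid manager later in this lesson; treat it as a sketch, not a production implementation.

```python
import tiktoken

def sliding_window(history: list[str], max_tokens: int) -> list[str]:
    """Keep only the most recent messages that fit within max_tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    kept, total = [], 0
    for msg in reversed(history):       # walk backwards from the newest message
        cost = len(enc.encode(msg))
        if total + cost > max_tokens:
            break                       # everything older is discarded
        kept.append(msg)
        total += cost
    return list(reversed(kept))         # restore chronological order
```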
2. The Summary Window (Semantic Memory)
A summary window takes the older messages and uses a cheap model (like Claude Haiku or Llama 3 8B) to compress them into a few bullet points.
- Pros: Preserves important context across thousands of turns.
- Cons: Adds latency (requires an extra LLM call). Risk of "Summarization Loss."
- Best For: Long-running creative projects, legal analysis, and long-term personal assistants.
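A minimal sketch of the compression step. `call_cheap_model` is a hypothetical placeholder for whatever inexpensive LLM wrapper you use; the stub below just stands in for a real API call.

```python
def call_cheap_model(prompt: str) -> str:
    # Stub: replace with a real call to an inexpensive model (e.g., Claude Haiku).
    return "- [bullet-point summary of the provided turns]"

def compress_history(history: list[str], keep_recent: int = 4) -> tuple[str, list[str]]:
    """Summarize older turns with a cheap model; keep the newest verbatim."""
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = call_cheap_model(
        "Compress these turns into a few bullet points:\n" + "\n".join(old)
    )
    return summary, recent
```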
3. Implementation: The Hybrid Memory (Python)
Senior AI engineers rarely use just one. They use a Hybrid Memory Manager.
Python Code: The Token-Aware Hybrid Manager
```python
import tiktoken

class HybridMemory:
    def __init__(self, limit=4000):
        self.limit = limit  # hard token cap before consolidation triggers
        self.tokenizer = tiktoken.get_encoding("cl100k_base")
        self.history = []   # recent messages, kept verbatim
        self.summary = ""   # compressed record of the distant past

    def add_message(self, msg):
        self.history.append(msg)
        total_tokens = sum(len(self.tokenizer.encode(m)) for m in self.history)
        if total_tokens > self.limit:
            self.consolidate()

    def consolidate(self):
        """
        Take the older half of history, summarize it,
        and keep the newer half verbatim.
        """
        midpoint = len(self.history) // 2
        to_summarize = self.history[:midpoint]
        self.history = self.history[midpoint:]
        # In production: self.summary += call_cheap_model(to_summarize)
        self.summary += "\n[SUMMARY of previous turns: ...]"
        print("Condensed memory to save tokens.")
```
4. Comparing the Token ROI
| Feature | Sliding Window | Summary Window |
|---|---|---|
| Token Cost | Fixed (Low) | Fixed + Summary (Low) |
| Compute Overhead | Zero | Extra LLM call every N turns |
| UX Feel | "Fast but Forgetful" | "Slower but Smart" |
Architectural Tip: If your summary call costs $0.0001 but saves $0.01 per subsequent query over the next 10 turns, you spend $0.0001 to save $0.10, a roughly 1,000x return. Summarization is almost always a financial win in long conversations.
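The back-of-the-envelope math from the tip above, spelled out:

```python
summary_cost = 0.0001              # one cheap summarization call
savings = 0.01 * 10                # $0.01 saved per query across 10 turns
roi = (savings - summary_cost) / summary_cost
print(f"ROI: {roi:.0f}x")          # ROI: 999x, i.e. roughly 1,000x
```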
5. Memory Management in Multi-Agent Systems (LangGraph)
In LangGraph, you can use a "Checkpointer" to save the state.
The Caching-First Agent:
- Every 10 steps, the agent calls a "Reflection Node."
- The Reflection Node takes the raw `History` and writes a `Status Update` to the `State`.
- The raw `History` is then truncated.
- Future agents only see the `Status Update`, keeping the context window thin and efficient (see the sketch below).
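A framework-agnostic sketch of that Reflection Node (this is not LangGraph's exact API; `call_cheap_model` is the same hypothetical stub from Section 2):

```python
def reflection_node(state: dict) -> dict:
    """Compress the raw history into a short status update, then truncate it."""
    status_update = call_cheap_model(
        "Write a status update covering all work completed so far:\n"
        + "\n".join(state["history"])
    )
    # Downstream nodes read the status update instead of the full history.
    return {"history": [], "status_update": status_update}
```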
6. Summary and Key Takeaways
- Sliding Windows: Good for speed and "Now-relevant" tasks.
- Summary Windows: Good for continuity and complex, long-term state.
- Hybrid is Best: Summarize the distant past, keep the recent past verbatim.
- Token Caps: Always have a hard limit (e.g., 4k tokens) before triggering memory cleanup.
In the next lesson, Selection and Pruning Strategies, we learn the algorithms for deciding which specific sentences are worth keeping and which are trash.
Exercise: The Memory Budgeter
- You have a conversation with 50 messages. Each message is 200 tokens.
- Total Tokens: 10,000.
- Plan a memory strategy that keeps the total token count per request below 2,000.
- How many verbatim messages can you keep?
- How many words should your summary be to fit the remaining budget?
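To sanity-check your plan, here is a tiny budget checker. The 200-token message size and 2,000-token cap come from the exercise; the 0.75 words-per-token ratio is a rough heuristic for English text.

```python
MSG_TOKENS = 200   # per-message size from the exercise
BUDGET = 2000      # target cap per request

def plan_fits(n_verbatim: int, summary_tokens: int) -> bool:
    """Check whether n verbatim messages plus a summary fit the budget."""
    return n_verbatim * MSG_TOKENS + summary_tokens <= BUDGET

# Example plan: 7 verbatim messages (1,400 tokens) leaves 600 tokens for the summary.
print(plan_fits(7, 600))           # True
print(int(600 * 0.75), "words")    # ~450 words of summary, give or take
```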