
Token Economics: Managing Context and Constraints
Master the technical limitations of tokens in agentic systems. Learn strategies for summarization, message pruning, and context management to prevent system overflow.
Token Usage as a System Constraint
Tokens are the "Oxygen" of an AI agent. They are the fundamental unit of consumption and the primary limiting factor of your system's intelligence and memory. Unlike traditional software where memory is measured in Gigabytes, in Agentic Engineering, we measure in Context Window Tokens.
In this lesson, we will look at how tokens constrain your agentic architecture and how to build systems that scale without hitting the "Token Wall."
1. The Context Window Barrier
Every LLM has a hard limit on how many tokens it can process at one time.
- Claude 3.5 Sonnet: 200,000 tokens.
- GPT-4o: 128,000 tokens.
- Gemini 1.5 Pro: 2,000,000 tokens.
While these numbers sound large, an agentic loop consumes them at an alarming rate.
The Problem: Token Bloat
Each time the agent takes a step in a LangGraph workflow, the entire state (history + tools + reasoning) is sent back to the model.
- Step 1: 1,000 tokens.
- Step 2: 2,500 tokens (Includes Step 1).
- Step 3: 4,200 tokens (Includes Step 1 and 2).
- Total Consumption: 7,700 tokens across 3 turns.
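Here is a minimal sketch of that compounding. The per-step figures match the list above and are purely illustrative:

```python
# Illustrative only: each step re-sends the entire accumulated history,
# so total tokens consumed grow roughly quadratically with turn count.
new_tokens_per_step = [1_000, 1_500, 1_700]  # fresh tokens added at each step

context_size = 0
total_sent = 0
for step, new_tokens in enumerate(new_tokens_per_step, start=1):
    context_size += new_tokens   # context now includes every prior step
    total_sent += context_size   # the whole context is sent to the model again
    print(f"Step {step}: context={context_size:,}, cumulative sent={total_sent:,}")

# Step 1: context=1,000 | Step 2: context=2,500 | Step 3: context=4,200
# Cumulative sent: 1,000 + 2,500 + 4,200 = 7,700
```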
2. Managing Chat History: The "State" Cleanup
To prevent an agent from becoming a "Zombie" (too much context, no intelligence), you must implement Memory Management.
Strategy 1: Message Pruning
Keep only the last $N$ messages.
- Pro: Minimal latency, predictable token usage.
- Con: The agent "forgets" the beginning of the task.
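A minimal pruning sketch, assuming messages are plain dicts with `role` and `content` keys (the system message is pinned so it never falls out of the window):

```python
def prune_messages(messages: list[dict], keep_last: int = 6) -> list[dict]:
    """Keep the system message plus only the last `keep_last` chat messages."""
    system = [m for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    return system + chat[-keep_last:]
```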
Strategy 2: Summarization (Recursive Memory)
When the message history exceeds a threshold (e.g., 20 messages), trigger a "Summary Node."
- A separate LLM call summarizes the history into 500 tokens.
- The old messages are deleted.
- The summary is added as a "System Message" at the top of the new context.
```mermaid
graph LR
    History[30 Messages] -->|Too Long| Node[Summarizer]
    Node -->|Reduced| NewHistory[1 Summary + 2 New Messages]
```
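A sketch of that Summary Node, assuming a generic `llm` callable that takes a prompt string and returns text (swap in your actual model client):

```python
SUMMARY_THRESHOLD = 20  # trigger the summary node past this many messages

def maybe_summarize(messages: list[dict], llm) -> list[dict]:
    """Collapse old history into a single system-level summary message."""
    if len(messages) <= SUMMARY_THRESHOLD:
        return messages

    old, recent = messages[:-2], messages[-2:]  # keep the 2 newest messages
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = llm(f"Summarize this conversation in under 500 tokens:\n{transcript}")

    # Old messages are deleted; the summary takes their place at the top.
    return [{"role": "system", "content": f"Summary of earlier conversation: {summary}"}] + recent
```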
Strategy 3: Dynamic Compression
Use a tool like `tiktoken` to count tokens in real time. If you are near the limit, remove non-essential metadata or tool outputs from the history.
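A real-time counting sketch with `tiktoken` (the `cl100k_base` encoding is an approximation; each provider tokenizes slightly differently):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # approximation; providers differ

def count_tokens(messages: list[dict]) -> int:
    """Approximate total token count for a list of message dicts."""
    return sum(len(enc.encode(m["content"])) for m in messages)

def compress_if_needed(messages: list[dict], limit: int = 100_000) -> list[dict]:
    """Drop the oldest tool outputs first when nearing the context limit."""
    while count_tokens(messages) > limit:
        idx = next((i for i, m in enumerate(messages) if m["role"] == "tool"), None)
        if idx is None:
            break  # nothing non-essential left to drop
        messages.pop(idx)
    return messages
```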
3. Tool Output Constraints (The "Sewer" of Tokens)
Tool outputs are often the biggest contributors to context overflow.
- The mistake: An agent calls a `web_search` tool, which returns 50,000 tokens of raw HTML.
- The result: The agent immediately hits its context limit or becomes incredibly slow.
The Solution: Filtering Tools
Never allow a tool to return raw data to the agent's main state.
- Bad: `return html_body`
- Good: `return extract_summary(html_body)`
- Best: return a list of relevant snippets with IDs
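A sketch of a filtering wrapper; `raw_web_search` and `strip_html` are hypothetical placeholders for your actual search tool and HTML cleaner:

```python
MAX_TOOL_TOKENS = 2_000  # hard budget for any single tool result

def safe_web_search(query: str) -> list[dict]:
    """Run the raw search, but hand the agent only compact, ID'd snippets."""
    raw_html = raw_web_search(query)   # hypothetical underlying search tool
    text = strip_html(raw_html)        # hypothetical HTML-to-text helper

    snippets, used = [], 0
    for i, chunk in enumerate(text.split("\n\n")):
        cost = len(chunk) // 4         # rough chars-per-token estimate
        if used + cost > MAX_TOOL_TOKENS:
            break                      # stop before blowing the budget
        snippets.append({"id": f"snippet-{i}", "text": chunk})
        used += cost
    return snippets
```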
4. Input vs. Output Token Asymmetry
Models have different pricing and performance profiles for Input vs. Output tokens.
- Input (Prompting): Fast, cheaper, but bounded by context window.
- Output (Generation): Slow, 5-10x more expensive, and bounded by a smaller output limit (usually 4k-8k tokens).
Implication for Agents: Design your agent to think in Short Bursts. Instead of asking an agent to "Write a full book" (which will fail due to output limits), have it "Write Chapter 1" -> Save -> "Write Chapter 2".
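A sketch of the short-burst pattern, assuming the same generic `llm` callable as above and a hypothetical `save_chapter` persistence helper:

```python
def write_book(llm, outline: list[str]) -> None:
    """Generate one chapter per call instead of one giant generation."""
    summary_so_far = ""
    for i, title in enumerate(outline, start=1):
        chapter = llm(
            f"Story so far (summary): {summary_so_far}\n"
            f"Write Chapter {i}: {title}"
        )
        save_chapter(i, chapter)  # hypothetical persistence helper
        summary_so_far = llm(f"Summarize this chapter in a few sentences:\n{chapter}")
```

Each call stays well under the output limit, and the rolling summary keeps the input side small too.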
5. Estimating Cost: The Unit Economics
Before deploying an agent, you must calculate its Unit Margin.
| Component | Average Tokens |
|---|---|
| System Prompt | 1,000 |
| Tool Definitions | 1,500 |
| Per User Message | 200 |
| Per Agent Step | 800 |
| Session (10 turns) | ~50,000 - 150,000 cumulative |
If a session costs $0.50 to $1.50 in tokens, can your business charge enough to make a profit?
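A back-of-envelope calculator using the table above. The per-million-token rates here are placeholders; substitute your provider's current pricing:

```python
# Placeholder rates in USD per 1M tokens; check your provider's pricing page.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00

def session_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A 10-turn session at ~150k cumulative input and ~8k output tokens:
print(f"${session_cost(150_000, 8_000):.2f}")  # ≈ $0.57
```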
6. Token Guardrails in LangGraph
We will implement a MaxTurns constraint in every graph. This is the ultimate safety switch.
```python
def check_recursion_limit(state: AgentState) -> str:
    # Hard stop: never let the agent loop more than 10 steps.
    if len(state["steps_taken"]) > 10:
        return "ERROR: Recursion Limit Exceeded"
    return "CONTINUE"
```
Summary and Mental Model
Tokens are like Memory Slots in an old console. You have a limited number. If you fill them with "junk" (long tool outputs), you don't have room for "thinking" (reasoning).
An efficient agent engineer is a minimalist.
In the next lesson, we will look at Guardrails and Failure Handling—the mechanisms we use to stop an agent when it starts wasting its precious "Oxygen" (Tokens) on a failing path.
Exercise: Token Optimization
- Pruning Design: A user is having a technical support conversation.
  - If you prune the first 5 messages, what critical info might the agent forget? (Address? Device ID?)
  - How would you use a "Fixed Metadata" field in the `State` to keep that info forever while pruning the chat messages?
- The Calculation: A model has a 128k context window. Each turn adds 5,000 tokens.
  - At which turn will the model hit its limit?
  - How many turns can it sustain if you use "Summarization" to cap the history at 20,000 tokens?
- Tool Safety: Write a Python function that takes a string of any length and returns only the first 500 words for the agent to read.