
Token Economics: Managing Context and Constraints
Master the technical limitations of tokens in agentic systems. Learn strategies for summarization, message pruning, and context management to prevent system overflow.
Token Usage as a System Constraint
Tokens are the "Oxygen" of an AI agent. They are the fundamental unit of consumption and the primary limiting factor of your system's intelligence and memory. Unlike traditional software where memory is measured in Gigabytes, in Agentic Engineering, we measure in Context Window Tokens.
In this lesson, we will look at how tokens constrain your agentic architecture and how to build systems that scale without hitting the "Token Wall."
1. The Context Window Barrier
Every LLM has a hard limit on how many tokens it can process at one time.
- Claude 3.5 Sonnet: 200,000 tokens.
- GPT-4o: 128,000 tokens.
- Gemini 1.5 Pro: 2,000,000 tokens.
While these numbers sound large, an agentic loop consumes them at an alarming rate.
The Problem: Token Bloat
Each time the agent takes a step in a LangGraph workflow, the entire state (history + tools + reasoning) is sent back to the model.
- Step 1: 1,000 tokens.
- Step 2: 2,500 tokens (Includes Step 1).
- Step 3: 4,200 tokens (Includes Step 1 and 2).
- Total Consumption: 7,700 tokens across 3 turns.
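Here is a minimal sketch of that compounding. The per-step figures match the list above and are purely illustrative:

```python
# Illustrative only: each step re-sends the entire accumulated history,
# so total tokens consumed grow roughly quadratically with turn count.
new_tokens_per_step = [1_000, 1_500, 1_700]  # fresh tokens added at each step

context_size = 0
total_sent = 0
for step, new_tokens in enumerate(new_tokens_per_step, start=1):
    context_size += new_tokens   # context now includes every prior step
    total_sent += context_size   # the whole context is sent to the model again
    print(f"Step {step}: context={context_size:,}, cumulative sent={total_sent:,}")

# Step 1: context=1,000 | Step 2: context=2,500 | Step 3: context=4,200
# Cumulative sent: 1,000 + 2,500 + 4,200 = 7,700
```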
2. Managing Chat History: The "State" Cleanup
To prevent an agent from becoming a "Zombie" (too much context, no intelligence), you must implement Memory Management.
Strategy 1: Message Pruning
Keep only the last $N$ messages.
- Pro: Minimal latency, predictable token usage.
- Con: The agent "forgets" the beginning of the task.
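A minimal pruning sketch, assuming messages are plain dicts with `role` and `content` keys (the system message is pinned so it never falls out of the window):

```python
def prune_messages(messages: list[dict], keep_last: int = 6) -> list[dict]:
    """Keep the system message plus only the last `keep_last` chat messages."""
    system = [m for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    return system + chat[-keep_last:]
```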
Strategy 2: Summarization (Recursive Memory)
When the message history exceeds a threshold (e.g., 20 messages), trigger a "Summary Node."
- A separate LLM call summarizes the history into 500 tokens.
- The old messages are deleted.
- The summary is added as a "System Message" at the top of the new context.
```mermaid
graph LR
    History[30 Messages] -->|Too Long| Node[Summarizer]
    Node -->|Reduced| NewHistory[1 Summary + 2 New Messages]
```
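A sketch of that Summary Node, assuming a generic `llm` callable that takes a prompt string and returns text (swap in your actual model client):

```python
SUMMARY_THRESHOLD = 20  # trigger the summary node past this many messages

def maybe_summarize(messages: list[dict], llm) -> list[dict]:
    """Collapse old history into a single system-level summary message."""
    if len(messages) <= SUMMARY_THRESHOLD:
        return messages

    old, recent = messages[:-2], messages[-2:]  # keep the 2 newest messages
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = llm(f"Summarize this conversation in under 500 tokens:\n{transcript}")

    # Old messages are deleted; the summary takes their place at the top.
    return [{"role": "system", "content": f"Summary of earlier conversation: {summary}"}] + recent
```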
Strategy 3: Dynamic Compression
Use a tool like `tiktoken` to count tokens in real time. If you are near the limit, remove non-essential metadata or tool outputs from the history.
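A real-time counting sketch with `tiktoken` (the `cl100k_base` encoding is an approximation; each provider tokenizes slightly differently):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # approximation; providers differ

def count_tokens(messages: list[dict]) -> int:
    """Approximate total token count for a list of message dicts."""
    return sum(len(enc.encode(m["content"])) for m in messages)

def compress_if_needed(messages: list[dict], limit: int = 100_000) -> list[dict]:
    """Drop the oldest tool outputs first when nearing the context limit."""
    while count_tokens(messages) > limit:
        idx = next((i for i, m in enumerate(messages) if m["role"] == "tool"), None)
        if idx is None:
            break  # nothing non-essential left to drop
        messages.pop(idx)
    return messages
```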
3. Tool Output Constraints (The "Sewer" of Tokens)
Tool outputs are often the biggest contributors to context overflow.
- The mistake: An agent calls a `web_search` tool, which returns 50,000 tokens of raw HTML.
- The result: The agent immediately hits its context limit or becomes incredibly slow.
The Solution: Filtering Tools
Never allow a tool to return raw data to the agent's main state.
- Bad: `return html_body`
- Good: `return extract_summary(html_body)`
- Best: return a list of relevant snippets with IDs
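A sketch of a filtering wrapper; `raw_web_search` and `strip_html` are hypothetical placeholders for your actual search tool and HTML cleaner:

```python
MAX_TOOL_TOKENS = 2_000  # hard budget for any single tool result

def safe_web_search(query: str) -> list[dict]:
    """Run the raw search, but hand the agent only compact, ID'd snippets."""
    raw_html = raw_web_search(query)   # hypothetical underlying search tool
    text = strip_html(raw_html)        # hypothetical HTML-to-text helper

    snippets, used = [], 0
    for i, chunk in enumerate(text.split("\n\n")):
        cost = len(chunk) // 4         # rough chars-per-token estimate
        if used + cost > MAX_TOOL_TOKENS:
            break                      # stop before blowing the budget
        snippets.append({"id": f"snippet-{i}", "text": chunk})
        used += cost
    return snippets
```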
4. Input vs. Output Token Asymmetry
Models have different pricing and performance profiles for Input vs. Output tokens.
- Input (Prompting): Fast, cheaper, but bounded by context window.
- Output (Generation): Slow, 5-10x more expensive, and bounded by a smaller output limit (usually 4k-8k tokens).
Implication for Agents: Design your agent to think in Short Bursts. Instead of asking an agent to "Write a full book" (which will fail due to output limits), have it "Write Chapter 1" -> Save -> "Write Chapter 2".
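A sketch of the short-burst pattern, assuming the same generic `llm` callable as above and a hypothetical `save_chapter` persistence helper:

```python
def write_book(llm, outline: list[str]) -> None:
    """Generate one chapter per call instead of one giant generation."""
    summary_so_far = ""
    for i, title in enumerate(outline, start=1):
        chapter = llm(
            f"Story so far (summary): {summary_so_far}\n"
            f"Write Chapter {i}: {title}"
        )
        save_chapter(i, chapter)  # hypothetical persistence helper
        summary_so_far = llm(f"Summarize this chapter in a few sentences:\n{chapter}")
```

Each call stays well under the output limit, and the rolling summary keeps the input side small too.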
5. Estimating Cost: The Unit Economics
Before deploying an agent, you must calculate its Unit Margin.
| Component | Average Tokens |
|---|---|
| System Prompt | 1,000 |
| Tool Definitions | 1,500 |
| Per User Message | 200 |
| Per Agent Step | 800 |
| Session (10 turns) | ~50,000 - 150,000 cumulative |
If a session costs $0.50 to $1.50 in tokens, can your business charge enough to make a profit?
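A back-of-envelope calculator using the table above. The per-million-token rates here are placeholders; substitute your provider's current pricing:

```python
# Placeholder rates in USD per 1M tokens; check your provider's pricing page.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00

def session_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A 10-turn session at ~150k cumulative input and ~8k output tokens:
print(f"${session_cost(150_000, 8_000):.2f}")  # ≈ $0.57
```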
6. Token Guardrails in LangGraph
We will implement a MaxTurns constraint in every graph. This is the ultimate safety switch.
```python
def check_recursion_limit(state: AgentState) -> str:
    # Hard stop: never let the agent loop more than 10 steps.
    if len(state["steps_taken"]) > 10:
        return "ERROR: Recursion Limit Exceeded"
    return "CONTINUE"
```
Summary and Mental Model
Tokens are like Memory Slots in an old console. You have a limited number. If you fill them with "junk" (long tool outputs), you don't have room for "thinking" (reasoning).
An efficient agent engineer is a minimalist.
In the next lesson, we will look at Guardrails and Failure Handling—the mechanisms we use to stop an agent when it starts wasting its precious "Oxygen" (Tokens) on a failing path.
Exercise: Token Optimization
- Pruning Design: A user is having a technical support conversation.
  - If you prune the first 5 messages, what critical info might the agent forget? (Address? Device ID?)
  - How would you use a "Fixed Metadata" field in the `State` to keep that info forever while pruning the chat messages?
- The Calculation: A model has a 128k context window. Each turn adds 5,000 tokens.
  - At which turn will the model hit its limit?
  - How many turns can it sustain if you use "Summarization" to cap the history at 20,000 tokens?
- Tool Safety: Write a Python function that takes a string of any length and returns only the first 500 words for the agent to read.