
Velocity and Stability: Concurrency and Throttling
Balance speed with reliability. Learn how to manage high-volume agentic requests without triggering API rate limits or crashing your infrastructure.
When you deploy a production agent, you are no longer limited by your own typing speed. You are limited by your infrastructure's throughput. Your agent might try to call a tool 100 times in parallel, or 10,000 users might try to talk to your agent at the same time.
Without Concurrency Control and Throttling, your system will enter a "Crash Loop" of 429 errors (Too Many Requests). In this lesson, we will learn how to manage the velocity of your agents.
1. Concurrency: Doing Things in Parallel
LangGraph allows you to run multiple nodes at the exact same time. This is essential for reducing the latency of complex tasks.
The Fan-Out / Fan-In Pattern
- One node starts the work.
- It "Branches" into several parallel worker nodes (three in the diagram below).
- All workers finish.
- One "Join" node summarizes the results.
```mermaid
graph LR
    Start --> FanOut{Parallel Launch}
    FanOut --> A[Search A]
    FanOut --> B[Search B]
    FanOut --> C[Search C]
    A --> Join[Synthesis]
    B --> Join
    C --> Join
    Join --> End
```
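The pattern above can be sketched with plain asyncio, independent of any framework. Here `worker` is a hypothetical stand-in for a real search or LLM call:

```python
import asyncio

async def worker(query: str) -> str:
    # Placeholder for real work (a search or LLM call); assumption for illustration.
    await asyncio.sleep(0.01)
    return f"result for {query}"

async def fan_out_fan_in(queries: list[str]) -> str:
    # Fan-out: launch all workers at the same time.
    results = await asyncio.gather(*(worker(q) for q in queries))
    # Fan-in: a single "Join" step synthesizes the results.
    return " | ".join(results)

print(asyncio.run(fan_out_fan_in(["A", "B", "C"])))
# → result for A | result for B | result for C
```

Because `asyncio.gather` preserves input order, the join step sees results in the same order the queries were submitted, regardless of which worker finished first.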
2. Throttling: The "Speed Governor"
Even if you can run 100 agents in parallel, your API providers (OpenAI, Anthropic, Google) have Rate Limits.
Types of Limits
- RPM (Requests Per Minute): How many times you can hit the API.
- TPM (Tokens Per Minute): Your total "volume" allowance.
- Concurrency Limit: How many requests can be in flight at the same time.
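RPM and TPM limits can both be enforced client-side with a token bucket. A minimal, hand-rolled sketch (libraries such as aiolimiter package the same idea): pass `cost=1` per request for an RPM limit, or the request's token count for a TPM limit.

```python
import asyncio
import time

class TokenBucket:
    """Token-bucket limiter: `rate` tokens refill per second, bursting up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    async def acquire(self, cost: float = 1.0) -> None:
        while True:
            # Refill based on elapsed time, capped at capacity.
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= cost:
                self.tokens -= cost
                return
            # Not enough budget yet: sleep until roughly enough has refilled.
            await asyncio.sleep((cost - self.tokens) / self.rate)
```

For a 10,000 TPM limit you would construct `TokenBucket(rate=10_000 / 60, capacity=10_000)` and call `acquire(cost=estimated_tokens)` before each request.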
Implementation: The Semaphore Pattern
In your Python code, you should never fire an unbounded loop of LLM calls. Use a Semaphore or a RateLimiter to "Queue" the requests.
```python
import asyncio

# Allow only 5 concurrent LLM calls across the whole system
sem = asyncio.Semaphore(5)

async def throttled_llm_call(prompt):
    # `llm` is your configured chat model (e.g. a LangChain chat model)
    async with sem:
        return await llm.ainvoke(prompt)
```
3. The "Backoff and Retry" Strategy
When you inevitably hit a rate limit (HTTP 429), how you handle it determines your system's reliability.
- Bad Approach: Catch the 429 and try again immediately. (This just makes the rate limit problem worse.)
- Good Approach: Exponential Backoff with Jitter.
- Wait 1 second + random ms.
- If it fails again, wait 2 seconds + random ms.
- If it fails again, wait 4 seconds...
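The steps above can be sketched as a small retry wrapper. `RateLimitError` is a hypothetical stand-in for your client library's 429 exception, and `base_delay` is parameterized (in production it would stay at 1.0 second):

```python
import asyncio
import random

class RateLimitError(Exception):
    """Stands in for an HTTP 429 from the provider (illustrative assumption)."""

async def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    for attempt in range(max_retries):
        try:
            return await fn()
        except RateLimitError:
            # Exponential backoff: base, 2x base, 4x base... plus random jitter
            # so that colliding clients do not all retry at the same instant.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)
    raise RuntimeError("Rate limit: retries exhausted")
```

The jitter term matters: without it, every client that hit the limit at the same moment retries at the same moment, recreating the spike.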
4. Prioritization: The "VIP" Queue
Not all agent tasks are equal.
- A User-facing Chat is high priority. (Should jump to the front of the queue).
- A Background Data Scraping task is low priority. (Can wait 5 minutes if the system is busy).
Solution: Priority Queues
Use a system like Redis to maintain multiple queues with different weights.
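In production you would back this with Redis; the scheduling logic itself can be sketched in-process with `asyncio.PriorityQueue`, where a lower number means higher priority:

```python
import asyncio
from itertools import count

async def demo() -> list[str]:
    queue = asyncio.PriorityQueue()
    seq = count()  # tie-breaker so equal priorities stay FIFO
    # Priority 0 = VIP (user-facing chat), 10 = background work.
    await queue.put((10, next(seq), "background_scrape"))
    await queue.put((0, next(seq), "user_chat"))
    await queue.put((10, next(seq), "nightly_report"))
    order = []
    while not queue.empty():
        _, _, name = await queue.get()
        order.append(name)
    return order

print(asyncio.run(demo()))
# → ['user_chat', 'background_scrape', 'nightly_report']
```

Even though the chat task was enqueued second, it is dequeued first: the user-facing request jumps the queue ahead of the background jobs.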
5. Token Throttling (Budgeting)
As we discussed in Module 3.3, tokens are a system constraint. You can implement a Token Quota per user.
- "User A has used 90% of their daily token budget."
- Throttle User A's agents so they run on a cheaper, "Slower" model like Haiku or GPT-4o-mini.
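A quota check can be as simple as a model-routing function. The budget value and model names below are illustrative assumptions; only the 90% threshold comes from the lesson text:

```python
# Hypothetical daily per-user quota.
DAILY_TOKEN_BUDGET = 100_000

def pick_model(tokens_used_today: int) -> str:
    """Downgrade to a cheaper model once a user nears their daily token quota."""
    if tokens_used_today >= 0.9 * DAILY_TOKEN_BUDGET:
        return "cheap-fast-model"   # e.g. Haiku or GPT-4o-mini tier
    return "default-model"

print(pick_model(50_000))  # → default-model
print(pick_model(95_000))  # → cheap-fast-model
```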
6. Implementation Example: LangGraph Throttling
You can use asyncio.gather within a LangGraph node to handle internal parallelism safely.
```python
async def parallel_search_node(state):
    # Launch 3 searches at once
    tasks = [
        search_tool.ainvoke(state["q1"]),
        search_tool.ainvoke(state["q2"]),
        search_tool.ainvoke(state["q3"]),
    ]
    results = await asyncio.gather(*tasks)
    return {"results": results}
```
Summary and Mental Model
Think of Concurrency and Throttling like a Highway.
- Concurrency is adding more lanes. It allows more traffic (tasks) to move at once.
- Throttling is the Toll Booth. It makes sure that even if the highway is wide, the exit point (The API or DB) doesn't get overwhelmed.
A fast highway with a single, jammed exit is a parking lot.
Exercise: Scaling Calculation
- The Math: Your OpenAI limit is 10,000 Tokens Per Minute. Each agent session uses 2,000 tokens per minute.
- How many concurrent users can you support with this limit?
- How would you use a Small Model for simple steps to "reclaim" token space for more users?
- Design: Why is it better to have a "Global" rate limiter for the whole app rather than an "Agent-level" rate limiter?
- Logic: What happens to a "Human-in-the-loop" node (Module 5.3) when the system is throttled? Does the human have to wait, or only the agent?
Ready for communication? Let's move on to Inter-Agent Communication Patterns.