
Velocity and Stability: Concurrency and Throttling
Balance speed with reliability. Learn how to manage high-volume agentic requests without triggering API rate limits or crashing your infrastructure.
When you deploy a production agent, you are no longer limited by your own typing speed. You are limited by your infrastructure's throughput. Your agent might try to call a tool 100 times in parallel, or 10,000 users might try to talk to your agent at the same time.
Without Concurrency Control and Throttling, your system will enter a "Crash Loop" of 429 errors (Too Many Requests). In this lesson, we will learn how to manage the velocity of your agents.
1. Concurrency: Doing Things in Parallel
LangGraph allows you to run multiple nodes at the exact same time. This is essential for reducing the latency of complex tasks.
The Fan-Out / Fan-In Pattern
- One node starts the work.
- It "Branches" into several parallel worker nodes (three in the diagram below).
- All workers finish.
- One "Join" node summarizes the results.
```mermaid
graph LR
    Start --> FanOut{Parallel Launch}
    FanOut --> A[Search A]
    FanOut --> B[Search B]
    FanOut --> C[Search C]
    A --> Join[Synthesis]
    B --> Join
    C --> Join
    Join --> End
```
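The pattern above can be sketched with plain asyncio, independent of any framework. Here `worker` is a hypothetical stand-in for a real search or LLM call:

```python
import asyncio

async def worker(query: str) -> str:
    # Placeholder for real work (a search or LLM call); assumption for illustration.
    await asyncio.sleep(0.01)
    return f"result for {query}"

async def fan_out_fan_in(queries: list[str]) -> str:
    # Fan-out: launch all workers at the same time.
    results = await asyncio.gather(*(worker(q) for q in queries))
    # Fan-in: a single "Join" step synthesizes the results.
    return " | ".join(results)

print(asyncio.run(fan_out_fan_in(["A", "B", "C"])))
# → result for A | result for B | result for C
```

Because `asyncio.gather` preserves input order, the join step sees results in the same order the queries were submitted, regardless of which worker finished first.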
2. Throttling: The "Speed Governor"
Even if you can run 100 agents in parallel, your API providers (OpenAI, Anthropic, Google) have Rate Limits.
Types of Limits
- RPM (Requests Per Minute): How many times you can hit the API.
- TPM (Tokens Per Minute): Your total "volume" allowance.
- Concurrency Limit: How many requests can be in flight at the same time.
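RPM and TPM limits can both be enforced client-side with a token bucket. A minimal, hand-rolled sketch (libraries such as aiolimiter package the same idea): pass `cost=1` per request for an RPM limit, or the request's token count for a TPM limit.

```python
import asyncio
import time

class TokenBucket:
    """Token-bucket limiter: `rate` tokens refill per second, bursting up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    async def acquire(self, cost: float = 1.0) -> None:
        while True:
            # Refill based on elapsed time, capped at capacity.
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= cost:
                self.tokens -= cost
                return
            # Not enough budget yet: sleep until roughly enough has refilled.
            await asyncio.sleep((cost - self.tokens) / self.rate)
```

For a 10,000 TPM limit you would construct `TokenBucket(rate=10_000 / 60, capacity=10_000)` and call `acquire(cost=estimated_tokens)` before each request.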
Implementation: The Semaphore Pattern
In your Python code, you should never fire an unbounded loop of LLM calls. Use a Semaphore or a RateLimiter to "Queue" the requests.
```python
import asyncio

# Allow only 5 concurrent LLM calls across the whole system
sem = asyncio.Semaphore(5)

async def throttled_llm_call(prompt):
    # `llm` is your configured chat model (e.g. a LangChain chat model)
    async with sem:
        return await llm.ainvoke(prompt)
```
3. The "Backoff and Retry" Strategy
When you inevitably hit a rate limit (HTTP 429), how you handle it determines your system's reliability.
- Bad Approach: Catch the 429 and try again immediately. (This just makes the rate limit problem worse.)
- Good Approach: Exponential Backoff with Jitter.
- Wait 1 second + random ms.
- If it fails again, wait 2 seconds + random ms.
- If it fails again, wait 4 seconds...
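The steps above can be sketched as a small retry wrapper. `RateLimitError` is a hypothetical stand-in for your client library's 429 exception, and `base_delay` is parameterized (in production it would stay at 1.0 second):

```python
import asyncio
import random

class RateLimitError(Exception):
    """Stands in for an HTTP 429 from the provider (illustrative assumption)."""

async def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    for attempt in range(max_retries):
        try:
            return await fn()
        except RateLimitError:
            # Exponential backoff: base, 2x base, 4x base... plus random jitter
            # so that colliding clients do not all retry at the same instant.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)
    raise RuntimeError("Rate limit: retries exhausted")
```

The jitter term matters: without it, every client that hit the limit at the same moment retries at the same moment, recreating the spike.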
4. Prioritization: The "VIP" Queue
Not all agent tasks are equal.
- A User-facing Chat is high priority. (Should jump to the front of the queue).
- A Background Data Scraping task is low priority. (Can wait 5 minutes if the system is busy).
Solution: Priority Queues
Use a system like Redis to maintain multiple queues with different weights.
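In production you would back this with Redis; the scheduling logic itself can be sketched in-process with `asyncio.PriorityQueue`, where a lower number means higher priority:

```python
import asyncio
from itertools import count

async def demo() -> list[str]:
    queue = asyncio.PriorityQueue()
    seq = count()  # tie-breaker so equal priorities stay FIFO
    # Priority 0 = VIP (user-facing chat), 10 = background work.
    await queue.put((10, next(seq), "background_scrape"))
    await queue.put((0, next(seq), "user_chat"))
    await queue.put((10, next(seq), "nightly_report"))
    order = []
    while not queue.empty():
        _, _, name = await queue.get()
        order.append(name)
    return order

print(asyncio.run(demo()))
# → ['user_chat', 'background_scrape', 'nightly_report']
```

Even though the chat task was enqueued second, it is dequeued first: the user-facing request jumps the queue ahead of the background jobs.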
5. Token Throttling (Budgeting)
As we discussed in Module 3.3, tokens are a system constraint. You can implement a Token Quota per user.
- "User A has used 90% of their daily token budget."
- Throttle User A's agents so they run on a cheaper, "Slower" model like Haiku or GPT-4o-mini.
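A quota check can be as simple as a model-routing function. The budget value and model names below are illustrative assumptions; only the 90% threshold comes from the lesson text:

```python
# Hypothetical daily per-user quota.
DAILY_TOKEN_BUDGET = 100_000

def pick_model(tokens_used_today: int) -> str:
    """Downgrade to a cheaper model once a user nears their daily token quota."""
    if tokens_used_today >= 0.9 * DAILY_TOKEN_BUDGET:
        return "cheap-fast-model"   # e.g. Haiku or GPT-4o-mini tier
    return "default-model"

print(pick_model(50_000))  # → default-model
print(pick_model(95_000))  # → cheap-fast-model
```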
6. Implementation Example: LangGraph Throttling
You can use asyncio.gather within a LangGraph node to handle internal parallelism safely.
```python
async def parallel_search_node(state):
    # Launch 3 searches at once
    tasks = [
        search_tool.ainvoke(state["q1"]),
        search_tool.ainvoke(state["q2"]),
        search_tool.ainvoke(state["q3"]),
    ]
    results = await asyncio.gather(*tasks)
    return {"results": results}
```
Summary and Mental Model
Think of Concurrency and Throttling like a Highway.
- Concurrency is adding more lanes. It allows more traffic (tasks) to move at once.
- Throttling is the Toll Booth. It makes sure that even if the highway is wide, the exit point (The API or DB) doesn't get overwhelmed.
A fast highway with a single, jammed exit is a parking lot.
Exercise: Scaling Calculation
- The Math: Your OpenAI limit is 10,000 Tokens Per Minute. Each agent session uses 2,000 tokens per minute.
- How many concurrent users can you support with this limit?
- How would you use a Small Model for simple steps to "reclaim" token space for more users?
- Design: Why is it better to have a "Global" rate limiter for the whole app rather than an "Agent-level" rate limiter?
- Logic: What happens to a "Human-in-the-loop" node (Module 5.3) when the system is throttled? Does the human have to wait, or only the agent?
Ready for communication? Let's move on to Inter-Agent Communication Patterns.