Module 15 Lesson 4: Scaling Agent Concurrency

Handling the crowd: how to manage thousands of concurrent agents without crashing your database or hitting API limits.

Scaling Concurrency: Thousands of Brains

In a hobby project, one agent runs on one laptop. In an enterprise project, 1,000 users might trigger 1,000 different agents at the exact same microsecond. If you aren't prepared for this, your application will freeze, your database will lock, and your API provider will block you.

1. The Bottlenecks

A. API Rate Limits

OpenAI and Anthropic enforce tiered rate limits. Even at the highest tier, you can only send a certain number of tokens and requests per minute.

  • Solution: Token Bucket rate limiting in your code. Queue requests and "drip" them to the API as capacity allows.
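The token-bucket idea above can be sketched in a few lines. This is a minimal, single-process illustration (the class name `TokenBucket` and its parameters are ours, not from any particular library); a production version would sit in front of your API client and queue rejected requests instead of dropping them.

```python
import time

class TokenBucket:
    """Minimal token bucket: holds up to `capacity` tokens, refilled at `rate` per second."""
    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the bucket's capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

# Burst of 3 requests allowed, then sustained 1 request/second.
bucket = TokenBucket(capacity=3, rate=1.0)
results = [bucket.try_acquire() for _ in range(5)]
```

A caller that gets `False` back should enqueue the request and retry once the bucket refills, which is the "drip" behavior described above.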

B. Database Locks (Stateful Agents)

If Agent A is updating state in Postgres while the user is simultaneously trying to read it, you get lock contention.

  • Solution: Use Redis for the "Hot State" (active conversations) and move data to Postgres only after the session is closed.
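A minimal sketch of that hot/cold split: `FakeRedis` is an in-memory stand-in with the same `hset`/`hgetall`/`delete` shape as the redis-py client, and the `cold` list stands in for Postgres; both names are ours, for illustration only.

```python
class FakeRedis:
    """In-memory stand-in mimicking redis-py's hash commands (hset/hgetall/delete)."""
    def __init__(self):
        self._store = {}
    def hset(self, key, field, value):
        self._store.setdefault(key, {})[field] = value
    def hgetall(self, key):
        return dict(self._store.get(key, {}))
    def delete(self, key):
        self._store.pop(key, None)

hot = FakeRedis()   # active conversations live here
cold = []           # stand-in for Postgres (durable storage)

def update_state(session_id, field, value):
    # While the conversation is live, writes hit only the hot store -- no SQL locks.
    hot.hset(f"session:{session_id}", field, value)

def close_session(session_id):
    # Only on session close do we snapshot the hot state into durable storage, then evict it.
    state = hot.hgetall(f"session:{session_id}")
    cold.append({"session": session_id, **state})
    hot.delete(f"session:{session_id}")

update_state("42", "last_message", "hello")
update_state("42", "step", "2")
close_session("42")
```

The design choice is that Postgres only ever sees one write per session instead of one write per agent step.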

2. Asynchronous Workers (Celery/Temporal)

Don't run the agent loop inside your web server (FastAPI/Express). If an agent takes 30 seconds to run, that web server worker is busy and can't serve other users.

  • The Pattern: Web server creates a Job. A separate Worker Process picks up the job and runs the agent.
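The pattern above can be demonstrated in-process with a plain queue and a worker thread. This is a teaching sketch, not Celery or Temporal: in production the queue would be Redis/RabbitMQ, the worker a separate process, and the client would poll a `/jobs/{id}` endpoint rather than calling `join()`.

```python
import queue
import threading

jobs = queue.Queue()
results = {}

def worker():
    """Worker-process stand-in: pulls jobs off the queue and runs the (slow) agent loop."""
    while True:
        job = jobs.get()
        if job is None:  # shutdown sentinel
            break
        job_id, prompt = job
        # Placeholder for the real 30-second agent run.
        results[job_id] = f"agent answer for: {prompt}"
        jobs.task_done()

def submit(job_id, prompt):
    """Web-server handler stand-in: enqueue the job and return immediately."""
    jobs.put((job_id, prompt))
    return {"job_id": job_id, "status": "queued"}

t = threading.Thread(target=worker, daemon=True)
t.start()
ack = submit("job-1", "summarize the report")
jobs.join()      # demo only; a real client polls for the result instead of blocking
jobs.put(None)
t.join()
```

Note that `submit` returns before the agent finishes: that decoupling is exactly what keeps the web server responsive.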

3. Visualizing the Scaled Architecture

```mermaid
graph LR
    User[1,000 Humans] --> API[FastAPI Gateway]
    API --> Queue[Redis / BullMQ Queue]
    subgraph Workers
    Queue --> W1[Worker Agent 1]
    Queue --> W2[Worker Agent 2]
    Queue --> W3[Worker Agent 3]
    end
    W1 --> LLM[OpenAI / Local Cluster]
    W2 --> LLM
    W3 --> LLM
```

4. Multi-Region Deployments

If your users are in Europe and your LLM server is in America, the latency (lag) will be high.

  • Deploy your "Agent Code" close to your "User."
  • If the model is local (Module 13), deploy a Global Cluster of GPU instances.

5. Engineering Tip: Resource Contention in Tools

If your agents use a tool like execute_sql, remember that a database can only handle so many simultaneous connections.

  • The Fix: Implement a Connection Pool. Don't let 1,000 agents open 1,000 separate connections to your database at once.
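A minimal pool sketch, using SQLite in-memory connections as a stand-in for your real database (the `ConnectionPool` class is ours; in practice you'd reach for your driver's built-in pooling, e.g. SQLAlchemy's):

```python
import queue
import sqlite3

class ConnectionPool:
    """Bounded pool: at most `size` connections ever exist; agents borrow and return them."""
    def __init__(self, size: int):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            # Stand-in for real DB connections (e.g. psycopg2.connect(...)).
            self._pool.put(sqlite3.connect(":memory:", check_same_thread=False))

    def acquire(self, timeout: float = 5.0):
        # Blocks when all connections are in use, instead of opening connection #1,001.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(size=2)
conn = pool.acquire()
row = conn.execute("SELECT 1 + 1").fetchone()
pool.release(conn)
```

The key property is the bound: no matter how many agents are running, the database never sees more than `size` connections, and excess agents simply wait their turn.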

Key Takeaways

  • Decoupling the UI from the Agent reasoning (via a queue) is the secret to scaling.
  • Redis is far better suited than a SQL database for managing "Hot" agent state.
  • Rate limits are a mathematical reality; you must build with "Backoff" logic.
  • Connection pooling for tools prevents agents from accidentally DOS-ing your own infrastructure.
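The "Backoff" logic in the takeaways can be sketched as exponential backoff with full jitter (function name and parameters are illustrative; libraries like tenacity implement this for you):

```python
import random

def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0):
    """Exponential backoff with full jitter: the ceiling doubles per attempt,
    is capped, and the actual delay is randomized to avoid thundering herds."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, ceiling))
    return delays

delays = backoff_delays(5)  # sleep these durations between retries on a 429 response
```

The jitter matters at scale: if 1,000 agents all get rate-limited at once and all retry after exactly 2 seconds, they hit the limit again together.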
