
The Agent Traffic Jam: Load Balancing
Master the flow of autonomous requests. Learn why traditional round-robin load balancing fails for long-running agents and how to implement 'Resource-Aware' routing.
Load Balancing Long-Running Agents
In a standard web app, a load balancer (like AWS ALB or Nginx) distributes requests like cards being dealt from a deck: "Server 1, Server 2, Server 3, repeat." This works because every request is more or less the same (e.g., getting a profile page).
In Agentic systems, this "Round Robin" strategy fails.
- Request A is a simple "Hello." (Takes 50ms).
- Request B is "Audit this 500-page PDF." (Takes 5 minutes).
If Server 1 gets 10 "PDF" requests while Server 2 gets 10 "Hello" requests, Server 1 will crash while Server 2 sits idle. In this lesson, we will cover Least-Connections and Resource-Aware load balancing.
1. The Strategy: "Least Outstanding Requests"
Instead of "Round Robin," the load balancer keeps track of how many active agent threads each server is currently handling.
- Rule: Send the next request to the server with the Fewest active connections, as sketched below.
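The sketch uses illustrative class names (AgentServer, LeastConnectionsBalancer); real balancers such as Nginx (least_conn) or HAProxy (leastconn) implement the same idea natively.

```python
# A sketch of "Least Outstanding Requests" routing.
from dataclasses import dataclass

@dataclass
class AgentServer:
    name: str
    active_requests: int = 0  # agent threads currently running on this server

class LeastConnectionsBalancer:
    def __init__(self, servers: list[AgentServer]):
        self.servers = servers

    def route(self) -> AgentServer:
        # Rule: pick the server with the fewest active connections.
        target = min(self.servers, key=lambda s: s.active_requests)
        target.active_requests += 1
        return target

    def release(self, server: AgentServer) -> None:
        # Called when the agent finishes (the 50ms "Hello" or the 5-minute PDF audit).
        server.active_requests -= 1

balancer = LeastConnectionsBalancer([AgentServer("server-1"), AgentServer("server-2")])
s1 = balancer.route()         # server-1 takes the 5-minute PDF audit
s2 = balancer.route()         # server-2 takes the quick "Hello"
balancer.release(s2)          # the "Hello" finishes almost immediately
print(balancer.route().name)  # -> server-2 again, because server-1 is still busy
```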
2. Resource-Aware Routing
A server in your cluster might be busy not because of how many "Users" it is serving, but because of its CPU/Memory usage.
- If Server 1 is running an agent that is currently executing a Python script, its CPU might be at 100%.
- A "Smart" Load Balancer (like HAProxy or a custom K8s controller) checks the Health Metrics of the target server before sending another agent task.
3. Handling the "Long Tail" of Latency
Most agents finish within 10 seconds, but 5% take 10 minutes. A standard Load Balancer will "Timeout" after 60 seconds (a common default idle timeout) and drop the connection.
The solution: The De-coupled Proxy
- Frontend connects to a WebSocket Proxy.
- The Proxy says "I'll keep this connection open for 1 hour."
- The Backend agent worker finishes the task (10 minutes later) and pushes the result to a Redis channel.
- The Proxy reads the channel and sends the data back to the user.
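A condensed sketch of this pattern, assuming the websockets and redis (redis-py) packages, a locally reachable Redis instance, and Python 3.11+ for asyncio.timeout; the channel name and task_id field are illustrative:

```python
# A sketch of the de-coupled proxy: the WebSocket stays open while the
# backend worker publishes its result to Redis whenever it finishes.
import asyncio
import json

import redis.asyncio as aioredis
import websockets

r = aioredis.Redis()  # assumes a local Redis; the agent worker publishes results here

async def handle_client(ws):
    request = json.loads(await ws.recv())          # e.g. {"task_id": "abc-123", ...}
    channel = f"agent-results:{request['task_id']}"

    pubsub = r.pubsub()
    await pubsub.subscribe(channel)
    try:
        # Keep the connection open for up to an hour instead of the usual 60 seconds.
        async with asyncio.timeout(3600):
            async for message in pubsub.listen():
                if message["type"] == "message":
                    # The worker finished (perhaps 10 minutes later) and pushed the
                    # result to the Redis channel; forward it back to the user.
                    await ws.send(message["data"].decode())
                    break
    finally:
        await pubsub.unsubscribe(channel)

async def main():
    async with websockets.serve(handle_client, "0.0.0.0", 8765):
        await asyncio.Future()  # serve forever

asyncio.run(main())
```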
4. Multi-Region Load Balancing
If your LLM provider enforces rate limits per region (as per-region deployments do), you can gain more throughput by spreading your agents across multiple regions (us-east-1, eu-west-1).
The Global Dispatcher:
A central "Dispatcher Node" tracks your Remaining Token Quota for each region.
- "OpenAI us-east-1 is reached 100% capacity."
- "Direct the next 500 agents to the eu-west-1 cluster."
5. Circuit Breakers
What happens when an agent starts "Hallucinating" at scale?
- If the Error Rate for a specific version of your agent (Module 16.4) exceeds 5%, the Load Balancer should "Trip the Circuit."
- This stops traffic to that version and falls back to a "Basic" agent or a "Maintenance" message, preventing the reputational damage of an "Insane" agent talking to thousands of users.
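A bare-bones sketch of that breaker, using a rolling window and the 5% threshold from the rule above (the window size and version names are invented for illustration):

```python
# A sketch of a circuit breaker keyed on the error rate of one agent version.
from collections import deque

class CircuitBreaker:
    def __init__(self, error_threshold: float = 0.05, window: int = 200):
        self.results = deque(maxlen=window)   # True = error, False = success
        self.error_threshold = error_threshold
        self.tripped = False

    def record(self, error: bool) -> None:
        self.results.append(error)
        if len(self.results) == self.results.maxlen:
            error_rate = sum(self.results) / len(self.results)
            if error_rate > self.error_threshold:
                self.tripped = True  # stop routing traffic to this version

def route(breakers: dict[str, CircuitBreaker], new_version: str) -> str:
    # Fall back to the "Basic" agent while the new version's circuit is open.
    return "basic-agent" if breakers[new_version].tripped else new_version

breakers = {"agent-v2": CircuitBreaker()}
for _ in range(200):
    breakers["agent-v2"].record(error=True)   # the new version is hallucinating badly
print(route(breakers, "agent-v2"))            # -> "basic-agent"
```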
6. Implementation Strategy: Weighted Routing
You can use Weights to slowly roll out new models.
- Llama 3 (Local): Weight 90. (Cheap).
- GPT-4o (Cloud): Weight 10. (Expensive).
This ensures that 90% of your traffic is handled by your "Cost-Effective" hardware, while 10% stays on the "Premium" brain, as sketched below.
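The sketch uses weighted random selection; the backend names are illustrative, and in practice the weights would live in your load balancer or service-mesh config rather than in application code.

```python
# A sketch of a 90/10 weighted split between a cheap local model and a premium one.
import random

BACKENDS = [
    ("llama3-local", 90),   # cheap: handles the bulk of traffic
    ("gpt-4o-cloud", 10),   # expensive: the "Premium" brain
]

def pick_backend() -> str:
    names, weights = zip(*BACKENDS)
    return random.choices(names, weights=weights, k=1)[0]

# Over many requests, roughly 90% land on the local model.
sample = [pick_backend() for _ in range(10_000)]
print(sample.count("llama3-local") / len(sample))  # ~0.9
```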
Summary and Mental Model
Think of Load Balancing like Directing Traffic at a Toll Booth.
- If you just open every lane equally, you'll get a jam behind a truck (Long-running task).
- If you have a Traffic Controller (The Smart Balancer) who points trucks to the dedicated "Heavy" lane and cars to the "Fast" lane, everyone moves faster.
Exercise: Routing Logic
- Selection: You have two servers:
- Server A: 8 CPU cores, 16GB RAM.
- Server B: 64 CPU cores, 128GB RAM.
- How would you set your Weights in the load balancer to ensure the "Heavy" agents go to Server B?
- Persistence: What happens to a "Sticky Session" (Module 18.2) if the server it is pinned to reboots?
- How do you handle the "Handover"?
- Logic: Why is WebSocket scaling harder than REST API scaling?
- (Hint: Look up "Number of open file descriptors").
Ready for the final push? Next lesson: Caching and Performance at Scale.