Module 13 Lesson 4: Load Balancing Local AI

Going horizontal. How to use Nginx or HAProxy to distribute traffic across multiple Ollama servers.

Load Balancing: The AI Cluster

What if you have a team of 100 people and one computer (even with 4 GPUs) isn't enough? You move from "Vertical Scaling" (one big machine) to "Horizontal Scaling" (many machines).

You can put an Nginx Load Balancer in front of three different computers, each running Ollama.

1. The Architecture

  1. Server A (Gateway): Runs Nginx and receives all user traffic.
  2. Server B: Runs Ollama (GPU 1).
  3. Server C: Runs Ollama (GPU 2).
  4. Server D: Runs Ollama (GPU 3).

When a user asks a question, Nginx picks the server with the fewest active connections (the "least busy" one) and forwards the request there.


2. Nginx Configuration Fragment

# Place inside the http { } block of nginx.conf
upstream ollama_cluster {
    least_conn; # Send to the machine with the fewest active connections
    server 192.168.1.10:11434; # Server B
    server 192.168.1.11:11434; # Server C
    server 192.168.1.12:11434; # Server D
}

server {
    listen 80;
    location / {
        proxy_pass http://ollama_cluster;
    }
}
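
One practical note: LLM responses are usually streamed token by token, and Nginx's default response buffering and 60-second read timeout can stall or cut off long generations. A minimal sketch of the extra directives worth adding to the location block (the 300s timeout is an illustrative choice, not a requirement):

    location / {
        proxy_pass http://ollama_cluster;
        proxy_http_version 1.1;  # HTTP/1.1 handles chunked streaming responses
        proxy_buffering off;     # Forward streamed tokens immediately
        proxy_read_timeout 300s; # Give slow, long generations time to finish
    }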

3. The Challenge of "State"

If a user is having a conversation, they should keep talking to the same server. If Turn 1 lands on Server B and Turn 2 lands on Server C, Server C has no KV cache of the conversation in its memory, so it must re-process the entire chat history from scratch before generating a single new token. The answer will still arrive (Ollama's chat API is stateless, and the client resends the full history each turn), but the time-to-first-token gets noticeably worse.

Solution: Use "Sticky Sessions" (session affinity) in your load balancer so requests from the same client IP always land on the same Ollama instance.
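
In open-source Nginx, the simplest form of affinity is IP hashing: each client IP is consistently mapped to one upstream server. A minimal sketch (note that ip_hash replaces least_conn, so you trade some balance for cache locality):

    upstream ollama_cluster {
        ip_hash; # Same client IP -> same backend, preserving its KV cache
        server 192.168.1.10:11434;
        server 192.168.1.11:11434;
        server 192.168.1.12:11434;
    }

One caveat: if all your users sit behind a single corporate NAT, they share one IP and will pile onto one server. In that case, hashing on a per-session header or token is a better affinity key.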


4. Heterogeneous Clusters

The beauty of this setup is that you can build a "Frankenstein" cluster.

  • Machine 1: An old PC with a GTX 1080.
  • Machine 2: A new Mac Studio.
  • Machine 3: A Linux server with an RTX 3090.

By load balancing them, you create a unified "Private AI Cloud" from the spare hardware you already have.
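
A mixed cluster has one catch: least_conn treats all backends as equals, but an RTX 3090 finishes requests far faster than a GTX 1080. Nginx lets you express that with per-server weights, and it can mark flaky machines down automatically. A sketch, with weights as illustrative guesses rather than benchmarked values:

    upstream ollama_cluster {
        least_conn;
        server 192.168.1.10:11434 weight=1 max_fails=3 fail_timeout=30s; # GTX 1080
        server 192.168.1.11:11434 weight=2 max_fails=3 fail_timeout=30s; # Mac Studio
        server 192.168.1.12:11434 weight=3 max_fails=3 fail_timeout=30s; # RTX 3090
    }

With max_fails and fail_timeout, a server that stops responding is temporarily pulled from rotation, so one dying spare PC doesn't take the whole "cloud" down with it.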


Key Takeaways

  • Horizontal Scaling connects multiple computers into one AI service.
  • Nginx (or HAProxy) is a common choice for distributing AI traffic.
  • Least Connections is a sensible default strategy for long-running AI requests.
  • Sticky Sessions keep conversations fast by preserving KV cache locality.
