Module 13 Lesson 4: Load Balancing Local AI

Going horizontal. How to use Nginx or HAProxy to distribute traffic across multiple Ollama servers.

Load Balancing: The AI Cluster

What if you have a team of 100 people and one computer (even with 4 GPUs) isn't enough? You move from "Vertical Scaling" (one big machine) to "Horizontal Scaling" (many machines).

You can put an Nginx Load Balancer in front of three different computers, each running Ollama.

1. The Architecture

  1. Server A (Gateway): Runs Nginx and receives all user traffic.
  2. Server B: Runs Ollama (GPU 1).
  3. Server C: Runs Ollama (GPU 2).
  4. Server D: Runs Ollama (GPU 3).

When a user asks a question, Nginx picks the server with the fewest active connections (the "least busy" one) and forwards the request there.


2. Nginx Configuration Fragment

# Place inside the http { } block of nginx.conf
upstream ollama_cluster {
    least_conn; # Send to the machine with the fewest active connections
    server 192.168.1.10:11434; # Server B
    server 192.168.1.11:11434; # Server C
    server 192.168.1.12:11434; # Server D
}

server {
    listen 80;
    location / {
        proxy_pass http://ollama_cluster;
    }
}
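
One practical note: LLM responses are usually streamed token by token, and Nginx's default response buffering and 60-second read timeout can stall or cut off long generations. A minimal sketch of the extra directives worth adding to the location block (the 300s timeout is an illustrative choice, not a requirement):

    location / {
        proxy_pass http://ollama_cluster;
        proxy_http_version 1.1;  # HTTP/1.1 handles chunked streaming responses
        proxy_buffering off;     # Forward streamed tokens immediately
        proxy_read_timeout 300s; # Give slow, long generations time to finish
    }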

3. The Challenge of "State"

If a user is having a conversation, they should keep talking to the same server. If Turn 1 lands on Server B and Turn 2 lands on Server C, Server C has no KV cache of the conversation in its memory, so it must re-process the entire chat history from scratch before generating a single new token. The answer will still arrive (Ollama's chat API is stateless, and the client resends the full history each turn), but the time-to-first-token gets noticeably worse.

Solution: Use "Sticky Sessions" (session affinity) in your load balancer so requests from the same client IP always land on the same Ollama instance.
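
In open-source Nginx, the simplest form of affinity is IP hashing: each client IP is consistently mapped to one upstream server. A minimal sketch (note that ip_hash replaces least_conn, so you trade some balance for cache locality):

    upstream ollama_cluster {
        ip_hash; # Same client IP -> same backend, preserving its KV cache
        server 192.168.1.10:11434;
        server 192.168.1.11:11434;
        server 192.168.1.12:11434;
    }

One caveat: if all your users sit behind a single corporate NAT, they share one IP and will pile onto one server. In that case, hashing on a per-session header or token is a better affinity key.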


4. Heterogeneous Clusters

The beauty of this setup is that you can build a "Frankenstein" cluster.

  • Machine 1: An old PC with a GTX 1080.
  • Machine 2: A new Mac Studio.
  • Machine 3: A Linux server with an RTX 3090.

By load balancing them, you create a unified "Private AI Cloud" from the spare hardware you already have.
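
A mixed cluster has one catch: least_conn treats all backends as equals, but an RTX 3090 finishes requests far faster than a GTX 1080. Nginx lets you express that with per-server weights, and it can mark flaky machines down automatically. A sketch, with weights as illustrative guesses rather than benchmarked values:

    upstream ollama_cluster {
        least_conn;
        server 192.168.1.10:11434 weight=1 max_fails=3 fail_timeout=30s; # GTX 1080
        server 192.168.1.11:11434 weight=2 max_fails=3 fail_timeout=30s; # Mac Studio
        server 192.168.1.12:11434 weight=3 max_fails=3 fail_timeout=30s; # RTX 3090
    }

With max_fails and fail_timeout, a server that stops responding is temporarily pulled from rotation, so one dying spare PC doesn't take the whole "cloud" down with it.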


Key Takeaways

  • Horizontal Scaling connects multiple computers into one AI service.
  • Nginx (or HAProxy) is a common choice for distributing AI traffic.
  • Least Connections is a sensible default strategy for long-running AI requests.
  • Sticky Sessions keep conversations fast by preserving KV cache locality.
