Module 13 Lesson 3: Concurrency and Parallelism

Serving the crowd. How to configure Ollama to handle multiple concurrent user requests.

Concurrency: Multi-User Local AI

By default, Ollama processes requests sequentially rather than in parallel. If User A is asking a long question, User B has to wait in a queue for User A to finish. This is unacceptable if you are building an AI tool for a team.

Starting with Ollama 0.1.33, you can unlock Parallelism.

1. OLLAMA_NUM_PARALLEL

This environment variable tells Ollama how many requests each loaded model can process at the same time, i.e. how many "Workers" to keep active.

If you set OLLAMA_NUM_PARALLEL=4:

  • Ollama will carve your VRAM into 4 discrete slots, one per in-flight request.
  • 4 separate prompts can be processed at the exact same time (see the quick demo after this list).
  • The Cost: You need roughly 4x the VRAM for the "KV Cache" (short-term memory) of the model, since each slot keeps its own context.
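
A quick way to see this in action is to fire several requests at the local API at once. This is a minimal sketch using Ollama's standard /api/generate endpoint; the model name and prompts are just placeholders:

```bash
# Fire two prompts at a local Ollama server simultaneously.
# With OLLAMA_NUM_PARALLEL >= 2, both are processed in parallel
# instead of queueing behind each other.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Explain RAID levels briefly.", "stream": false}' &
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Explain DNS briefly.", "stream": false}' &
wait   # block until both background requests return
```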

2. OLLAMA_MAX_LOADED_MODELS

What if User A wants Llama 3 but User B wants Mistral? By default, Ollama may "Swap" them, unloading one model to load the other, which is slow. Set OLLAMA_MAX_LOADED_MODELS=2 to keep both models "Hot" in memory at the same time, as the example below shows.
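
Here is a sketch of what that looks like from the client side; llama3 and mistral stand in for whichever models your users actually request:

```bash
# With OLLAMA_MAX_LOADED_MODELS=2, serving these two requests does not
# force either model to be evicted and reloaded.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Say hello.", "stream": false}' &
curl -s http://localhost:11434/api/generate \
  -d '{"model": "mistral", "prompt": "Say hello.", "stream": false}' &
wait

# Confirm both models are resident:
curl -s http://localhost:11434/api/ps
```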


3. The Performance Trade-off

Parallelism isn't free.

  • Speed: If 4 people are chatting at once, each person gets tokens at roughly 1/4th the speed of a single user, because generation throughput is shared across slots.
  • Memory: Every slot adds its own KV cache, so you can run out of VRAM much sooner.

Recommendation: Only use parallelism if you have a large amount of VRAM (24GB+) or are using very small models (3B or 1B). If in doubt, measure on your own hardware, as sketched below.
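
A rough load test is enough to see the trade-off. This is a minimal sketch assuming a local server and a llama3 model; adjust N and the prompt to match your workload:

```bash
# Time N identical simultaneous requests. Compare wall-clock time for
# N=1 vs N=4 to see how generation throughput is divided between users.
N=4
time (
  for i in $(seq 1 "$N"); do
    curl -s http://localhost:11434/api/generate \
      -d '{"model": "llama3", "prompt": "Write a haiku about servers.", "stream": false}' \
      > /dev/null &
  done
  wait
)
```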


4. How to set it (Linux/Windows)

  • Linux (systemd): Run systemctl edit ollama.service and add Environment="OLLAMA_NUM_PARALLEL=4" under the [Service] section (full sequence below).
  • Windows: Set a System Environment Variable through the Control Panel (or run setx OLLAMA_NUM_PARALLEL 4 in a terminal), then restart Ollama.
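
The full Linux sequence looks roughly like this (the service name ollama.service matches the default install):

```bash
# Open an override file for the Ollama service:
sudo systemctl edit ollama.service

# In the editor, add:
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=4"
#   Environment="OLLAMA_MAX_LOADED_MODELS=2"

# Apply the change and restart the server:
sudo systemctl daemon-reload
sudo systemctl restart ollama
```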

5. Identifying Bottlenecks

If OLLAMA_NUM_PARALLEL is high but users are still reporting slowness, check your CPU and RAM. Running multiple requests at once multiplies the work done in the "Pre-fill" phase (processing each incoming prompt before generation starts), which is heavy on the CPU and memory bandwidth.
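
Two quick checks can narrow it down (nvidia-smi assumes NVIDIA hardware; other vendors have their own equivalents):

```bash
# Which models are loaded, how much memory each uses, and whether it is
# running on the GPU, the CPU, or split across both:
ollama ps

# VRAM usage and GPU utilization (NVIDIA only):
nvidia-smi
```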


Key Takeaways

  • Ollama can handle Parallel Requests with the correct configuration.
  • OLLAMA_NUM_PARALLEL determines how many users can chat simultaneously.
  • OLLAMA_MAX_LOADED_MODELS allows multiple models to stay in memory at once.
  • Each parallel slot requires additional VRAM for the context window.
