Module 13 Lesson 2: Multi-GPU Support

Parallel power. How to configure Ollama to use multiple graphics cards for giant 70B models.

Multi-GPU: Breaking the Memory Barrier

A single consumer GPU usually has 8GB or 12GB of VRAM. A 70B-parameter model (the "Smartest" level of open models) needs about 40GB even at 4-bit quantization. To run this, you need to link multiple GPUs together.

Ollama handles this automatically, but you need to know how it "splits" the brain.

1. Automatic Layer Splitting

When Ollama starts, it looks at every GPU in your system. If a model is too big to fit on GPU #1 alone, Ollama splits the layers across the cards. For example, it might:

  1. Put roughly 60% of the model layers on GPU #1.
  2. Put the remaining 40% on GPU #2.

While this works, it is slightly slower than a single GPU because data has to travel across the "PCIe Bus" between the cards.
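
If you want to see the split for yourself, NVIDIA's standard nvidia-smi utility (a general driver tool, not an Ollama feature) can report per-card memory usage while a model is loaded:

nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv

Run it before and after loading a model, and you should see the "memory.used" column jump on every card Ollama touched.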


2. Choosing Your GPUs

If you have three GPUs but only want Ollama to use two of them (perhaps you want to keep the third one for gaming or video editing), you use an Environment Variable:

CUDA_VISIBLE_DEVICES=0,1 ollama serve

This tells Ollama: "Only look at my first and second cards. Pretend the third one doesn't exist."
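
If you installed Ollama as a systemd service on Linux (which the official install script sets up by default), you can make this setting permanent instead of typing it before every launch. A minimal sketch using the standard systemd override mechanism:

sudo systemctl edit ollama.service

In the editor that opens, add the following two lines, save, and then run sudo systemctl restart ollama:

[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1"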


3. NVLink vs. Standard PCIe

  • Standard: The GPUs talk through the motherboard. It's affordable but has a "Speed Limit."
  • NVLink (Pro Level): A physical bridge connects the two GPUs at much higher speeds than PCIe. If you are building a professional AI workstation with dual RTX 3090s (the last consumer GeForce cards to offer an NVLink connector; the RTX 4090 dropped it), NVLink makes a multi-GPU setup feel much closer to a single giant card. You can check whether the bridge is active with the command below.
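
To verify the bridge is actually in use, NVIDIA's driver tools can report the link state directly. Again, this is a standard nvidia-smi subcommand, independent of Ollama:

nvidia-smi nvlink --status

If the output lists active links with their speeds, the bridge is working; if it reports no active links, the cards are falling back to the PCIe bus.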

4. The VRAM Pool

In multi-GPU setups, your VRAM adds up.

  • GPU 1 (12GB) + GPU 2 (12GB) = 24GB total VRAM.
  • This allows you to run a Q4 version of a 30B model comfortably at home (see the quick math below).
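
Quick sanity check (rough numbers; exact sizes vary by quantization variant): at Q4, each weight takes roughly half a byte, so a 30B model is about 30 × 0.5 ≈ 15GB of weights, often closer to 17-18GB with the higher-quality Q4 variants. Add a couple of GB for the context window (the KV cache) and overhead, and it still fits comfortably inside the 24GB pool.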

5. Monitoring the Split

You can see how Ollama chose to split the model by looking at your server.log (Module 12). Look for lines like:

"offloading 32 layers to GPU 0"
"offloading 12 layers to GPU 1"

If you see "offloading 0 layers to CPU", it means everything fit on your GPUs—congratulations!
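
A quick way to pull out just those lines: if Ollama runs as a systemd service on Linux, the server log goes to the journal, so you can filter it directly; on macOS, grep the log file under ~/.ollama/logs/server.log instead (paths may vary by install method):

journalctl -u ollama | grep -i offload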


Key Takeaways

  • Ollama automatically detects and uses all available GPUs.
  • VRAM stacks: Two small GPUs act as one large GPU for memory purposes.
  • Use CUDA_VISIBLE_DEVICES to control which hardware Ollama accesses.
  • Server logs reveal exactly how the model is distributed across your hardware.
