Module 7 Lesson 3: RAM and VRAM Optimization

Squeezing every drop of performance. How to force Ollama to use the GPU and manage shared memory.

RAM/VRAM Optimization: High-Speed AI

As we learned in Module 1, VRAM (Video RAM) is the "Highway" for AI. If your model fits in VRAM, it's fast. If it spills into System RAM, it's slow. Here is how to ensure your Highway is always clear.

1. Monitoring Your Usage

To see whether Ollama is actually using your GPU, watch these OS-specific tools while the model is generating a response (a quick check with Ollama itself follows the list):

  • Windows: Task Manager > Performance > GPU > "Dedicated Video Memory."
  • macOS: Activity Monitor > Window > GPU History.
  • Linux: nvidia-smi (for NVIDIA) or radeontop (for AMD).
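Ollama can also report this directly. On recent versions, the ollama ps command lists every loaded model and how it is split between GPU and CPU memory:

# Show loaded models and their GPU/CPU split
ollama ps

The PROCESSOR column reads something like "100% GPU" or "52%/48% CPU/GPU"; anything other than 100% GPU means part of the model has spilled into system RAM.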

2. Forcing GPU Offloading

Normally, Ollama automatically decides how many "Layers" of the model go to the GPU. Sometimes it is too conservative. You can force it to try to put everything on the GPU using a Modelfile (Module 5):

FROM llama3
PARAMETER num_gpu 99

Setting num_gpu to a very high number (like 99) tells Ollama: "Try to put every single layer on the GPU."
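To use it, save those two lines as a file named Modelfile, build a new model tag from it, and run it (the name gpu-llama3 is just an example):

# Build a new tag from the Modelfile, then load it
ollama create gpu-llama3 -f Modelfile
ollama run gpu-llama3

If the model genuinely does not fit, the load may fail with an out-of-memory error; in that case, lower num_gpu step by step until it loads, keeping as many layers on the GPU as you can.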


3. The "Memory Leakers" (Other Apps)

The most common reason Ollama becomes slow is that other apps are using your VRAM.

  • Web Browsers: Chrome (with many tabs) can use 1GB to 2GB of VRAM for hardware acceleration.
  • Video Editors: Apps like DaVinci Resolve or Premiere will grab every available bit of VRAM.
  • Games: Don't try to run a 70B model while playing Cyberpunk 2077.

Optimization Tip: If you are doing serious AI work, close your browser and any graphical apps. You will see an immediate boost in tokens-per-second.
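On an NVIDIA system you can see exactly which apps are holding VRAM before you load a model. Plain nvidia-smi prints a per-process memory table at the bottom, and the query flags below (standard nvidia-smi options) give you just the headline numbers:

# Full picture: utilization plus a per-process memory table
nvidia-smi

# Just the numbers: used vs. total VRAM
nvidia-smi --query-gpu=memory.used,memory.total --format=csv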


4. Flash Attention (Advanced)

If you have a modern NVIDIA GPU (RTX 30 or 40 series), Ollama can use Flash Attention. This is a more memory-efficient way of computing attention that noticeably reduces VRAM usage in long conversations. Ollama enables it automatically on supported hardware, but check the server logs for a flash attention entry (e.g. a line mentioning flash_attn) to confirm it is active.
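If the logs show it is disabled, you can request it explicitly with the OLLAMA_FLASH_ATTENTION environment variable before starting the server; whether it actually turns on still depends on your GPU and Ollama version:

# Linux/macOS: start the Ollama server with Flash Attention requested
OLLAMA_FLASH_ATTENTION=1 ollama serve

On Windows, set OLLAMA_FLASH_ATTENTION=1 as a system environment variable and restart the Ollama service.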


Key Takeaways

  • VRAM is the limiting factor for local AI speed.
  • Use system tools to verify if Ollama is hitting your GPU.
  • Closing GPU-heavy apps (Browsers, Games) frees up space for your model.
  • Use the num_gpu parameter in a Modelfile to override Ollama's automatic layer allocation when generation feels slower than it should.
