Module 7 Lesson 1: Model Caching

How Ollama handles memory, and why the second run is always faster than the first.

Model Caching: The Speed of Persistence

Have you noticed that the very first time you ask Ollama a question after rebooting your computer, it takes 5-10 seconds to start typing? But every question after that is instant? This is due to Model Caching.

1. How Caching Works

When you run a model, Ollama performs these steps:

  1. Disk Reading: It reads the 5GB+ GGUF file from your SSD.
  2. Inference Allocation: It carves out a slice of your RAM/VRAM.
  3. Loading: It moves the model weights into that slice.

Once the weights are in RAM, they stay there. This is the Cache. As long as the model stays in RAM, Ollama doesn't have to read from the SSD again.
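The steps above also explain the 5-10 second cold start: it is mostly step 1, streaming gigabytes off the disk. A back-of-envelope sketch (the throughput figures below are illustrative assumptions, not measured benchmarks):

```python
# Rough cold-start estimate: model size divided by disk read throughput.
# Throughput numbers used in the example are illustrative, not benchmarks.

def load_time_seconds(model_gb: float, disk_gb_per_s: float) -> float:
    """Approximate seconds to stream model weights from disk into RAM."""
    return model_gb / disk_gb_per_s

# A 5 GB model on a ~1 GB/s SATA SSD vs a ~3.5 GB/s NVMe drive:
print(load_time_seconds(5, 1.0))  # 5.0 seconds
print(load_time_seconds(5, 3.5))  # ~1.4 seconds
```

Once the weights are cached, this cost drops to zero, which is why warm responses start immediately.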


2. The keep_alive Setting

By default, Ollama keeps a model in your RAM for 5 minutes after your last request. If you don't send any more prompts, it "unloads" the model to free up your memory for other apps (like Chrome or video games).

You can change this globally or per-request:

  • Forever: keep_alive=-1 (The model never leaves RAM).
  • Instant: keep_alive=0 (The model is unloaded from RAM immediately after the answer).
  • Custom: keep_alive=10m (Stay in RAM for 10 minutes).
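Per-request, keep_alive is just a field in the JSON body you send to Ollama's /api/generate endpoint. A minimal sketch (the model name and prompt are placeholders; actually sending the request assumes a local Ollama server on its default port 11434):

```python
import json

def build_generate_payload(model: str, prompt: str, keep_alive="5m") -> dict:
    """Request body for POST /api/generate; keep_alive overrides the 5m default."""
    return {"model": model, "prompt": prompt, "keep_alive": keep_alive}

# Keep the model in RAM indefinitely after this request:
payload = build_generate_payload("llama3", "Hello!", keep_alive=-1)
print(json.dumps(payload))

# To actually send it (requires a running Ollama server):
#   requests.post("http://localhost:11434/api/generate", json=payload)
```

The same keep_alive field works on /api/chat, so a chat client can manage cache lifetime request by request.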

3. Multiple Models vs. Cache

If you have 16GB of RAM and you run two 8GB models, Ollama has a choice:

  • Option A: Keep both in RAM (if they fit).
  • Option B: Kick the first one out to make room for the second one.

Checking ollama ps will show you exactly which models are currently "parked" in your memory, along with how much longer each one will stay loaded.
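The CLI command has a programmatic counterpart: GET /api/ps returns the models currently resident in memory. A sketch that extracts their names (the sample response below is illustrative, not real server output):

```python
# Programmatic equivalent of `ollama ps`: GET /api/ps lists loaded models.
# The sample response here is illustrative, not captured from a real server.
sample = {
    "models": [
        {"name": "llama3:latest", "size": 8589934592,
         "expires_at": "2024-01-01T12:05:00Z"},
    ]
}

def parked_models(ps_response: dict) -> list:
    """Names of the models currently cached in RAM/VRAM."""
    return [m["name"] for m in ps_response.get("models", [])]

print(parked_models(sample))  # ['llama3:latest']

# Against a live server:
#   parked_models(requests.get("http://localhost:11434/api/ps").json())
```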


4. Why This Matters for Developers

If you are building a Python script that calls Ollama every 30 minutes, the default 5-minute timeout means the model is unloaded long before the next run, so every run pays the full "Disk-to-RAM" transfer cost and starts slow.

Pro Tip: For automated background tasks, set keep_alive to at least 40 minutes or use -1 to keep the model "Hot" and ready to respond instantly.
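One way to apply the tip is to derive keep_alive from the task's own interval plus a safety margin. The helper below is a hypothetical sketch, not part of the Ollama API:

```python
def keep_alive_for(interval_minutes: int, margin_minutes: int = 10) -> str:
    """keep_alive duration string that outlives a periodic task's interval.

    For a task that runs every 30 minutes, a 10-minute margin yields "40m",
    so the model is still cached when the next run arrives. Use -1 instead
    if RAM is plentiful and the model should stay hot permanently.
    """
    return f"{interval_minutes + margin_minutes}m"

print(keep_alive_for(30))  # '40m'
```

Pass the resulting string as the keep_alive field of each request, and the cache never goes cold between runs.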


Key Takeaways

  • Caching stores model weights in RAM for instant responses.
  • The default keep-alive time is 5 minutes.
  • Use keep_alive=-1 to prevent the model from ever being unloaded.
  • Avoid running too many distinct models simultaneously to prevent "Cache Thrashing" (constantly loading/unloading from disk).
