Module 7 Lesson 4: Context Window Tuning

Stability over scope. Why lowering your context window can actually make your AI feel faster and more stable.

Context Window Tuning: Less is More

In Module 4, we learned that the Context Window is the AI's "Short-Term Memory." While it's tempting to set it to 100,000 tokens, doing so can kill your performance.

Here is how to tune num_ctx for maximum stability.

1. The VRAM-to-Context Equation

Every token in the context window consumes VRAM; a rough way to estimate the cost is sketched after this list.

  • Small Context (2,048): Very light. The model stays stable even on 8GB of VRAM.
  • Medium Context (8,192): Standard. Fits on most modern GPUs.
  • Large Context (32,768+): Heavy. Can push a 12GB GPU into the "Slow Zone" (spilling over into system RAM).
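
To put rough numbers on that, here is a back-of-the-envelope estimate of the KV cache (the memory the context window actually claims). The figures below assume Llama 3 8B's architecture (32 layers, 8 key/value heads, a head dimension of 128) and an fp16 cache; other models, and quantized KV caches, will shift the numbers, so treat this as a sketch rather than a spec.

  # Per-token KV-cache cost: 2 (K and V) x layers x kv_heads x head_dim x 2 bytes (fp16)
  LAYERS=32 KV_HEADS=8 HEAD_DIM=128 FP16_BYTES=2
  PER_TOKEN=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * FP16_BYTES ))  # 131,072 bytes (~128 KiB)
  for CTX in 2048 8192 32768; do
    echo "num_ctx=$CTX -> ~$(( PER_TOKEN * CTX / 1024 / 1024 )) MiB of KV cache"
  done

That works out to roughly 256 MiB at 2,048 tokens, 1 GiB at 8,192, and 4 GiB at 32,768, which is why the large tier lands so hard on your VRAM budget.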

2. Why Tune Down?

If you are building a simple chatbot or a translation tool, you do not need 8,000 tokens of memory. By setting PARAMETER num_ctx 2048 in your Modelfile (a full example follows this list):

  1. Lower VRAM usage: You may be able to run a "smarter" model (a 14B instead of an 8B, say) because you saved VRAM on the context window.
  2. Faster TTFT (Time To First Token): The model spends less time pre-allocating the memory buffer before it can respond.
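
Here is a minimal sketch of that workflow. The base model (llama3:8b) and the tag (llama3-lean) are placeholders; substitute whatever you actually run.

  # Write a Modelfile that caps the context window at 2,048 tokens.
  printf 'FROM llama3:8b\nPARAMETER num_ctx 2048\n' > Modelfile

  # Build the lean variant and run it.
  ollama create llama3-lean -f Modelfile
  ollama run llama3-lean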

3. When to Tune Up?

You should only increase the context window for tasks like these (a per-request override is sketched after the list):

  • Code Review: When you need the AI to see 10 large files at once.
  • Legal/Academic Synthesis: Reading and comparing multiple PDFs.
  • Creative Writing: Writing a long chapter where the AI needs to remember what happened on page 1.
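
For one-off jobs like these, you do not have to bake a new Modelfile: Ollama's REST API accepts a per-request num_ctx override. A sketch, assuming a default local install on port 11434 and llama3:8b as a placeholder model:

  # Request a 32k context for this call only.
  curl http://localhost:11434/api/generate -d '{
    "model": "llama3:8b",
    "prompt": "Compare the two attached contracts...",
    "options": { "num_ctx": 32768 }
  }'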

4. How to Change Context Safely

If you are moving to a large context (e.g., 64,000 tokens), follow these steps:

  1. Check your total VRAM.
  2. Do the math: A 64k context for Llama 3 can take roughly 4GB of extra VRAM (the exact figure depends on the model's architecture and KV-cache precision).
  3. Subtract and Test: If your model weights take 5GB and your context takes 4GB, you need 9GB total. With only 8GB of VRAM, part of the load spills into system RAM and the model will be extremely slow.

Recommendation: Increase context in increments of 4096 and check ollama ps to see how much memory is being claimed.
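
In practice, that check loop looks something like this (the warm-up prompt and the model name are placeholders):

  # Load the model at the new context size...
  curl -s http://localhost:11434/api/generate -d '{
    "model": "llama3:8b",
    "prompt": "warm-up",
    "options": { "num_ctx": 4096 }
  }' > /dev/null

  # ...then check what was claimed. SIZE is total memory; PROCESSOR shows the
  # GPU/CPU split. Anything under 100% GPU means you are in the Slow Zone.
  ollama ps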


Key Takeaways

  • Context usage is dynamic: It grows as the conversation gets longer.
  • Lowering num_ctx frees up VRAM for larger, smarter models.
  • Increasing num_ctx above 16k requires a high-end GPU or Mac Studio.
  • Always match your context window to the specific task—never use "Max" by default.
