Module 7 Lesson 4: Context Window Tuning
Stability over scope. Why lowering your context window can actually make your AI feel faster and more stable.
Context Window Tuning: Less is More
In Module 4, we learned that the Context Window is the AI's "Short-Term Memory." While it's tempting to set this to 100,000 tokens, doing so can kill your performance.
Here is how to tune num_ctx for maximum stability.
1. The VRAM-to-Context Equation
Every token in the context window costs VRAM: the runtime allocates a key/value (KV) cache sized for the full num_ctx when the model loads, whether you fill it or not.
- Small Context (2,048): Very light. The model stays stable even on 8GB of VRAM.
- Medium Context (8,192): Standard. Fits on most modern GPUs.
- Large Context (32,768+): Heavy. Can push a 12GB GPU into the "Slow Zone," spilling into system RAM.
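As a rough back-of-envelope check, here is the arithmetic behind those tiers. This is a sketch assuming an 8B Llama-3-class model (32 layers, 8 KV heads, head dimension 128) with the default fp16 KV cache; exact figures vary by model and KV-cache quantization:

    KV cache per token ≈ 2 (K and V) × 32 layers × 8 KV heads × 128 dims × 2 bytes ≈ 128 KB
    2,048 tokens  × 128 KB ≈ 0.25 GB
    8,192 tokens  × 128 KB ≈ 1 GB
    32,768 tokens × 128 KB ≈ 4 GB

The jump from Medium to Large is where most consumer GPUs start to run out of headroom.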
2. Why Tune Down?
If you are building a simple chat bot or a translation tool, you do not need 8,000 tokens of memory.
By setting PARAMETER num_ctx 2048 in your Modelfile (a full example follows this list):
- Lower VRAM usage: You might be able to run a "smarter" model (a 14B instead of an 8B) because the VRAM you save on the context window can go toward model weights.
- Faster TTFT (time to first token): The model spends less time "pre-calculating" the memory buffer.
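Here is that minimal Modelfile sketch (the base model tag and the name lean-chat are placeholders, not requirements):

    # Modelfile: a lean chat assistant with a trimmed context window
    FROM llama3.1:8b
    PARAMETER num_ctx 2048
    SYSTEM """You are a concise, friendly chat assistant."""

Build and run it with:

    ollama create lean-chat -f Modelfile
    ollama run lean-chat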
3. When to Tune UP?
You should only increase the context window for tasks like these (a per-request example follows the list):
- Code Review: When you need the AI to see 10 large files at once.
- Legal/Academic Synthesis: Reading and comparing multiple PDFs.
- Creative Writing: Writing a long chapter where the AI needs to remember what happened on page 1.
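For occasional big-context jobs, you do not have to bake the larger window into a new Modelfile; you can raise num_ctx per request through Ollama's REST API. A minimal sketch (model tag and prompt are placeholders):

    curl http://localhost:11434/api/generate -d '{
      "model": "llama3.1:8b",
      "prompt": "Review the following files for bugs: ...",
      "options": { "num_ctx": 32768 }
    }'

In the interactive CLI, /set parameter num_ctx 32768 does the same thing for the current session.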
4. How to Change Context Safely
If you are moving to a large context (e.g., 64,000), follow these steps:
- Check your total VRAM.
- Calculate the cost: A 64k context for an 8B Llama 3 model can take ~4GB of extra VRAM with a quantized (q8_0) KV cache, and roughly double that at the default fp16 precision.
- Subtract and Test: If your model weighs 5GB and your context takes 4GB, you need 9GB total. With only 8GB of VRAM, the overflow spills into system RAM and the model will be extremely slow.
Recommendation: Increase context in increments of 4096 and check ollama ps to see how much memory is being claimed.
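A typical check looks like this (the ID, sizes, and percentages are illustrative, not real output):

    $ ollama ps
    NAME           ID              SIZE      PROCESSOR          UNTIL
    llama3.1:8b    a80c4f17acd5    9.2 GB    24%/76% CPU/GPU    4 minutes from now

If SIZE exceeds your card's VRAM, or the PROCESSOR column shows a CPU share like the one above, part of the model has spilled into system RAM; back the context off by one increment.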
Key Takeaways
- Context usage is dynamic: the conversation fills the window as it grows, but the VRAM for the window is reserved up front based on num_ctx.
- Lowering num_ctx frees up VRAM for larger, smarter models.
- Increasing num_ctx above 16k generally requires a high-VRAM GPU or a Mac with ample unified memory (such as a Mac Studio).
- Always match your context window to the specific task—never use "Max" by default.