# Module 4, Lesson 4: Context Length and Tokens
How much can the AI remember? Understanding the relationship between context windows and RAM usage.
## Context Length: The AI's Working Memory
One of the most common frustrations in local AI is the model "forgetting" the start of a long conversation. That limit is set by the Context Length (also called the Context Window).
### 1. What is a Token?
Before we talk about length, we need to define the unit of measurement: the Token.
- A token is roughly 4 characters or 0.75 words.
- The word "fantastic" might be 1 token.
- The word "apple" might be 1 token.
- A technical term like "quantization" might be 3 tokens (`quant`, `iz`, `ation`).
When we say a model has an 8k context window, it can hold roughly 8,192 tokens, or about 6,000 words, in its head at once.
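If you want to see tokenization in action, here is a minimal sketch using `tiktoken`, OpenAI's open-source tokenizer. Llama, Mistral, and friends each use their own vocabulary, so the exact splits will differ from what Ollama's models do internally, but the principle is identical (install with `pip install tiktoken`):

```python
import tiktoken

# Load a BPE vocabulary and split a few words into tokens.
enc = tiktoken.get_encoding("cl100k_base")

for word in ["apple", "fantastic", "quantization"]:
    ids = enc.encode(word)                     # token IDs
    pieces = [enc.decode([i]) for i in ids]    # the text each ID maps back to
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")
```

Short, common words usually come back as a single token; rarer technical terms get split into several pieces.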
### 2. The Context Window
The context window must hold the sum of:
- Your System Prompt (instructions).
- Every previous message in the chat history.
- Your new prompt.
- The model's current response.
If you hit the limit (e.g., 8,192 tokens), the model has to "forget" the oldest messages to make room for new ones. This is why the AI might forget your name halfway through a 50-page document analysis.
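To make the "forgetting" concrete, here is a minimal sketch of the sliding-window truncation most chat runtimes perform. The function names and the 4-characters-per-token estimate are illustrative assumptions, not Ollama's actual implementation (which counts exact tokens):

```python
def estimate_tokens(text: str) -> int:
    """Crude heuristic: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def trim_history(system_prompt: str, history: list[str], limit: int = 8192) -> list[str]:
    """Drop the oldest messages until everything fits in the window.

    The system prompt is always kept; `history` is ordered oldest-first.
    """
    budget = limit - estimate_tokens(system_prompt)
    kept: list[str] = []
    for message in reversed(history):   # walk newest to oldest
        cost = estimate_tokens(message)
        if cost > budget:
            break                       # window full: everything older is dropped
        kept.append(message)
        budget -= cost
    return list(reversed(kept))         # restore chronological order
```

Notice that the sketch reserves room for the system prompt first: that is why your instructions survive even as early chat messages fall out of the window.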
### 3. The RAM Cost of Context
This is the part many people miss: Context uses RAM/VRAM.
Unlike the model weights (which stay the same size), the "KV cache" (the key/value tensors the model stores for every token it has seen so far) grows as the conversation gets longer.
- Small context (4k): ~200 MB of extra VRAM.
- Large context (128k): can take 10 GB+ of extra VRAM, even if the model itself is small.
This is why Ollama often defaults to a smaller context (usually 4k or 8k) even if the model can support 128k.
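A back-of-the-envelope formula makes the growth obvious: the cache stores a Key and a Value vector for every token, in every layer, for every KV head. The dimensions below are assumptions modeled on a Llama-3-8B-style architecture with an fp16 cache; real figures vary by model and shrink further if the runtime quantizes the cache (which is how the smaller numbers above are possible):

```python
def kv_cache_bytes(n_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """KV-cache size: 2 vectors (K and V) per token, per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 2**30:.1f} GiB")
```

With these assumptions, 4k of context costs about 0.5 GiB while the full 128k costs 16 GiB, before the model weights are even counted.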
### 4. How to Change Context in Ollama
In the CLI, you can see the current limit with `/show info`.
If you want to increase it for a specific session, you can use:

```
/set parameter num_ctx 32768
```
**Warning:** If you set this too high and don't have enough RAM, Ollama will crash or become incredibly slow as memory spills over to your swap disk.
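If you prefer to script this, Ollama's REST API accepts the same parameter per request via the `options` field. Here is a minimal sketch against a local server on the default port; the model name and prompt are placeholders:

```python
import requests

# One-off request with a 32k context window against a local Ollama server.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",                # placeholder: any model you have pulled
        "prompt": "Summarise the following document: ...",
        "stream": False,                  # return a single JSON object, not a stream
        "options": {"num_ctx": 32768},    # context window for this request only
    },
)
print(response.json()["response"])
```

To make the setting permanent for a model, bake it into a Modelfile with `PARAMETER num_ctx 32768` instead of setting it per session.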
## Summary Comparison
| Model | Default Context | Max Supported Context | Ideal Use Case |
|---|---|---|---|
| Llama 3 (8B) | 8k | 128k | Daily chat, short summaries. |
| Mistral | 8k | 32k | Fast creative writing. |
| Command R | 8k | 128k | RAG (Searching long documents). |
## Key Takeaways
- Tokens are the units LLMs use (roughly 0.75 words per token).
- The Context Window defines the "short-term memory" of the model.
- Increasing context increases memory (RAM) usage significantly.
- Use `num_ctx` in a Modelfile or the CLI to adjust the limit based on your hardware.