Module 4 Lesson 4: Context Length and Tokens

How much can the AI remember? Understanding the relationship between context windows and RAM usage.

Context Length: The AI's Working Memory

One of the most common frustrations in local AI is the model "forgetting" the start of a long conversation. This happens because the model's memory is capped by the Context Length (also called the Context Window).

1. What is a Token?

Before we talk about length, we need to define the unit of measurement: the Token.

  • A token is roughly 4 characters or 0.75 words.
  • The word "fantastic" might be 1 token.
  • The word "apple" might be 1 token.
  • A technical term like "quantization" might be 3 tokens (quant, iz, ation).

When we say a model has an 8k (8,192-token) context window, it can hold roughly 6,000 words in its head at once.
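
Exact counts vary because every model family ships its own tokenizer. As a quick illustration, here is a sketch using the tiktoken library (OpenAI's BPE tokenizer, standing in here for whichever tokenizer your local model actually uses):

    import tiktoken  # pip install tiktoken

    # NOTE: tiktoken implements OpenAI's BPE encodings, not Llama's or
    # Mistral's tokenizer, so counts will differ slightly per model family.
    enc = tiktoken.get_encoding("cl100k_base")

    for text in ["apple", "fantastic", "quantization", "The quick brown fox"]:
        tokens = enc.encode(text)
        print(f"{text!r} -> {len(tokens)} token(s): {tokens}")

Running this makes the rule of thumb concrete: short common words map to a single token, while rarer technical terms split into several pieces.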


2. The Context Window

The context window is the sum of:

  1. Your System Prompt (instructions).
  2. Every previous message in the chat history.
  3. Your new prompt.
  4. The model's current response.

If you hit the limit (e.g., 8,192 tokens), the model has to "forget" the oldest messages to make room for new ones. This is why the AI might forget your name halfway through a 50-page document analysis.
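
Under the hood, this eviction is just a sliding window over the chat history. Here is a minimal sketch of the idea (the function names and the 4-characters-per-token heuristic are illustrative; real runtimes typically keep the system prompt pinned, as this sketch does):

    # Illustrative sketch: evict the oldest turns once the token budget is hit.
    # count_tokens() is a crude stand-in for the model's real tokenizer.

    def count_tokens(text: str) -> int:
        return len(text) // 4  # rough heuristic: ~4 characters per token

    def trim_history(system_prompt: str, history: list[str], budget: int) -> list[str]:
        """Keep the system prompt pinned; drop the oldest messages until we fit."""
        used = count_tokens(system_prompt)
        kept: list[str] = []
        for message in reversed(history):  # walk newest-first so recent turns survive
            cost = count_tokens(message)
            if used + cost > budget:
                break  # everything older than this point is "forgotten"
            kept.append(message)
            used += cost
        return list(reversed(kept))  # restore chronological order

If your name was mentioned in one of the evicted messages, it is simply gone from the model's view.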


3. The RAM Cost of Context

This is the part many people miss: Context uses RAM/VRAM.

Unlike the model weights (which stay the same size no matter how long you chat), the "KV Cache" (the memory the model uses to track context) grows linearly with the number of tokens in the window.

  • Small context (4k): ~200MB of extra VRAM.
  • Large context (128k): can take 10GB+ of extra VRAM, even if the model itself is small.

This is why Ollama often defaults to a smaller context (usually 4k or 8k) even if the model can support 128k.
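
You can estimate the cost yourself: the cache stores a Key and a Value vector per token, per layer, so its size is roughly 2 × layers × KV heads × head dimension × bytes per value × tokens. The sketch below uses assumed parameters modeled on a Llama-3-8B-style architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16 cache); check your model card for real values, and note that cache quantization can cut these numbers substantially:

    # Back-of-envelope KV-cache estimate. The architecture numbers below are
    # assumptions modeled on a Llama-3-8B-style config (32 layers, grouped-query
    # attention with 8 KV heads, head dim 128, fp16 cache); yours will differ.

    def kv_cache_bytes(context_len: int,
                       n_layers: int = 32,
                       n_kv_heads: int = 8,
                       head_dim: int = 128,
                       bytes_per_value: int = 2) -> int:
        # 2x because we store both a Key and a Value per token, per layer.
        return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_len

    for ctx in (4_096, 8_192, 32_768, 131_072):
        gib = kv_cache_bytes(ctx) / 2**30
        print(f"{ctx:>7} tokens -> ~{gib:.1f} GiB of KV cache")

Many runtimes (including Ollama's llama.cpp backend) allocate the full cache up front, so raising the context limit costs memory even before the conversation actually gets long.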


4. How to Change Context in Ollama

In the CLI, you can see the current limit with /show info. If you want to increase it for a specific session, use:

    /set parameter num_ctx 32768
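
To make the change stick across sessions, you can instead bake the parameter into a Modelfile (this is standard Ollama Modelfile syntax; the base model and the llama3-32k name below are just examples):

    # Modelfile -- example names, substitute your own model
    FROM llama3
    PARAMETER num_ctx 32768

Then build and run your variant:

    ollama create llama3-32k -f Modelfile
    ollama run llama3-32k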

Warning: If you set this too high and don't have enough RAM, Ollama will crash or become incredibly slow as the system pushes memory out to swap space on disk.


Summary Comparison

Model          | Default Context | Max Supported Context | Ideal Use Case
Llama 3.1 (8B) | 8k              | 128k                  | Daily chat, short summaries
Mistral (7B)   | 8k              | 32k                   | Fast creative writing
Command R      | 8k              | 128k                  | RAG (searching long documents)

Key Takeaways

  • Tokens are the units LLMs use (roughly 0.75 words per token).
  • The Context Window defines the model's "short-term memory."
  • Increasing context increases memory (RAM) usage significantly.
  • Use num_ctx in a Modelfile or CLI to adjust the limit based on your hardware.
