
Module 2 Lesson 3: Token Limits and Context Windows
Why does the AI forget what you said 20 minutes ago? In our final lesson of Module 2, we explore the 'Context Window' and the hard limits of model memory.
Every Large Language Model has a limit to how much information it can "think about" at one time. This limit is called the Context Window.
If you think of an LLM as a student sitting at a desk, the Context Window is the size of that desk. They can only work with the books and papers that fit on it. Everything else has to go in a filing cabinet (long-term memory) or be thrown away.
1. What is the Context Window?
The Context Window is the maximum number of tokens a model can process in a single interaction. This includes:
- System Prompt: The instructions given to the model (e.g., "You are a helpful assistant").
- Conversation History: All the previous messages between you and the AI.
- New Input: Your latest question or text.
- The Generation: The tokens the model is about to write.
If the sum of these exceeds the limit, the earliest messages have to be dropped, so the model effectively "forgets" them to make room for the new input.
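To make the bookkeeping concrete, here is a minimal sketch of how an application might budget a request against the window. The 1.3 tokens-per-word heuristic, the 8,000-token window, and the reserved output size are illustrative assumptions; a real application would count tokens with the model's own tokenizer.

```python
# Rough token budgeting for a single chat request.
# Assumption: ~1.3 tokens per English word is only a heuristic;
# real applications count tokens with the model's actual tokenizer.

CONTEXT_WINDOW = 8_000     # hypothetical model limit
MAX_OUTPUT_TOKENS = 500    # room reserved for the generation itself

def estimate_tokens(text: str) -> int:
    """Approximate a token count from a word count."""
    return int(len(text.split()) * 1.3)

def fits_in_window(system_prompt: str, history: list[str], new_input: str) -> bool:
    used = (
        estimate_tokens(system_prompt)
        + sum(estimate_tokens(msg) for msg in history)
        + estimate_tokens(new_input)
        + MAX_OUTPUT_TOKENS
    )
    return used <= CONTEXT_WINDOW

system = "You are a helpful assistant."
history = ["What is a token?", "A token is a small chunk of text..."]
print(fits_in_window(system, history, "And what is a context window?"))  # True
```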
2. Why is there a Limit?
You might wonder: "In the age of cloud computing, why can't we have an infinite context window?"
The answer is Compute Complexity. In the standard Transformer architecture (which we will study in Module 5), the amount of work the computer has to do grows quadratically with the length of the input.
- If you double the text, the computation doesn't just double; it increases by 4 times.
- This makes processing very long text extremely expensive and slow.
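You can see the quadratic growth with a few lines of arithmetic. Self-attention compares every token with every other token, so the number of comparisons is roughly the input length squared (the lengths below are arbitrary):

```python
# Self-attention compares each token with every other token,
# so the work grows roughly with the square of the input length.
for length in [1_000, 2_000, 4_000, 8_000]:
    comparisons = length * length
    print(f"{length:>5} tokens -> {comparisons:>12,} pairwise comparisons")

# Doubling the input (1,000 -> 2,000 tokens) multiplies the comparisons
# by 4 (1,000,000 -> 4,000,000), matching the "4 times" rule above.
```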
```mermaid
graph LR
subgraph "Small Context (8k)"
    C1["Fast Inference"]
    P1["Low Cost"]
end
subgraph "Large Context (128k+)"
    C2["Slower Inference"]
    P2["High Cost"]
    R1["Better Reasoning over Long Documents"]
end
```
3. The "Needle in a Haystack" Problem
Even if a model has a huge context window (say, 1 million tokens), it can still struggle to find a specific fact buried in the middle of that text. This is a common benchmark in AI research: can the model find the "needle" (the fact) in the "haystack" (the massive surrounding context)?
As a general rule: Models are best at remembering facts at the very beginning and the very end of a prompt.
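The benchmark itself is easy to sketch. The snippet below builds a long haystack prompt with a known fact buried at a chosen depth; the filler sentence and the "needle" fact are made-up placeholders, and the resulting prompt would be sent to whichever model you want to test:

```python
# Build a "needle in a haystack" test prompt.
# The filler text and the needle fact are illustrative placeholders.

FILLER = "The sky was grey and the streets were quiet that morning. "
NEEDLE = "The secret code for the vault is 4921. "

def build_haystack(total_sentences: int, needle_depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER] * total_sentences
    sentences.insert(int(total_sentences * needle_depth), NEEDLE)
    return "".join(sentences) + "\n\nQuestion: What is the secret code for the vault?"

# The model is then scored on whether it answers "4921" when the needle
# sits at the start, the middle, or the end of the haystack.
prompt = build_haystack(total_sentences=2_000, needle_depth=0.5)
print(len(prompt.split()), "words in the haystack prompt")
```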
4. Managing the Window in Apps
Because the window is finite, developers use two main strategies:
- Sliding Window: Dropping the oldest messages as new ones arrive (see the sketch after this list).
- Summarization: Periodically taking the conversation history and asking the LLM to summarize it into a compact format, then using that summary as the "memory."
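Here is a minimal sketch of the sliding-window strategy. It reuses the rough tokens-per-word heuristic from above; a real application would count tokens with the model's tokenizer and would keep the system prompt pinned so it is never dropped:

```python
# Sliding-window history management: drop the oldest chat messages
# until what remains fits inside the token budget left for history.

HISTORY_BUDGET = 4_000  # hypothetical tokens available for past messages

def estimate_tokens(text: str) -> int:
    return int(len(text.split()) * 1.3)  # rough heuristic, not a real tokenizer

def trim_history(messages: list[str], budget: int = HISTORY_BUDGET) -> list[str]:
    """Remove messages from the front until the remainder fits the budget."""
    trimmed = list(messages)
    while trimmed and sum(estimate_tokens(m) for m in trimmed) > budget:
        trimmed.pop(0)  # forget the oldest message first
    return trimmed
```

Summarization works at the same point in the pipeline: instead of discarding the oldest messages outright, the application asks the model to compress them into a short summary and keeps that summary at the front of the history.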
Lesson Exercise
Goal: Estimate the cost of a long conversation.
- Imagine you have a conversation that is 10,000 words long.
- Multiply words by 1.3 to get an approximate token count (~13,000 tokens).
- If a model costs $0.01 per 1k tokens, how much does one "Reply" cost if the model has to re-read the entire history every time?
Observation: This is why "Short and Concise" prompts aren't just polite—they're cheaper!
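A quick back-of-the-envelope check with the numbers from the exercise (the price is the hypothetical rate given above, not a real provider's pricing):

```python
# Cost of one reply when the entire history is re-read every turn.
words = 10_000
tokens = int(words * 1.3)      # ~13,000 tokens
price_per_1k = 0.01            # hypothetical $ per 1,000 input tokens
cost_per_reply = tokens / 1_000 * price_per_1k
print(f"{tokens:,} tokens -> ${cost_per_reply:.2f} per reply")  # ~$0.13
# Across a 50-turn conversation, re-reading the history each turn adds up fast.
```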
Conclusion of Module 2
You've now mastered the "Digital Senses" of an LLM:
- Lesson 1: Why computers need numbers.
- Lesson 2: How Tokenization slices text into those numbers.
- Lesson 3: The limits of how many tokens a model can hold at once.
Coming Up in Module 3: We solve the final mystery of text representation. If a token is just an ID like 1530, how does the model know it's a "Fruit"? We'll learn about Embeddings—Meaning as Numbers.