
Module 6 Lesson 2: Sampling Strategies
Why does an LLM give different answers to the same question? In this lesson, we learn about Temperature, Top-k, and Top-p—the knobs we use to control AI creativity.
In the previous lesson, we learned that LLMs predict the next token by calculating probabilities for every word in their vocabulary.
But what happens once the model has those probabilities? Does it always just pick the word with the highest score? If it did, it would be very boring and repetitive. To add variety and creativity, we use Sampling Strategies.
1. Greedy Decoding (The Boring Way)
The simplest strategy is Greedy Decoding. The model looks at the list and simply picks the token with the highest probability.
- Pros: Fast and predictable.
- Cons: Repetitive. The model can get stuck in "loops" (e.g., "I like to eat apples because apples are good for eating apples...").
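To make this concrete, here is a minimal sketch of greedy decoding over a toy probability table (the tokens and the numbers are invented for illustration):

```python
# A toy next-token probability table (made-up values for illustration).
probs = {"apples": 0.50, "bananas": 0.30, "oranges": 0.15, "potato": 0.05}

def greedy_pick(token_probs):
    # Greedy decoding: always take the single highest-probability token.
    return max(token_probs, key=token_probs.get)

print(greedy_pick(probs))  # -> 'apples', every single time
```

Because the choice is deterministic, running this a thousand times produces the same token a thousand times, which is exactly where the repetitive loops come from.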
2. Temperature (The Creativity Knob)
Temperature is a mathematical scaling factor applied to the model's raw scores (logits) just before they are turned into probabilities and a choice is made.
- Low Temperature (e.g., 0.1): Makes the high probabilities even higher and the low ones lower. The model becomes very focused and picks the same words every time. (Best for code and factual summaries).
- High Temperature (e.g., 0.8+): Flattens the probabilities, making middle-of-the-road words more likely to be picked. This increases "creativity" but also makes the model more prone to nonsense (hallucinations).
```mermaid
graph TD
    Scores["Raw Scores: 'A' (40%), 'B' (10%)"] --> Temp["Temperature Control"]
    Temp -- "T=0.1" --> Focused["'A' (99%), 'B' (1%)"]
    Temp -- "T=1.5" --> Random["'A' (30%), 'B' (25%)"]
    Focused --> Choice1["Likely picks 'A'"]
    Random --> Choice2["Could pick 'A' or 'B'"]
```
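Here is a minimal sketch of temperature scaling, assuming we have the model's raw scores (logits) for two tokens; the numbers are invented to mirror the diagram above:

```python
import math

# Temperature divides the logits before softmax:
#   T < 1 sharpens the distribution, T > 1 flattens it.
def softmax_with_temperature(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                            # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5]                            # hypothetical scores for tokens 'A' and 'B'
print(softmax_with_temperature(logits, 0.1))   # ~[1.00, 0.00] -> very focused
print(softmax_with_temperature(logits, 1.5))   # ~[0.73, 0.27] -> much flatter
```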
3. Top-k and Top-p (The Selection Filters)
To prevent the model from picking a completely nonsensical word (like picking "Potato" as the next word in a poem about the sun), we use filters.
Top-k Sampling
The model only considers the top K most likely words (e.g., K=50) and ignores the rest of the 50,000+ words in its vocabulary.
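A minimal sketch of the idea, assuming a toy dictionary of token probabilities (the words and numbers are invented):

```python
# Top-k filtering: keep only the k most likely tokens, then renormalise.
def top_k_filter(token_probs, k):
    kept = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

probs = {"sun": 0.60, "sky": 0.25, "light": 0.10, "potato": 0.05}
print(top_k_filter(probs, 2))   # {'sun': ~0.71, 'sky': ~0.29} -- 'potato' is gone
```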
Top-p (Nucleus) Sampling
This is more modern. The model adds up the probabilities of the top words until it reaches a total probability of P (e.g., P=0.9 or 90%).
- If the model is very sure, Top-p might only include 2 words.
- If the model is confused, Top-p might include 200 words.
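Here is a sketch of the nucleus filtering step under the same toy-dictionary assumption; notice how the number of surviving tokens depends on how "confident" the distribution is:

```python
# Top-p (nucleus) filtering: keep the smallest set of top tokens whose
# cumulative probability reaches p, then renormalise.
def top_p_filter(token_probs, p):
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {tok: prob / total for tok, prob in kept.items()}

confident = {"sun": 0.85, "sky": 0.10, "light": 0.04, "potato": 0.01}
uncertain = {"sun": 0.30, "sky": 0.25, "light": 0.25, "potato": 0.20}
print(len(top_p_filter(confident, 0.9)))  # 2 tokens make the cut
print(len(top_p_filter(uncertain, 0.9)))  # 4 tokens make the cut
```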
4. Why this matters for Developers
When you build an AI application, choosing the right sampling strategy can make or break the user experience.
- Email Automation: Use low temperature (0.2) so the emails stay professional and consistent.
- Gaming NPC: Use high temperature (0.9) so the characters seem alive and unpredictable.
Lesson Exercise
Goal: Compare Sampling results.
Imagine the model predicts:
- Cat: 45%
- Dog: 40%
- Bird: 10%
- Zebra: 5%
- Greedy Decoding: Which word is picked?
- Top-k (K=2): Which words are considered? Is "Zebra" ever possible?
- Low Temperature: Will the gap between Cat and Zebra get bigger or smaller?
Observation: You can see how these math "knobs" completely change the statistical fate of the output!
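Once you have worked through the questions by hand, this small sketch (using the same made-up probabilities) lets you check your answers:

```python
import math

probs = {"Cat": 0.45, "Dog": 0.40, "Bird": 0.10, "Zebra": 0.05}

# Greedy decoding: the single most likely token.
print(max(probs, key=probs.get))

# Top-k with K=2: only the two most likely tokens remain in the running.
top2 = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top2)

# Low temperature (T=0.2): turn probabilities back into scores (log-probs),
# scale by 1/T, and renormalise.
T = 0.2
exps = {tok: math.exp(math.log(p) / T) for tok, p in probs.items()}
total = sum(exps.values())
print({tok: round(e / total, 4) for tok, e in exps.items()})
# Prints the rescaled distribution -- notice what happens to the Cat/Zebra gap.
```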
Summary
In this lesson, we established:
- Sampling determines how we select the next token from the probability list.
- Temperature controls the "randomness/variety" of the output.
- Top-k and Top-p protect the model from picking extremely low-probability (nonsense) words.
Next Lesson: We wrap up Module 6 by looking at the outcome of these choices. We'll discuss Why Outputs Change and the trade-offs between Determinism and Control.