Module 6 Lesson 1: From Prompt to Output

How does a set of math formulas actually write a story? In this lesson, we look at the 'Inference' phase—the step-by-step process of turning a prompt into a response.

Up until now, we've focused on how a model is built and trained. Now it's time to see it in action. When you type a message into a chat window and press Enter, the model enters the Inference phase.

Unlike training, which can take weeks or months, inference happens in seconds. In this lesson, we will follow a single prompt through the model to see how it "predicts" its way to a full response.


1. Step 1: Ingesting the Prompt

When you send a prompt like "Write a poem about a duck," the model doesn't just start writing.

  1. Tokenization: Your sentence is sliced into tokens (see the sketch after this list).
  2. Forward Pass: Those tokens are fed through all the attention layers we studied in Module 5.
  3. The Context State: The model builds an internal "summary" of everything you just said. It now "understands" that the topic is ducks and the format is a poem.
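
To make this concrete, here is a minimal sketch of the tokenization step using the open-source tiktoken library. This is one real tokenizer among many; exact token boundaries and IDs vary from model to model.

import tiktoken  # OpenAI's open-source tokenizer library

enc = tiktoken.get_encoding("cl100k_base")      # an encoding used by several GPT models

prompt = "Write a poem about a duck"
token_ids = enc.encode(prompt)                  # a list of integer token IDs
pieces = [enc.decode([t]) for t in token_ids]   # the text each ID maps back to

print(token_ids)  # the numbers the model actually sees
print(pieces)     # something like: ['Write', ' a', ' poem', ' about', ' a', ' duck']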

2. Step 2: The First Token Prediction

The goal of this first pass is to predict exactly one token.

  • The model produces a probability score for every token in its vocabulary and, in the simplest case, picks the one with the highest score (sketched below). Here, it decides that the most likely first word of a duck poem is "In".
  • It outputs "In". That is the entire result of this pass: one token.
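
Under the hood, "deciding" means turning a vector of raw scores (logits) into probabilities with a softmax and then choosing among them. This toy sketch uses a made-up five-word vocabulary and made-up scores; a real model scores tens of thousands of tokens per pass, and greedy argmax is only the simplest selection strategy:

import numpy as np

vocab = ["In", "The", "A", "Quack", "Once"]       # a toy 5-token vocabulary
logits = np.array([3.1, 2.4, 1.8, 0.9, 1.2])      # made-up scores for illustration

probs = np.exp(logits) / np.sum(np.exp(logits))   # softmax: scores -> probabilities
first_token = vocab[int(np.argmax(probs))]        # greedy: take the highest

print(dict(zip(vocab, np.round(probs, 3))))       # {'In': 0.493, 'The': 0.245, ...}
print("First token:", first_token)                # -> In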

3. Step 3: The Autoregressive Loop

This is the most important part of inference. To get the second token, the model needs to know what the first token was.

  1. The model takes your original prompt: "Write a poem about a duck"
  2. It appends its own output: "In"
  3. It feeds the entire combined string back through the model from the beginning.
  4. It predicts the next token: "the".

This repeats over and over:

  Prompt -> "In"
  Prompt + "In" -> "the"
  Prompt + "In the" -> "reeds"

graph LR
    P["Original Prompt"] --> L1["Model Run 1"]
    L1 --> T1["Token 1"]
    T1 --> P2["Prompt + T1"]
    P2 --> L2["Model Run 2"]
    L2 --> T2["Token 2"]
    T2 --> P3["Prompt + T1 + T2"]
    P3 --> Loop["...repeat until STOP token..."]
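
To make the loop concrete, here is a toy Python sketch. The next_token function is a hypothetical stand-in for a full model forward pass; it just replays a scripted poem so the loop structure stays visible:

# A toy autoregressive loop. `next_token` stands in for a real model call.
POEM = ["In", "the", "reeds", "a", "duck", "<STOP>"]

def next_token(context):
    """Hypothetical model call: read the whole context, return one token."""
    generated_so_far = len(context) - 1   # everything after the prompt
    return POEM[generated_so_far]

prompt = "Write a poem about a duck"
context = [prompt]                        # the model always re-reads everything

while True:
    token = next_token(context)           # Model Run N: predict exactly one token
    if token == "<STOP>":                 # a special token that ends generation
        break
    context.append(token)                 # feed the output back in as input

print(" ".join(context[1:]))              # -> In the reeds a duck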

4. Why Inference feels "Streaming"

Have you noticed how ChatGPT writes one word at a time, as if it were typing? That's not just a visual effect; it's exactly how the model works. Because it needs its own previous token to calculate the next one, it cannot produce the whole paragraph at once. The output has to arrive as a stream.
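
This is also why text-generation APIs naturally hand out tokens one at a time. A toy sketch, with time.sleep standing in for the delay of one model pass:

import time

def generate_stream(tokens):
    """Yield tokens one by one, as an autoregressive model produces them."""
    for token in tokens:
        time.sleep(0.2)                  # stands in for one forward pass
        yield token                      # hand the token out immediately

for tok in generate_stream(["In", " the", " reeds", " a", " duck"]):
    print(tok, end="", flush=True)       # the "typing" effect you see in chat UIs
print()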


5. Summary: The Generation Engine

Inference is an autoregressive loop. The model isn't "retrieving" an answer; it is constructing one, token by token, along the statistical paths it learned during training.


Lesson Exercise

Goal: Act like an Autoregressive Model.

  1. Use this prompt: "Once upon a..."
  2. Select the most likely next word (e.g., "time").
  3. Now use "Once upon a time" to select the next word (e.g., "there").
  4. Do this for 5 more words.

Observation: Notice how each word you chose narrowed down the possibilities for the next one. This is the "path" the model follows.


What’s Next?

In Lesson 2, we explore how we control the "Variety" of these paths. We'll learn about Sampling Strategies—the math that decides whether the model picks the #1 most likely word or takes a creative risk with the #2 or #3 option.
