
Module 5 Lesson 3: Layers and Depth
Why does an LLM need 96 layers? In this lesson, we explore how stacking attention blocks creates a hierarchy of meaning, moving from raw tokens and basic grammar up to abstract reasoning.
In the last lesson, we saw how one "Attention Block" can help a token find its related neighbors. But a single block can only do so much. Modern LLMs are deep, meaning they stack these blocks on top of each other dozens of times, sometimes close to a hundred or more.
Think of it like a factory assembly line. In this lesson, we will explore why more layers lead to "smarter" models.
1. The Assembly Line Analogy
If you are building a car on an assembly line:
- Station 1: Bolting the frame together (the simplest task).
- Station 2: Installing the engine.
- Station 3: Adding the electronic sensors.
- ...
- Station 10: Testing the self-driving AI (the most complex task).
An LLM works the same way. Token representations pass through many layers, and each layer adds a new level of abstraction.
2. What happens in each layer?
While every layer has a similar mathematical structure, their "roles" naturally separate during training:
- Lower Layers (Entry Level): These focus on the basics. They handle grammar, punctuation, and whether a word is a noun or a verb.
- Middle Layers (Mid Level): These begin to connect nearby words. They identify phrases, idioms, named entities, and simple facts.
- Upper Layers (High Level): This is where the "Intelligence" lives. These layers capture abstract concepts, sarcasm, logical contradictions, and the overarching intent of the user.
```mermaid
graph TD
    Input["Tokens"] --> L1["Layer 1: Grammar & Syntax"]
    L1 --> L2["Layer 2: Local Phrases & Entities"]
    L2 --> L3["... more layers ..."]
    L3 --> LN["Layer 96: Abstract Reasoning & Logic"]
    LN --> Output["Final Predicted Token"]
```
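To make the "stack of identical blocks" idea concrete, here is a minimal sketch using PyTorch's built-in encoder classes. The sizes (512-dimensional embeddings, 8 attention heads, 12 layers) are made up for illustration, not taken from any particular model:

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 12  # illustrative sizes, not from a real model

# One "station": self-attention plus a small feed-forward network.
block = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)

# The assembly line: 12 copies of that block, applied one after another.
stack = nn.TransformerEncoder(block, num_layers=n_layers)

tokens = torch.randn(1, 10, d_model)  # a batch with 10 token embeddings
hidden = stack(tokens)                # the signal passes through every layer in order
print(hidden.shape)                   # torch.Size([1, 10, 512])
```

Every layer in the stack has the same structure; the different "roles" described above only emerge once the model is trained.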
3. Why do more layers mean more capability?
A "shallow" model (with 3-4 layers) might be able to summarize a grocery list, but it will struggle to follow a philosophical argument.
Each added layer lets the model perform one more "transformation" on top of what the previous layers produced. For example, to understand a joke, the model roughly has to do the following (sketched in code after the list):
- Understand the words (Layer 1).
- Understand the premise (Layer 20).
- Identify the "unexpected twist" that makes it funny (Layer 60).
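As a toy illustration (plain Python, not a real model), you can think of each layer as a function that transforms whatever the layer below it produced, so adding layers means composing more transformations:

```python
# Toy illustration: each "layer" transforms whatever the previous layer produced.
def make_layer(name):
    return lambda text: f"{name}({text})"

layers = [make_layer("words"), make_layer("premise"), make_layer("twist")]

signal = "joke"
for layer in layers:   # the signal moves up the stack, one transformation per layer
    signal = layer(signal)

print(signal)          # twist(premise(words(joke)))
```

A shallow stack simply runs out of steps: if only the first one or two functions are applied, the "twist" never gets computed.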
4. Residual Connections: The "Elevator" for Data
One problem with stacking 100 layers is that the original signal can get "lost" or blurry by the time it reaches the top. To solve this, Transformers use Residual Connections (or Skip Connections).
Basically, each block says: "Here is the output of the attention layer, but also, here is a copy of the original input, just in case you forgot." Concretely, a block's output is its input plus the change the block computed, so the original signal is carried forward intact, and the gradient (the learning signal used during training) can flow easily through a very deep model without fading away.
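Here is a minimal sketch of that trick (the names `with_residual` and `sublayer` are hypothetical, not any library's API): the block's input is added back onto whatever the sub-layer computes.

```python
import numpy as np

def with_residual(x, sublayer):
    # Skip connection: send a copy of the input around the sub-layer and add it back in.
    return x + sublayer(x)

x = np.random.randn(10, 512)                 # 10 token vectors of width 512 (arbitrary sizes)
out = with_residual(x, lambda h: 0.1 * h)    # stand-in for an attention or feed-forward sub-layer
print(out.shape)                             # (10, 512) -- the original signal rides along unchanged
```

In a real Transformer block this pattern appears twice, once around the attention sub-layer and once around the feed-forward sub-layer, usually together with layer normalization.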
Lesson Exercise
Goal: Model the Hierarchy of Features.
Imagine you are looking at a photo of a face.
- Layer 1 detects simple lines and edges.
- Layer 10 detects shapes (circles, triangles).
- Layer 50 detects eyes, noses, and mouths.
- Layer 100 detects whose face it is.
Observation: LLMs treat text in much the same way. They move from "dots and dashes" to "poetry and code."
Summary
In this lesson, we established:
- Transformers are built by stacking attention blocks into layers.
- Lower layers handle simple syntax; upper layers handle complex reasoning.
- Residual connections allow models to become incredibly deep without losing their way.
Next Lesson: We look at the one thing attention, on its own, doesn't have: a sense of order. We'll learn about Positional Encoding and how the model knows "Dog bites man" is different from "Man bites dog."