Module 3 Lesson 2: How Embeddings Are Learned


Embeddings aren't created by humans; they are learned by machines. In this lesson, we look at the intuition behind how LLMs build their conceptual map of the world.


How does a computer "know" where to put "Apple" on its conceptual map? No human sat down and typed in: "Apple belongs at coordinates X, Y, Z."

Instead, the model learned these positions by reading billions of pages of text. In this lesson, we will explore the intuition behind this learning process, specifically the concept of Distributional Semantics.


1. "You Shall Know a Word by the Company it Keeps"

This is the golden rule of embeddings. To figure out what a word means, the model looks at the words that usually appear around it.

Example:

  1. "The king sat on the throne."
  2. "The queen sat on the throne."
  3. "The child sat on the chair."

The model notices that "King" and "Queen" both appear next to "Throne" and "Sat on." It also notices that "Throne" and "Chair" appear after "Sat on."

Because "King" and "Queen" share a very similar "neighborhood" of neighboring words, the model mathematically drags their vectors closer together.


2. The Next-Word Prediction Game

Remember, the core objective of an LLM is to predict the next token. If the model incorrectly predicts that "The King sat on the Cloud," it realizes it made an error because its training data said "Throne."

To fix this error, the model slightly adjusts the numbers (the weights) in its embedding layer. Over billions of these tiny adjustments, the model slowly "perfects" the map so that its predictions become more and more accurate.

graph LR
    Sentence["'The King sat on the ...'"] --> Prediction["AI Predicts: 'Cloud' (Wrong)"]
    Prediction --> Error["Loss Function calculates error"]
    Error --> Adjust["Adjustment: Move 'King' closer to 'Throne'"]
    Adjust --> NewMap["Updated Embedding Space"]
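Below is a deliberately oversimplified sketch of that adjustment loop. In a real model the update direction comes from the gradient of the next-token loss flowing back through many layers; this toy version (invented vocabulary, 3-dimensional vectors, a hard-coded "pull closer" rule) only illustrates the nudge-and-repeat intuition:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny made-up embedding table: 4 words, 3 dimensions (illustrative only).
vocab = ["king", "throne", "cloud", "chair"]
emb = {w: rng.normal(size=3) for w in vocab}

def nudge(word, context_word, lr=0.1):
    """One simplified update: pull `word` slightly toward `context_word`.

    Real LLMs compute this direction from the next-token prediction loss;
    here the "move closer" rule is hard-coded to show the idea.
    """
    emb[word] += lr * (emb[context_word] - emb[word])

def distance(a, b):
    return np.linalg.norm(emb[a] - emb[b])

print("before:", distance("king", "throne"))
for _ in range(20):               # billions of such steps in a real model
    nudge("king", "throne")
print("after: ", distance("king", "throne"))  # smaller: the vectors moved closer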

3. Beyond Simple Definitions

What makes modern LLM embeddings so powerful is that they capture relationships that go well beyond simple synonyms. They capture:

  • Hierarchies: Knowing a "Poodle" is a type of "Dog."
  • Attributes: Knowing that "Fire" is "Hot."
  • Relationships: The classic example is King - Man + Woman ≈ Queen.

Because the model has seen millions of mentions of Kings being related to Men and Queens being related to Women, the mathematical "offset" between King and Man ends up similar to the offset between Queen and Woman.
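Here is a tiny numeric sketch of that analogy, using hand-picked 2D vectors (one axis loosely meaning "royalty", the other "gender"). Real embeddings have hundreds of learned dimensions and are never this clean, but the arithmetic works the same way:

```python
import numpy as np

# Hand-picked toy vectors: [royalty, gender]. Invented for illustration only.
emb = {
    "king":  np.array([0.9,  0.7]),
    "queen": np.array([0.9, -0.7]),
    "man":   np.array([0.1,  0.7]),
    "woman": np.array([0.1, -0.7]),
    "apple": np.array([-0.5, 0.0]),
}

def nearest(vec, exclude=()):
    """Return the vocabulary word whose vector is closest to `vec`."""
    candidates = {w: v for w, v in emb.items() if w not in exclude}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - vec))

result = emb["king"] - emb["man"] + emb["woman"]
print(nearest(result, exclude={"king", "man", "woman"}))  # -> "queen"
```

Subtracting "man" removes the gender component, adding "woman" puts the other one back, and the nearest remaining vector is "queen".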


4. Why Contextual Embeddings (Like GPT) are Better

Older systems (like Word2Vec) had one fixed vector for every word. But a single word can have several different meanings!

  • "I am at the river bank." (Financial?) No.
  • "I am going to the bank to deposit money."

Modern Transformers create Contextual Embeddings. They start with a baseline vector and then "nudge" it around based on the other words in the sentence. This allows the model to understand that the "Bank" in the first sentence is different from the "Bank" in the second.
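If you want to see this yourself, the sketch below uses the Hugging Face transformers library with bert-base-uncased (neither is part of this lesson; any transformer encoder would do) to pull out the vector for "bank" in each sentence. The exact similarity numbers depend on the model, but the two financial uses typically land closer together than the river-bank use:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the contextual embedding of the token 'bank' in `sentence`."""
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # one vector per token
    tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

river   = bank_vector("I am at the river bank.")
deposit = bank_vector("I am going to the bank to deposit money.")
loan    = bank_vector("The bank approved my loan application.")

cos = torch.nn.functional.cosine_similarity
print(cos(river, deposit, dim=0))  # typically lower: different senses of "bank"
print(cos(deposit, loan, dim=0))   # typically higher: both financial
```

The same string, "bank", produces two different vectors because the surrounding words push it toward different regions of the map.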


Lesson Exercise

The Neighborhood Test:

  1. Take the word "Coffee."
  2. List 5 words that frequently appear near it (e.g., Mug, Morning, Bean, Sugar, Drink).
  3. Take the word "Software."
  4. List 5 words near it (e.g., Debug, Code, App, User, Python).

Observation: Notice how there is almost zero overlap. This is exactly how the model keeps these concepts in different zip codes on its map!
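If you'd like to check your own lists programmatically, here is a quick sketch using the example words above:

```python
# The example neighbor lists from the exercise (swap in your own).
coffee_neighbors   = {"mug", "morning", "bean", "sugar", "drink"}
software_neighbors = {"debug", "code", "app", "user", "python"}

shared = coffee_neighbors & software_neighbors
jaccard = len(shared) / len(coffee_neighbors | software_neighbors)
print(shared)   # set() -- no shared neighbors
print(jaccard)  # 0.0   -- "Coffee" and "Software" live in different zip codes
```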


Summary

In this lesson, we learned:

  • Models learn meaning by observing word neighborhoods (co-occurrence).
  • Embeddings are refined through the "Next Word Prediction" error-correction loop.
  • Modern embeddings are dynamic and change based on the surrounding context of the sentence.

Next Lesson: We'll put this into practice. How do we actually use these vectors for features like AI Search and Retrieval-Augmented Generation (RAG)?
