
How Generative AI Works: LLMs, Transformers, and Foundation Models
A non-technical deep dive into the engine of Generative AI. We explain Large Language Models (LLMs), why Tokens matter more than words, and how the Transformer architecture changed everything.
Under the Hood: How the Magic Happens
In the previous lesson, we established that Generative AI is a subset of Deep Learning designed to create new content. But how does it actually do that? How does a computer program "write" a Shakespearean sonnet or "draw" a photo-realistic astronaut?
The answer lies in three key concepts: Foundation Models, Transformers, and next-token prediction.
In this lesson, we will dissect the machinery of Generative AI without getting bogged down in complex calculus. We will cover the vocabulary you need to sound like an expert: LLMs, Tokens, Parameters, and Temperature.
1. The Paradigm Shift: Foundation Models
Before 2018, if you wanted an AI to translate languages, you trained a "Translation Model." If you wanted it to summarize text, you trained a "Summarization Model." Each model was a specialist, trained on specific data for a specific task.
Then came the concept of the Foundation Model.
A Foundation Model is a single, massive AI model trained on a vast amount of data (often large swaths of the public internet) using self-supervision. It is not trained to do just one thing; it is trained to understand patterns effectively enough to be adapted to many things.
- Analogy:
- Traditional AI: A factory worker trained specifically to tighten one bolt on a car assembly line.
- Foundation Model: A highly educated scholar who has read every book in the library. With a little bit of extra instruction, this scholar can write poetry, solve math problems, or translate French.
Large Language Models (LLMs)
An LLM is a specific type of Foundation Model trained on text. Examples include Google's Gemini, PaLM, and OpenAI's GPT.
- "Large": Refers to the number of parameters (often billions).
- "Language": Refers to the data it was trained on.
- "Model": The mathematical representation of the patterns.
2. Key Concepts: The Vocabulary of LLMs
To manage GenAI projects, you need to speak the language. There are three critical terms: Tokens, Parameters, and Transformers.
A. Tokens (Not Words)
LLMs do not read "words" like humans do. They read "tokens." A token can be a word, part of a word, or even a single character.
- General Rule: 1000 tokens ≈ 750 words.
- Why it matters: You pay for Generative AI by the token (both input and output). Understanding token efficiency is a direct lever on your project's P&L.
Code Example: Visualizing Tokens
Let's use Python to see how an AI sees a sentence. Note that common words are single tokens, but complex words are broken up.
# Conceptual tokenization logic.
# (In reality, models use learned tokenizers such as SentencePiece or tiktoken,
# which split text into sub-word pieces by frequency, not by spaces.)
def simple_tokenize(text):
    tokens = []
    for word in text.split():
        if len(word) > 8:
            # Long or rare words are usually broken into sub-word pieces,
            # e.g. "fascinating" -> "fascin" + "ating".
            mid = len(word) // 2
            tokens.extend([word[:mid], word[mid:]])
        else:
            tokens.append(word)
    return tokens

input_text = "Generative AI is revolutionary."
print(simple_tokenize(input_text))
# ['Gener', 'ative', 'AI', 'is', 'revolut', 'ionary.']
# A real model then maps each token to a numeric ID (e.g. [1942, 12838, 318, 12938])
# and processes those numbers, not the letters.
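If you want to see how a production tokenizer actually splits text, the open-source tiktoken library (used by OpenAI models) exposes this directly; Google models use their own tokenizers, so counts may differ slightly. The price used below is a made-up figure purely to show the cost arithmetic, not a real rate.

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("Generative AI is revolutionary.")

print(len(token_ids))         # number of tokens you would be billed for as input
print(enc.decode(token_ids))  # decoding the IDs round-trips back to the text

# Back-of-the-envelope cost estimate (hypothetical price, for illustration only).
price_per_1k_tokens = 0.001  # dollars
print(f"Estimated input cost: ${len(token_ids) * price_per_1k_tokens / 1000:.6f}")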
B. Parameters (The Brain Cells)
Parameters are the internal variables that the model learns during training. They represent the "weights," or the strength of the connections between concepts.
- More parameters generally mean the model is "smarter" and can reason about more complex topics.
- Trade-off: More parameters = more expensive to run (slower, and requires more GPU memory); see the back-of-the-envelope sketch after this list.
- Gemini Hierarchy:
- Nano: Fewer parameters (runs on a phone).
- Ultra: Orders of magnitude more parameters (runs in a massive data center).
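To make the trade-off concrete, here is a rough sketch of why parameter count drives hardware cost. The model sizes and 16-bit precision below are illustrative assumptions, not published figures for any specific model.

# Rough memory footprint: parameters x bytes per parameter (16-bit = 2 bytes).
# Model sizes are illustrative assumptions, not official figures.
def memory_gb(num_parameters, bytes_per_param=2):
    return num_parameters * bytes_per_param / 1e9

print(f"Phone-sized model (~3B params):   {memory_gb(3e9):.0f} GB just for weights")
print(f"Data-center model (~500B params): {memory_gb(500e9):.0f} GB just for weights")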
C. The Transformer (The Engine)
The "T" in ChatGPT stands for Transformer. This is the neural network architecture invented by Google researchers in the 2017 paper "Attention Is All You Need."
Before Transformers, models such as recurrent neural networks read sentences sequentially, one word at a time, and often "forgot" the beginning of a long sentence by the time they reached the end.
Transformers introduced Self-Attention. This allows the model to look at all words in a sentence at the same time and understand the relationship between keywords, regardless of their distance.
graph TD
Input["Input: The bank of the river"] --> Embedding["Convert to Numbers"]
Embedding --> Attention["Self-Attention Mechanism"]
Attention --> Context[" Context: Bank refers to land, not money"]
Context --> FF["Feed Forward Network"]
FF --> Output["Output Prediction"]
style Attention fill:#FFD700,stroke:#333,stroke-width:2px,color:#000
In the diagram above, the "Self-Attention" mechanism realizes that the word "Bank" is associated with "River", so it assigns the meaning of "River Bank" rather than "Financial Bank."
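For the curious, here is a minimal numerical sketch of that idea: scaled dot-product attention over five toy token vectors standing in for "The bank of the river." It is a bare-bones illustration, not the full Transformer (which adds learned projection matrices, multiple attention heads, and many stacked layers).

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Raw similarity score between every pair of tokens.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax turns each row of scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each token's new representation is a weighted blend of all value vectors.
    return weights @ V

# Toy setup: 5 tokens ("The", "bank", "of", "the", "river"),
# each represented here by a random 4-dimensional vector.
np.random.seed(0)
x = np.random.randn(5, 4)
contextual = scaled_dot_product_attention(x, x, x)
print(contextual.shape)  # (5, 4): one context-aware vector per token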
3. How It Generates: Next-Token Prediction
At its core, a Generative AI model is just a really, really advanced autocomplete engine. It predicts the most probable next token based on the tokens that came before it.
The Prompt: "The sky is"
The Model's Brain:
- Analyzes "The sky is".
- Self-Attention looks at context.
- Calculates probabilities for the next word:
- "blue": 85%
- "cloudy": 10%
- "falling": 0.01%
- "green": 0.001%
- Selection: It picks "blue" (usually).
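A toy version of that selection step looks like the snippet below; the probabilities are the illustrative numbers from the list above, not output from a real model.

# Illustrative next-token probabilities for the prompt "The sky is".
next_token_probs = {"blue": 0.85, "cloudy": 0.10, "falling": 0.0001, "green": 0.00001}

# Greedy decoding: always take the single most probable token.
greedy_choice = max(next_token_probs, key=next_token_probs.get)
print(f"The sky is {greedy_choice}")  # The sky is blue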
Creativity comes from "Temperature"
If the model always picked the highest-probability word, it would be boring and repetitive. Temperature is a setting (commonly 0.0 to 1.0, though some APIs allow values above 1) that controls randomness.
- Low Temperature (0.1): Always picks the most likely word. Good for coding, math, and factual answers.
- High Temperature (0.9): Sometimes picks lower probability words. Good for poetry, brainstorming, and creative writing.
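The sketch below shows one common way temperature is applied: dividing the log-probabilities by the temperature before re-normalizing and sampling. The numbers are the same illustrative ones as above; real APIs handle this internally, and you simply set the temperature parameter.

import math
import random

def sample_with_temperature(probs, temperature):
    # Rescale the distribution: low temperature sharpens it (favors the top
    # token), high temperature flattens it (gives unlikely tokens a chance).
    logits = {tok: math.log(p) / temperature for tok, p in probs.items()}
    total = sum(math.exp(l) for l in logits.values())
    rescaled = {tok: math.exp(l) / total for tok, l in logits.items()}
    tokens, weights = zip(*rescaled.items())
    return random.choices(tokens, weights=weights)[0]

probs = {"blue": 0.85, "cloudy": 0.10, "falling": 0.0001, "green": 0.00001}
print(sample_with_temperature(probs, 0.1))  # almost always "blue"
print(sample_with_temperature(probs, 0.9))  # occasionally "cloudy" or rarer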
4. Types of Generative Models
While LLMs are the most famous, the "Generative" landscape allows for different input/output modalities.
A. Text-to-Text
- Input: Text
- Output: Text
- Examples: Google PaLM, LaMDA.
- Use Cases: Summarization, Translation, Chatbots, Code Generation.
B. Text-to-Image
- Input: Text Description
- Output: Pixel Data (Image)
- Examples: Google Imagen, Midjourney.
- Use Cases: Marketing assets, logo design, storyboarding.
C. Multimodal (The New Standard)
- Input: Text, Image, Audio, Video, Code
- Output: Text, Image, Audio, etc.
- Examples: Gemini.
- Revolution: Instead of having separate models for separate tasks, multimodal models can reason across modalities in a single system. You can show Gemini a video of a leaky faucet and ask, "How do I fix this?" It analyzes the video frames (vision) and responds with step-by-step instructions in text.
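As a rough sketch of what that looks like in code, the snippet below uses the google-generativeai Python SDK with a photo instead of a video (video requires an extra file-upload step). The model name, file name, and SDK details are assumptions that change over time, so treat this as illustrative rather than definitive.

import google.generativeai as genai   # pip install google-generativeai
from PIL import Image                  # pip install pillow

genai.configure(api_key="YOUR_API_KEY")            # assumption: you have an API key
model = genai.GenerativeModel("gemini-1.5-flash")  # assumption: model name may change

# One prompt, two modalities: an image plus a text question.
faucet_photo = Image.open("leaky_faucet.jpg")      # hypothetical local file
response = model.generate_content([faucet_photo, "How do I fix this leak?"])

print(response.text)  # the repair advice comes back as plain text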
5. Summary
- Foundation Models changed AI from "Specialized Factory Workers" to "Educated Scholars" that can be adapted to many tasks.
- LLMs process Tokens, not words.
- Parameters are a rough measure of a model's complexity and "intelligence."
- Transformers use Self-Attention to understand context better than any previous architecture.
- Multimodality (Gemini) is the future, combining text, vision, and audio into a single reasoning engine.
In the next lesson, we will move from how it works to how to control it. We will tackle Hallucinations and the art of Prompt Engineering.
Knowledge Check
Why might a business leader care about the 'Context Window' of a Foundation Model?