Module 2 Lesson 2: Tokenization – The Vocabulary of AI

How do LLMs actually 'read'? They don't see words; they see tokens. Learn how subword tokenization works and why it's the secret sauce of modern AI.


In the last lesson, we learned that computers need numbers. Tokenization is the process of slicing text into small chunks called "Tokens," which can then be mapped to those numbers.

If you understand tokenization, you understand why LLMs struggle with certain jokes, why they have "context limits," and how they manage to handle almost any language on Earth.


1. What exactly is a Token?

A token is the basic unit of text for an LLM. It can be a whole word, a single character, or a piece of a word (a subword).

On average, for modern models like GPT-4, 1 token is about 4 characters, or roughly 0.75 words.
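You can check this ratio yourself with a few lines of Python. The sketch below assumes the open-source tiktoken library and its cl100k_base encoding (used by GPT-4-era models); exact counts will differ from tokenizer to tokenizer:

```python
# pip install tiktoken   (OpenAI's open-source BPE tokenizer)
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

text = "Tokenization is the vocabulary of artificial intelligence."
tokens = encoding.encode(text)

print(f"Characters: {len(text)}")
print(f"Words:      {len(text.split())}")
print(f"Tokens:     {len(tokens)}")  # for English text, usually lands near len(text) / 4
```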

Why not just use whole words?

Imagine if we used every single word in the English language as a unique token. Our "vocabulary" would be millions of items long. It would be impossible for the model to handle typos, rare names, or new words like "TikTok."


2. Subword Tokenization (The Standard)

Modern LLMs use Subword Tokenization (such as BPE - Byte Pair Encoding). This means the model keeps common words as a single token but breaks rare words into several pieces.

Example:

  • The word "High" is 1 token.
  • The word "Highland" might be 2 tokens: ["High", "land"].
  • The word "Highlander" might be 3 tokens: ["High", "land", "er"].

```mermaid
graph TD
    Input["Input: 'Highlander'"] --> T1["Token 1: 'High'"]
    Input --> T2["Token 2: 'land'"]
    Input --> T3["Token 3: 'er'"]
    T1 --> ID1["ID: 1530"]
    T2 --> ID2["ID: 802"]
    T3 --> ID3["ID: 260"]
```
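
The splits and ID numbers above are illustrative; a real tokenizer may slice the word differently and will use its own IDs. Here is a small sketch, again assuming tiktoken and the cl100k_base encoding, that prints the actual pieces and IDs your tokenizer produces:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

for word in ["High", "Highland", "Highlander"]:
    ids = encoding.encode(word)
    pieces = [encoding.decode([i]) for i in ids]  # decode each ID back into its text piece
    print(f"{word!r}: {len(ids)} token(s) -> {pieces} (IDs: {ids})")
```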

The Benefits

  1. Compact Vocabulary: Models only need a vocabulary of roughly 50,000 to 100,000 tokens to represent almost any word in any language.
  2. Handling New Words: If the model sees a new word like "Zorpian," it can break it down into pieces such as ["Zorp", "ian"] based on subword patterns it already knows (the toy sketch below shows how those patterns are learned).
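
To see where these subword pieces come from in the first place, here is a toy sketch of the core BPE training loop: count which adjacent symbol pairs occur most often, then merge the winner into a new vocabulary entry. The corpus and its frequencies are invented purely for illustration:

```python
from collections import Counter

# A tiny invented corpus: each word is a tuple of symbols plus an end-of-word marker,
# mapped to how often it appears. Frequencies are made up for illustration.
corpus = {
    ("h", "i", "g", "h", "</w>"): 10,
    ("h", "i", "g", "h", "l", "a", "n", "d", "</w>"): 5,
    ("h", "i", "g", "h", "l", "a", "n", "d", "e", "r", "</w>"): 3,
}

def best_pair(corpus):
    """Find the adjacent symbol pair that occurs most often, weighted by word frequency."""
    counts = Counter()
    for word, freq in corpus.items():
        for pair in zip(word, word[1:]):
            counts[pair] += freq
    return counts.most_common(1)[0][0]

def merge(corpus, pair):
    """Fuse every occurrence of `pair` into a single new symbol."""
    new_corpus = {}
    for word, freq in corpus.items():
        merged, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                merged.append(word[i] + word[i + 1])
                i += 2
            else:
                merged.append(word[i])
                i += 1
        new_corpus[tuple(merged)] = freq
    return new_corpus

# Each merge adds one entry to the vocabulary; frequent sequences like "high" fuse first.
for step in range(4):
    pair = best_pair(corpus)
    corpus = merge(corpus, pair)
    print(f"merge {step + 1}: {pair[0]} + {pair[1]} -> {pair[0] + pair[1]}")
```

Real tokenizers run tens of thousands of these merges over huge corpora, which is how frequent words end up as single tokens while rare or invented words fall back to smaller pieces.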

3. Why Tokenization Matters for You

Tokenization isn't just a hidden technical step; it has real-world consequences:

  • Pricing: Almost every AI API (OpenAI, Anthropic) charges you based on the number of tokens, not the number of words (see the cost sketch after this list).
  • Math & Spelling: Ever notice an LLM can't count the letters in a word correctly? That's because it doesn't "see" individual letters; it sees token IDs, and a single ID can stand for several characters at once.
  • Latency: The model generates text one token at a time, so the more tokens in your output, the longer it takes.
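
Here is a rough way to estimate what a prompt will cost before you send it. The per-token price below is a made-up placeholder, not a real rate; check your provider's pricing page, and remember that output tokens are billed too (usually at a higher rate):

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

prompt = "Summarize the plot of Highlander in three sentences."
input_tokens = len(encoding.encode(prompt))

# Hypothetical price, purely for illustration. NOT a real rate for any provider.
PRICE_PER_MILLION_INPUT_TOKENS = 2.50  # USD

estimated_cost = input_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS
print(f"{input_tokens} input tokens ~= ${estimated_cost:.6f}")
```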

4. Hands-on Experiment

The Tokenizer Tool: Go to the OpenAI Tokenizer (or script the same experiment, as sketched after this list).

  1. Type your full name. Note how many tokens it uses.
  2. Type a very long word like "Antidisestablishmentarianism." See how it's sliced into pieces.
  3. Try a sentence with lots of emojis. Emojis usually take 2-4 tokens each!
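
If you prefer to script the experiment, the sketch below (again assuming tiktoken) produces the same kind of counts as the web tool:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

samples = [
    "Ada Lovelace",                   # a name (yours will differ)
    "Antidisestablishmentarianism",   # a very long word
    "Great job! 🎉🚀🔥",               # a sentence with emojis
]

for text in samples:
    ids = encoding.encode(text)
    pieces = [encoding.decode([i]) for i in ids]
    # An emoji's bytes can span several tokens, so individual pieces may print as '�'.
    print(f"{text!r}: {len(ids)} tokens -> {pieces}")
```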

Summary

In this lesson, we covered:

  • Tokens are the "chunks" LLMs use to process text.
  • Subword tokenization allows models to handle infinite combinations with a limited vocabulary.
  • Token counts determine cost, speed, and sometimes accuracy.

Next Lesson: We look at the "container" that holds these tokens: The Context Window. We'll learn why models eventually "forget" the beginning of a long conversation.
