Module 4 Lesson 5: Tokenization
The bridge between words and numbers. How LLMs translate your typing into something a computer can process.
Tokenization: How AI Reads
Computer "Brains" do not understand words. They only understand numbers. If you send the word "Hello" to an LLM, the first step is to turn that string into a list of integers. This process is called Tokenization.
1. Why not use Characters?
If we processed one character at a time ('H', 'e', 'l', 'l', 'o'), the model would have to learn which letters tend to follow which before it could learn anything about meaning. It would spend all its brainpower learning how to spell rather than learning how to think, and every sentence would turn into a very long sequence of tiny steps.
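For contrast, here is what a pure character-level scheme looks like in plain Python, using Unicode code points as stand-in token IDs (just an illustration, not how real tokenizers assign numbers):

```python
# Character-level "tokenization": every single letter becomes its own number.
text = "Hello"
char_ids = [ord(c) for c in text]
print(char_ids)  # [72, 101, 108, 108, 111] -- five steps for one short word
```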
2. Why not use whole Words?
If we gave every word its own number ("Apple" = 1, "Apples" = 2), the dictionary would be millions of entries long, and near-identical words would share nothing. Every time a new slang word or emoji was invented, the model would have no way to represent it.
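A toy sketch of why whole-word vocabularies are brittle; the vocabulary and the unseen word here are made up purely for illustration:

```python
# A tiny word-level vocabulary: every surface form needs its own entry.
vocab = {"apple": 1, "apples": 2, "run": 3, "running": 4}

def encode_words(words):
    # Anything missing from the dictionary simply cannot be represented.
    return [vocab.get(w) for w in words]

print(encode_words(["apple", "apples"]))  # [1, 2]
print(encode_words(["rizz"]))             # [None] -- new slang has no number at all
```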
3. The "Subword" Solution: BPE
Most modern models use Byte Pair Encoding (BPE). BPE builds its vocabulary by starting from single characters (or bytes) and repeatedly merging the pairs that appear most often in its training text.
- Common words get their own token (e.g., "the").
- Uncommon words are broken into pieces (e.g., "antigravity" -> "anti" + "gravity").
- Suffixes like "-ing", "-ed", and "-ly" are often their own tokens.
This is why an LLM can understand words it has never seen before, as long as it understands the "pieces" of that word. The toy sketch below shows how those pieces get learned.
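Here is a deliberately tiny version of the BPE merge loop, written from the description above. Real tokenizers work on raw bytes over enormous corpora, but the core idea (count adjacent pairs, merge the most frequent, repeat) is the same:

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn a handful of BPE merges from a toy corpus (illustration only)."""
    # Start with every word split into individual characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols appears.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # the most frequent pair wins
        merges.append(best)
        # Rewrite every word, gluing the winning pair into one new symbol.
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

# Frequent pieces like "low", "es", and "est" emerge after just a few merges.
print(learn_bpe_merges(["low", "lower", "lowest", "newest", "widest"], 5))
```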
4. Tokenization and Language
This is where the "English Bias" of AI comes from.
- English: A 1,000-word essay might be 1,300 tokens.
- Japanese/Hindi: A 1,000-word essay might be 4,000 tokens, because the tokenizer's vocabulary was built mostly from English text, so it has few merged tokens for those scripts and has to fall back on tiny fragments (see the token-count sketch below).
Local AI Tip: If you are using a model for a non-English language, check whether it ships a "Multi-lingual" tokenizer (Gemma 2 and Llama 3 both use large vocabularies that cover more scripts). It will be much more token-efficient, which means more of your text fits in the context window and less memory is spent per sentence.
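To see the gap for yourself, count the tokens for roughly equivalent sentences. This sketch uses tiktoken again purely as an example; the exact numbers depend on which tokenizer you load, and a genuinely multilingual tokenizer will show a much smaller gap:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Roughly equivalent sentences; token counts will vary by tokenizer.
samples = {
    "English":  "The weather is very nice today.",
    "Japanese": "今日はとても良い天気ですね。",
    "Hindi":    "आज मौसम बहुत अच्छा है।",
}
for language, text in samples.items():
    print(f"{language}: {len(enc.encode(text))} tokens")
```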
5. Tokenization Errors!
Have you ever asked an AI: "How many 'r's are in the word strawberry?"
And it says "two" (it's three).
This is because of the tokenizer! The model doesn't see "s-t-r-a-w-b-e-r-r-y". It sees something like:
[straw] [berry]
Since the model never sees the individual letters, it has to "guess" how many 'r's are hidden inside those tokens. This is a fundamental limitation of how the text is tokenized, not of the model's intelligence.
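You can inspect what the model actually receives. Another tiktoken sketch; the exact split of "strawberry" varies from tokenizer to tokenizer, but it will almost never be one letter per token:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("strawberry")
pieces = [enc.decode([i]) for i in ids]
print(pieces)  # a few multi-letter chunks -- the individual 'r's are buried inside them
```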
Key Takeaways
- Tokenization is the process of splitting text into chunks and mapping each chunk to a number.
- Most models use Subword tokenization (BPE) to handle new or complex words.
- Tokenizers determine how efficiently a model handles different languages.
- Many "hallucinations" about spelling or counting are actually Tokenization errors, not intelligence errors.