How Tokenizers Work (Byte-Pair Encoding)

Master the bridge between text and numbers. Understand the Byte-Pair Encoding (BPE) algorithm and how it defines a model's 'Vocabulary'.

How Tokenizers Work: The Numbers under the Words

Computers do not understand words. They do not understand "Cat" or "Run" or "JSON." They only understand numbers (integers). The process of turning your text into a list of numbers is called Tokenization.

If you misunderstand tokenization during fine-tuning, your model will be "Alphabetically Blind." It might see the word Aspirin not as a single concept but as several fragments that carry no meaning on their own. To be a fine-tuning expert, you must understand the algorithm that powers almost every modern LLM: Byte-Pair Encoding (BPE).

In this lesson, we will demystify BPE and explain why "Tokens" are not the same as "Words."


1. What is a Token?

A token can be a whole word, a part of a word (a subword), or even a single character.

  • Common Word: "The" -> 1 token.
  • Rare Word: "Anthropocentrism" -> 4 tokens (An, thro, po, centrism).
  • Punctuation: "!" -> 1 token.

The Token-to-Word Rule of Thumb

In English, 1,000 tokens is roughly equal to 750 words.
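
You can check both of these claims (the per-word splits above and the rough token-to-word ratio) with a few lines of Python. The sketch below assumes the cl100k_base encoding from tiktoken; other encodings split differently, so treat the exact counts above as illustrative.

import tiktoken

# A quick sanity check of the examples above.
# cl100k_base is an assumption; other encodings will split differently.
enc = tiktoken.get_encoding("cl100k_base")

for word in ["The", "Anthropocentrism", "!"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")

# Rough check of the "1,000 tokens is roughly 750 words" rule of thumb.
sample = (
    "Fine-tuning adapts a pretrained language model to a narrower task "
    "by continuing training on a smaller, domain-specific dataset."
)
n_words = len(sample.split())
n_tokens = len(enc.encode(sample))
print(f"{n_words} words -> {n_tokens} tokens "
      f"({n_tokens / n_words:.2f} tokens per word)")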


2. Byte-Pair Encoding (BPE) Explained

Why don't we just give every word in the dictionary a number?

  1. The Vocabulary Explosion: There are millions of words. The model's "Output Layer" would have to be millions of units wide, which is computationally prohibitive.
  2. Out-of-Vocabulary (OOV): If the model saw a word that is not in its dictionary (for example, a made-up name like ShShell), a word-level tokenizer would have no way to represent it.

BPE solves this by starting with characters and iteratively merging the most frequent pairs of tokens.

The BPE Algorithm:

  1. Start with every character as a token.
  2. Count which two tokens appear next to each other most often in the training data (e.g., e and r making er).
  3. Merge them into a new, single token (er).
  4. Repeat until you reach a target Vocabulary Size (e.g., 32,000 or 128,000 tokens).

graph TD
    A["'low', 'lower', 'newest', 'widest'"] --> B["Split: l, o, w, e, r, n, ..."]
    B --> C["Merge 'e' + 's' -> 'es'"]
    C --> D["Merge 'es' + 't' -> 'est'"]
    D --> E["Final Vocab: [l, o, w, er, est, n, ...]"]
    
    subgraph "Compression Logic"
    C
    D
    end
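
To make the merge loop concrete, here is a minimal, educational sketch of BPE on the toy corpus from the diagram above. The </w> end-of-word marker and the word frequencies are assumptions for illustration; production tokenizers (tiktoken, SentencePiece) use heavily optimized byte-level implementations.

from collections import Counter

# Toy corpus from the diagram above: each word is a tuple of symbols,
# with an assumed </w> end-of-word marker and made-up frequencies.
corpus = Counter({
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
})

def most_frequent_pair(corpus):
    # Step 2: count adjacent token pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def apply_merge(corpus, pair):
    # Step 3: replace every occurrence of the pair with one merged token.
    merged = Counter()
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] += freq
    return merged

# Step 4: repeat. A real tokenizer loops until it reaches the target vocab size.
for step in range(5):
    pair = most_frequent_pair(corpus)
    corpus = apply_merge(corpus, pair)
    print(f"Merge {step + 1}: {pair[0]} + {pair[1]} -> {pair[0] + pair[1]}")

Running it prints merges like e + s -> es and es + t -> est, matching the diagram. Every merge adds one new entry to the vocabulary, which is how BPE trades vocabulary size against compression.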

3. Why Tokenization Matters for Fine-Tuning

During fine-tuning, you are working with a model that has a Fixed Vocabulary. You cannot add new tokens to a model like GPT-4 or Llama 3 without massive retraining.
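
You can inspect the size of that frozen vocabulary directly. The sketch below assumes the tiktoken encoding used for GPT-4o; substitute your own model's tokenizer if you are fine-tuning an open model.

import tiktoken

# The vocabulary is fixed when the tokenizer is trained; inspect its size.
enc = tiktoken.encoding_for_model("gpt-4o")  # resolves to the o200k_base encoding
print(enc.n_vocab)  # roughly 200,000 entries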

The "Subword" Trap

If you are fine-tuning on a specialized medical domain with the word Xylometazoline:

  • The tokenizer might split it into: [X, y, l, ome, taz, oline]
  • The model has to learn the "Relationship" between these 6 fragments to understand the drug.
  • Fine-Tuning Impact: You need more examples of this word for the model to "Glue" those tokens together in its weights.

Implementation: Visualizing Tokens in Python

Using the tiktoken library (from OpenAI) or the transformers library (from Hugging Face), we can see exactly how our training data is being "Sliced."

import tiktoken

# 1. Load the tokenizer for GPT-4
enc = tiktoken.encoding_for_model("gpt-4o")

# 2. Tokenize a technical sentence
text = "Fine-tuning with LoRA optimizes VRAM usage."
tokens = enc.encode(text)

print(f"Text: {text}")
print(f"Token IDs: {tokens}")

# 3. Decode back to see the 'Subword' splits
for t in tokens:
    print(f"ID {t} -> '{enc.decode([t])}'")

# Example Output:
# ID 2577 -> 'Fine'
# ID 12 -> '-'
# ID 3450 -> 'tuning'
# ID 449 -> ' with'
# ID 22678 -> ' Lo'
# ID 64 -> 'RA'

Notice how LoRA was split into Lo and RA. This is why fine-tuning is needed: to teach the model that when Lo and RA appear together in this context, they refer to a specific technique, Low-Rank Adaptation.
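
If you prefer the Hugging Face transformers route mentioned above, a similar inspection is sketched below. The gpt2 checkpoint is chosen only because it is a small, public BPE tokenizer; its vocabulary differs from GPT-4o's, so the splits will not match the example output above.

from transformers import AutoTokenizer

# Same idea with Hugging Face transformers. "gpt2" is an illustrative choice;
# swap in the tokenizer of the model you are actually fine-tuning.
tok = AutoTokenizer.from_pretrained("gpt2")

text = "Fine-tuning with LoRA optimizes VRAM usage."
pieces = tok.tokenize(text)                 # subword strings ("Ġ" marks a leading space)
ids = tok.convert_tokens_to_ids(pieces)

for piece, token_id in zip(pieces, ids):
    print(f"ID {token_id} -> '{piece}'")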


Summary and Key Takeaways

  • Tokenization is the conversion of text into integers.
  • BPE (Byte-Pair Encoding) is the algorithm that balances vocabulary size and compression.
  • Subwords: Most technical terms are split into multiple tokens.
  • Vocabulary is "Frozen": You can't add new words to a pretrained model's tokenizer easily.
  • Impact: Understanding how your domain data is tokenized helps you understand how much training data you need.

In the next lesson, we will look at the architecture of the Vocabulary and Special Tokens, focusing on the "Control Tokens" that manage the conversation.


Reflection Exercise

  1. Use a token counter online (or the local check shown after this list). Type the word aaaaa vs a a a a a. Which one generates more tokens? Why?
  2. If the phrase "Agentcore is the future" is split into 5 tokens, how many "Number Updates" must the model perform during backpropagation for that one sentence?
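
For Exercise 1, you can also run the check locally instead of using an online counter. The snippet below assumes the cl100k_base encoding; counts vary by encoding.

import tiktoken

# Local version of Exercise 1.
enc = tiktoken.get_encoding("cl100k_base")

for s in ["aaaaa", "a a a a a"]:
    ids = enc.encode(s)
    print(f"{s!r}: {len(ids)} tokens -> {[enc.decode([i]) for i in ids]}")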

