The DNA of LLMs: Tokenization, Embeddings, and Attention

Master the three core concepts of Natural Language Processing (NLP) that make LLMs possible. Learn how text becomes numbers, how numbers gain meaning, and how attention allows models to focus.

Large Language Models do not "read" words like humans do. Beneath the surface of every agentic response is a massive, high-speed mathematical operation. For an LLM Engineer, understanding how language is converted into math is essential for debugging performance issues, optimizing token costs, and designing effective RAG systems.

In this lesson, we break down the three fundamental pillars of NLP: Tokenization, Embeddings, and Attention.


1. Tokenization: Splitting the World into Chunks

Models cannot process continuous strings of text. They need discrete units called Tokens.

How it Works:

Most modern LLMs use Subword Tokenization. This means common words are one token, but rare words are broken into pieces.

  • "The" $\rightarrow$ [The] (1 token)
  • "Antigravity" $\rightarrow$ [Anti, gravity] (2 tokens)
  • "apple" $\rightarrow$ [apple] (1 token)

graph LR
    A["Raw Text: 'Learning AI'"] --> B[Tokenize]
    B --> C["[Learn, ing, AI]"]
    C --> D["Numerical IDs: [450, 23, 891]"]
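
To make the pipeline in the diagram concrete, here is a minimal sketch using the open-source tiktoken library. The IDs shown in the diagram are illustrative; the real splits and integer IDs depend entirely on the tokenizer's vocabulary.

import tiktoken

# "o200k_base" is the encoding used by gpt-4o; other models use other vocabularies.
enc = tiktoken.get_encoding("o200k_base")

ids = enc.encode("Learning AI")
print(ids)                             # the integer IDs the model actually sees
print([enc.decode([i]) for i in ids])  # the text chunk behind each ID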

Why LLM Engineers Care:

  • Cost: You are billed per token, not per word.
  • Context Window: Every model has a hard limit (e.g., 128k tokens). If your instructions consume too many tokens, there is no room left for your data or the model's response (see the sketch after this list).
  • Language Bias: Some languages are more "token-heavy" than others. The same sentence might take 10 tokens in English but 30 in a language that is underrepresented in the tokenizer's vocabulary, because its words get split into many small pieces.
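
Since both cost and the context window are measured in tokens, it pays to check a prompt's size programmatically before every call. Below is a minimal sketch; the limit and reserve numbers are illustrative assumptions, not official values for any particular model.

import tiktoken

CONTEXT_LIMIT = 128_000       # assumed hard limit for our example model
RESERVED_FOR_OUTPUT = 4_000   # assumed headroom for the model's reply

def fits_in_context(prompt: str, model_name: str = "gpt-4o") -> bool:
    # Count the prompt's tokens and compare against the usable budget.
    encoding = tiktoken.encoding_for_model(model_name)
    used = len(encoding.encode(prompt))
    return used <= CONTEXT_LIMIT - RESERVED_FOR_OUTPUT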

2. Embeddings: Meaning as Coordinates

Once we have tokens, how does the model know that "King" and "Queen" are related, but "King" and "Toaster" are not? The answer is Embeddings.

An Embedding is a long list of numbers (a vector) that represents the "meaning" of a token in a high-dimensional space.

The Vector Space

Imagine a map where words are located based on their characteristics:

  • X-axis: Royalty?
  • Y-axis: Gender?
  • Z-axis: Age?

If we use a 3D coordinate for "King," it might be [0.9, 0.1, 0.8]. "Queen" might be [0.9, 0.9, 0.8]. The distance between these points is small, so the model knows they are similar.
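
We can verify this intuition in a few lines of Python. The vectors below are the made-up 3D coordinates from the example above, plus an invented one for "Toaster"; real embeddings have hundreds or thousands of dimensions.

import math

def euclidean_distance(a, b):
    # Smaller distance = closer in meaning (in this toy space).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

king = [0.9, 0.1, 0.8]     # royal, male, adult
queen = [0.9, 0.9, 0.8]    # royal, female, adult
toaster = [0.0, 0.5, 0.1]  # invented: not royal, genderless, brand new

print(euclidean_distance(king, queen))    # small  -> related
print(euclidean_distance(king, toaster))  # larger -> unrelated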

graph TD
    A[Token] --> B[Embedding Model]
    B --> C["Vector: [0.12, -0.45, 0.88, ...]"]
    C --> D[Semantic Search / RAG]
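
In practice you rarely hand-craft vectors; you call an embedding model. Here is a minimal sketch using OpenAI's official Python SDK and its text-embedding-3-small model (it assumes the openai package is installed and OPENAI_API_KEY is set in your environment).

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="King",
)

vector = response.data[0].embedding
print(len(vector))  # 1536 dimensions for this model
print(vector[:5])   # the first few coordinates of "King"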

3. Attention: The "Focus" Mechanism

This is the "magic" that makes modern LLMs possible. Before 2017, NLP models (mostly recurrent networks) struggled to carry information across long sentences; by the time they reached the end, they had effectively forgotten the beginning. Attention (specifically Multi-Head Self-Attention, introduced in the 2017 paper "Attention Is All You Need") fixed this.

What is Attention?

Attention allows the model to look at every word in a sentence and decide which ones are most relevant to the word it is currently processing.

Example Sentence: "The animal didn't cross the street because it was too tired."

When the model processes the word "it," the Attention mechanism highlights "animal." If the sentence instead ended with "...because it was too wide," Attention would highlight "street."

graph TD
    A["Word: 'bank'"] --> B{Attention Context}
    B -->|"'river' nearby"| C["Meaning: Sloping Land"]
    B -->|"'money' nearby"| D["Meaning: Financial Institution"]

Code Example: Checking Tokens with tiktoken

As an LLM Engineer, you will often need to count tokens before sending text to an expensive API. OpenAI provides an open-source library called tiktoken for this.

import tiktoken

def count_tokens(text: str, model_name: str = "gpt-4o"):
    # Get the encoding for the specific model
    encoding = tiktoken.encoding_for_model(model_name)
    
    # Encode the text into integer IDs
    tokens = encoding.encode(text)
    
    # Return count and the actual chunks
    return {
        "count": len(tokens),
        "token_ids": tokens,
        "decoded": [encoding.decode([t]) for t in tokens]
    }

text = "LLM Engineering is incredibly exciting!"
stats = count_tokens(text)

print(f"Total Tokens: {stats['count']}")
print(f"Token Chunks: {stats['decoded']}")

Summary for the LLM Engineer

  1. Tokenization: The input format. Watch your token count to stay under budget and context limits.
  2. Embeddings: The "semantic bridge." This is the core of all RAG (Retrieval-Augmented Generation) systems.
  3. Attention: The reasoning logic. This is why models can handle long, complex instructions without getting confused.

In the next lesson, we will look at how these three pieces are wired together in a Neural Network, the "engine" that executes these operations.


Exercise: Token Guessing

Given the following sentence, how many tokens do you think it contains for a GPT-style model? "AI is the new electricity, but prompt engineering is the plug."

  1. Count the words.
  2. Identify 2 words that might be split into multiple tokens.
  3. Use a tool like the OpenAI Tokenizer (or the code above) to check your answer.

Understanding this "Token Density" will help you write much more cost-efficient prompts later in Module 4.
