What Tokens Are and How They Are Counted

Discover the fundamental building blocks of LLM communication. Learn how text is transformed into tokens, why character counts don't equal token counts, and how to master the tokenization process.

Welcome to the first lesson of the Token Efficiency in LLM Use, Agentic AI, and Beyond course. Before we can optimize our AI systems for cost and performance, we must understand the "currency" of Large Language Models: Tokens.

In this lesson, we will peel back the layers of how machines read human language. We’ll move beyond the simplistic idea that "words are data" and explore the mathematical reality of tokenization.


1. The Core Definition: What is a Token?

A token is the basic unit of text that a Large Language Model (LLM) processes. If you think of an LLM as a highly advanced statistical engine, tokens are the individual pieces of information it uses to predict the next piece of information.

However, a token is not necessarily a word. It can be a single character, a part of a word (sub-word), or even a combination of punctuation and whitespace.

The rule of thumb

For English text (a quick estimator is sketched after this list):

  • 1 token ≈ 0.75 words
  • 100 tokens ≈ 75 words
  • A single space is usually absorbed into the token of the word that follows it; stray or trailing spaces often become tokens of their own
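
If you only need a ballpark figure before reaching for a real tokenizer, the ratio above can be wrapped in a tiny helper. This is a rough sketch only: the 0.75 ratio is a heuristic for English prose, and Section 3 shows how to get exact counts.

def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough estimate based on the ~0.75 words-per-token heuristic."""
    word_count = len(text.split())
    return round(word_count / words_per_token)

print(estimate_tokens("Token efficiency is the secret to scalable AI."))  # ~11 (estimate only)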

Why not just use words?

If we used whole words, our "vocabulary" would be infinite. New words are created every day, and languages like German create massive compound words. By using a sub-word scheme such as Byte Pair Encoding (BPE), models can represent any word in existence with a finite set of building blocks (typically between 32,000 and 128,000 unique tokens).


2. How Mapping Works: From Text to ID

When you send a prompt to AWS Bedrock or OpenAI, the first thing that happens is Tokenization. Your raw string is converted into a list of integers.

graph LR
    A[Raw Text: 'Hello world!'] --> B[Tokenizer]
    B --> C[Token IDs: 15496, 995, 0]
    C --> D[Model Embedding Layer]
    D --> E[Mathematical Vector]

Each integer corresponds to a specific token in the model's vocabulary. For example, in the GPT-2 vocabulary:

  • "The" might be 464
  • " the" (with a space) might be 262

This distinction is crucial: Whitespace matters. Extra spaces in your prompts are not just "blank"; they are consumed tokens that you pay for.
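
You can see this for yourself with tiktoken. Here is a minimal sketch, assuming the cl100k_base encoding; the IDs it prints are illustrative and will differ between encodings.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

print(enc.encode("The"))    # ID for "The" at the start of a string
print(enc.encode(" the"))   # a different ID for " the" with a leading space
print(len(enc.encode("Hello world")), len(enc.encode("Hello   world")))  # extra spaces usually cost extra tokens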


3. Counting Tokens in Python

To build production-grade applications, you cannot guess your token count. You must use tools like tiktoken (for OpenAI models), the transformers library (for open-weight models such as Llama), or the provider's own token-counting API (Anthropic offers one for Claude).

Python Practice: Token Counting with Tiktoken

Here is how you can programmatically check the token count of a string before sending it to an API.

import tiktoken

def count_tokens(text: str, model="gpt-4") -> int:
    """
    Returns the number of tokens in a text string.
    """
    # Load the encoding for the specific model
    encoding = tiktoken.encoding_for_model(model)
    
    # Encode the text into token IDs
    tokens = encoding.encode(text)
    
    # The length of the list is our token count
    return len(tokens)

# Example usage
prompt = "Token efficiency is the secret to scalable AI."
num_tokens = count_tokens(prompt)

print(f"Text: '{prompt}'")
print(f"Token Count: {num_tokens}")

# Visualizing the tokens
encoding = tiktoken.encoding_for_model("gpt-4")
token_strings = [encoding.decode([t]) for t in encoding.encode(prompt)]
print(f"Tokens: {token_strings}")

Why this matters for FastAPI

If you are building an API that proxies LLM requests (common in enterprise middleware), you should count tokens before hitting the provider, as in the sketch that follows this list. This allows you to:

  1. Reject requests that exceed a user's budget.
  2. Route long prompts to models with larger context windows.
  3. Cache responses based on token fingerprints.
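
Here is a minimal sketch of such a middleware endpoint. The budget constant, route name, and the forwarding step are hypothetical placeholders; the point is simply that the tiktoken count happens before any provider call.

import tiktoken
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
encoding = tiktoken.encoding_for_model("gpt-4")

MAX_PROMPT_TOKENS = 4_000  # hypothetical per-request budget


class PromptRequest(BaseModel):
    prompt: str


@app.post("/generate")
async def generate(req: PromptRequest):
    token_count = len(encoding.encode(req.prompt))

    # 1. Reject over-budget prompts before paying for them.
    if token_count > MAX_PROMPT_TOKENS:
        raise HTTPException(
            status_code=413,
            detail=f"Prompt is {token_count} tokens; the limit is {MAX_PROMPT_TOKENS}.",
        )

    # 2. Route to a larger-context model, or 3. look up a cache keyed on a
    #    token fingerprint here, then forward to the provider (omitted).
    return {"token_count": token_count, "status": "forwarded"}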

4. Sub-word Tokenization and Rare Words

Common words like "apple" are usually a single token. However, rare words, technical jargon, or code snippets are often broken into many small tokens.

Example: antigravity. A model might see this as two tokens:

  • anti
  • gravity

Example: numbers such as 0.00000001. Numbers are notoriously token-heavy. While 100 might be a single token, 123,456.78 might be split into 4 or 5 tokens depending on how the commas and decimal point fall.
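
You can inspect these splits directly. A quick sketch with tiktoken; the exact pieces depend on the tokenizer, so run it against the encoding you actually use.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

for sample in ["apple", "antigravity", "100", "123,456.78", "0.00000001"]:
    pieces = [enc.decode([t]) for t in enc.encode(sample)]
    print(f"{sample!r} -> {len(pieces)} token(s): {pieces}")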

The Impact on Cost

If your application processes medical documents or legal contracts filled with rare terminology, your "word-to-token" ratio will be much higher than 0.75. You might find that 1,000 words of legal text consume 2,000 tokens, doubling your expected cost.
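
Measuring the ratio for your own domain is straightforward. The sketch below compares a plain sentence against a jargon-heavy one; both sentences are made up for illustration.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

plain = "The patient felt much better after a good night of sleep."
jargon = "Patient exhibited idiopathic thrombocytopenic purpura with splenomegaly."

for text in (plain, jargon):
    words = len(text.split())
    tokens = len(enc.encode(text))
    print(f"{words} words -> {tokens} tokens ({tokens / words:.2f} tokens per word)")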


5. Visualizing the Tokenization Process

Understanding the "boundary" of tokens is a superpower for prompt engineers.

graph TD
    subgraph "Tokenization of 'Thinking...'"
        T1[Token 1: 'Thin']
        T2[Token 2: 'king']
        T3[Token 3: '...']
    end
    
    subgraph "Tokenization of ' Thinking '"
        T4[Token 1: ' Thinking']
        T5[Token 2: ' ']
    end

Notice how a leading space often gets merged into the word, but a trailing space often stands alone. These "Ghost Tokens" can add up in large-scale agentic loops.
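
A quick way to spot ghost tokens is to count a few whitespace variants. A short sketch with tiktoken; the counts are illustrative and depend on the encoding.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

for variant in ["Thinking", " Thinking", "Thinking ", " Thinking "]:
    print(repr(variant), "->", len(enc.encode(variant)), "token(s)")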


6. Tokenization in AWS Bedrock

When using AWS Bedrock, different models use different tokenizers. For example, Claude 3 (Anthropic) uses a different vocabulary than Llama 3 (Meta).

If you are building a multi-model application using LangChain, you should use the get_token_ids method to ensure you are accurately measuring usage across different providers.
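
A minimal sketch of that approach, assuming langchain_aws is installed and AWS credentials are configured; the model IDs are examples. Note that unless the integration ships the provider's actual tokenizer, LangChain's default get_token_ids / get_num_tokens falls back to a generic tokenizer, so treat the numbers as estimates rather than billing-accurate counts.

from langchain_aws import ChatBedrock

claude = ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0")
llama = ChatBedrock(model_id="meta.llama3-8b-instruct-v1:0")

prompt = "Token efficiency is the secret to scalable AI."

# get_num_tokens comes from LangChain's base model interface and may be an
# approximation for models whose tokenizer is not publicly available.
for name, model in [("Claude 3", claude), ("Llama 3", llama)]:
    print(name, "->", model.get_num_tokens(prompt), "tokens (approximate)")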

AWS Bedrock Example (Python SDK)

import boto3
import json

# Initialize the Bedrock client
bedrock = boto3.client(service_name='bedrock-runtime')

def invoke_efficiently(prompt: str) -> str:
    # Note: Bedrock doesn't always return token counts in the
    # immediate response body for all models. Standardizing this
    # in your middleware is essential.
    # Claude v2's text-completion API expects the Human/Assistant prompt format.
    body = json.dumps({
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": 200,
        "temperature": 0.5,
    })

    response = bedrock.invoke_model(
        body=body,
        modelId="anthropic.claude-v2"
    )

    # Post-processing: calculate tokens manually (e.g. with a local tokenizer)
    # or use Amazon CloudWatch metrics for precise accounting.
    response_body = json.loads(response["body"].read())
    return response_body["completion"]

7. The Performance Trade-off

Counting tokens is not free, but it is cheap. A local tokenizer typically finishes in well under a millisecond for a short prompt and only a few milliseconds for very long ones. In the context of an LLM call that might take 2-10 seconds, that overhead is negligible.
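
If you want to verify that claim on your own hardware, a rough local benchmark is enough; the sketch below simply times one encode call on a repeated sample string.

import time
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
text = "Token efficiency is the secret to scalable AI. " * 100  # a few thousand characters

start = time.perf_counter()
count = len(enc.encode(text))
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"Counted {count} tokens in {elapsed_ms:.3f} ms")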

Senior Engineer Advice: Always count tokens locally before sending a request. It is the only way to build a reliable, cost-aware system.


Summary and Key Takeaways

  1. Tokens are math, not language: Models don't see words; they see numerical IDs.
  2. 1,000 tokens ≈ 750 words: Use this for quick estimations, but code for precision.
  3. Sub-word splits: Rare words and code consume more tokens than common English.
  4. Validation is key: Use libraries like tiktoken to validate prompt size before making expensive API calls.

In the next lesson, we will dive into the economics of Input vs. Output Tokens and why the "direction" of the data changes the price you pay.


Exercise: The Tokenizer Test

  1. Predict how many tokens are in the phrase: "LLM orchestration is complex."
  2. Run the Python tiktoken script provided above to verify.
  3. Add three spaces between "is" and "complex" and see how the token count changes.

Congratulations on completing Lesson 1! You are now speaking the language of the machine.
