
What Tokens Are and How They Are Counted
Discover the fundamental building blocks of LLM communication. Learn how text is transformed into tokens, why character counts don't equal token counts, and how to master the tokenization process.
Welcome to the first lesson of the *Token Efficiency in LLM Use, Agentic AI, and Beyond* course. Before we can optimize our AI systems for cost and performance, we must understand the "currency" of Large Language Models: tokens.
In this lesson, we will peel back the layers of how machines read human language. We’ll move beyond the simplistic idea that "words are data" and explore the mathematical reality of tokenization.
1. The Core Definition: What is a Token?
A token is the basic unit of text that a Large Language Model (LLM) processes. If you think of an LLM as a highly advanced statistical engine, tokens are the individual pieces of information it uses to predict the next piece of information.
However, a token is not necessarily a word. It can be a single character, a part of a word (sub-word), or even a combination of punctuation and whitespace.
The rule of thumb
For English text:
- 1 token ≈ 0.75 words
- 100 tokens ≈ 75 words
- 1 whitespace character ≈ usually merged into the following word, or occasionally a token of its own
Why not just use words?
If we used whole words, our "vocabulary" would be infinite. New words are created every day, and languages like German create massive compound words. By using sub-word tokens (like Byte Pair Encoding or BPE), models can represent any word in existence using a finite set of building blocks (usually between 32,000 and 128,000 unique tokens).
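If you want to see these vocabulary sizes for yourself, a quick check with `tiktoken` (introduced properly in section 3) works well; `n_vocab` is the attribute tiktoken uses to report vocabulary size:

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models
encoding = tiktoken.get_encoding("cl100k_base")

# n_vocab reports how many distinct token IDs the vocabulary contains
print(f"Vocabulary size: {encoding.n_vocab}")  # roughly 100k tokens
```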
2. How Mapping Works: From Text to ID
When you send a prompt to AWS Bedrock or OpenAI, the first thing that happens is Tokenization. Your raw string is converted into a list of integers.
```mermaid
graph LR
    A[Raw Text: 'Hello world!'] --> B[Tokenizer]
    B --> C[Token IDs: 15496, 995, 0]
    C --> D[Model Embedding Layer]
    D --> E[Mathematical Vector]
```
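To make the pipeline concrete, here is a minimal encode/decode round trip. The example IDs in the diagram (15496, 995, 0) happen to match the older GPT-2 vocabulary, so this sketch loads that encoding; newer models use different vocabularies and will produce different IDs.

```python
import tiktoken

# Load the GPT-2 vocabulary, which the example IDs above come from
encoding = tiktoken.get_encoding("gpt2")

token_ids = encoding.encode("Hello world!")
print(token_ids)                    # e.g. [15496, 995, 0]

# Decoding the IDs reconstructs the original string
print(encoding.decode(token_ids))   # "Hello world!"
```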
Each integer corresponds to a specific token in the model's vocabulary. For example, in a GPT-style vocabulary:
- `"The"` might be `464`
- `" the"` (with a leading space) might be `262`
This distinction is crucial: Whitespace matters. Extra spaces in your prompts are not just "blank"; they are consumed tokens that you pay for.
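You can verify this yourself. The sketch below compares a normal sentence with a space-padded version; the exact counts depend on the encoding you load, but the padded version will typically cost more tokens.

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

tight = "Summarize the following report."
padded = "Summarize   the   following   report."  # extra spaces

print(len(encoding.encode(tight)))   # fewer tokens
print(len(encoding.encode(padded)))  # the extra spaces are consumed as tokens
```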
3. Counting Tokens in Python
To build production-grade applications, you cannot guess your token count. You must use tools like `tiktoken` (for OpenAI models), the `transformers` library (for open-weight models such as Meta's Llama), or a provider's own token-counting API (Anthropic exposes one for Claude).
Python Practice: Token Counting with Tiktoken
Here is how you can programmatically check the token count of a string before sending it to an API.
```python
import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    """
    Returns the number of tokens in a text string.
    """
    # Load the encoding for the specific model
    encoding = tiktoken.encoding_for_model(model)
    # Encode the text into token IDs
    tokens = encoding.encode(text)
    # The length of the list is our token count
    return len(tokens)

# Example usage
prompt = "Token efficiency is the secret to scalable AI."
num_tokens = count_tokens(prompt)
print(f"Text: '{prompt}'")
print(f"Token Count: {num_tokens}")

# Visualizing the tokens
encoding = tiktoken.encoding_for_model("gpt-4")
token_strings = [encoding.decode([t]) for t in encoding.encode(prompt)]
print(f"Tokens: {token_strings}")
```
Why this matters for FastAPI
If you are building an API that proxies LLM requests (common in enterprise middleware), you should count tokens before hitting the provider, as sketched after the list below. This allows you to:
- Reject requests that exceed a user's budget.
- Route long prompts to models with larger context windows.
- Cache responses based on token fingerprints.
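Here is a minimal sketch of such a pre-flight budget check, reusing the `count_tokens` helper from section 3. The endpoint path, request model, and the 4,096-token budget are illustrative assumptions, not part of any real provider API.

```python
import tiktoken
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
TOKEN_BUDGET = 4_096  # illustrative per-request budget

class CompletionRequest(BaseModel):
    prompt: str
    model: str = "gpt-4"

def count_tokens(text: str, model: str = "gpt-4") -> int:
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

@app.post("/v1/completions")
def create_completion(request: CompletionRequest):
    prompt_tokens = count_tokens(request.prompt, request.model)

    # Reject the request before paying for a provider call
    if prompt_tokens > TOKEN_BUDGET:
        raise HTTPException(
            status_code=413,
            detail=f"Prompt uses {prompt_tokens} tokens; budget is {TOKEN_BUDGET}.",
        )

    # ... forward the request to the LLM provider here ...
    return {"prompt_tokens": prompt_tokens, "status": "accepted"}
```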
4. Sub-word Tokenization and Rare Words
Common words like "apple" are usually a single token. However, rare words, technical jargon, or code snippets are often broken into many small tokens.
Example: antigravity
A model might see this as two or more sub-word pieces, for example something like `anti` + `gravity`, rather than a single token; the exact split depends on the tokenizer's vocabulary.
Example: 0.00000001
Numbers are notoriously token-heavy. While `100` might be one token, `123,456.78` might be split into 4 or 5 tokens depending on the commas and decimals, and a long decimal like `0.00000001` can cost even more.
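Because the exact splits vary by tokenizer and version, the most reliable approach is to inspect them directly. This snippet prints how a GPT-4-era encoding breaks up each sample:

```python
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")

for sample in ["apple", "antigravity", "100", "123,456.78", "0.00000001"]:
    ids = encoding.encode(sample)
    pieces = [encoding.decode([i]) for i in ids]
    print(f"{sample!r}: {len(ids)} token(s) -> {pieces}")
```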
The Impact on Cost
If your application processes medical documents or legal contracts filled with rare terminology, you will consume far more tokens per word than the 0.75-words-per-token rule of thumb suggests. You might find that 1,000 words of legal text consume 2,000 tokens instead of the ~1,333 the rule of thumb predicts, significantly inflating your expected cost.
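To see how that feeds into your bill, here is a back-of-the-envelope estimate. The $0.01 per 1,000 input tokens price is a made-up placeholder; substitute your provider's current rates.

```python
# Hypothetical pricing: $0.01 per 1,000 input tokens (placeholder value)
PRICE_PER_1K_INPUT = 0.01

words = 1_000
rule_of_thumb_tokens = words / 0.75   # ~1,333 tokens for plain English
legal_text_tokens = 2_000             # observed for dense legal prose

print(f"Expected cost: ${rule_of_thumb_tokens / 1000 * PRICE_PER_1K_INPUT:.4f}")
print(f"Actual cost:   ${legal_text_tokens / 1000 * PRICE_PER_1K_INPUT:.4f}")
```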
5. Visualizing the Tokenization Process
Understanding the "boundary" of tokens is a superpower for prompt engineers.
```mermaid
graph TD
    subgraph "Tokenization of 'Thinking...'"
        T1[Token 1: 'Thin']
        T2[Token 2: 'king']
        T3[Token 3: '...']
    end
    subgraph "Tokenization of ' Thinking '"
        T4[Token 1: ' Thinking']
        T5[Token 2: ' ']
    end
```
Notice how a leading space often gets merged into the word, but a trailing space often stands alone. These "Ghost Tokens" can add up in large-scale agentic loops.
6. Tokenization in AWS Bedrock
When using AWS Bedrock, different models use different tokenizers. For example, Claude 3 (Anthropic) uses a different vocabulary than Llama 3 (Meta).
If you are building a multi-model application using LangChain, you should use the get_token_ids method to ensure you are accurately measuring usage across different providers.
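A rough sketch of that pattern is shown below. It assumes the `langchain-aws` package and its `ChatBedrock` class; class names, Bedrock model IDs, and the tokenizer each wrapper uses under the hood vary across LangChain versions, so treat this as a shape to follow rather than exact code.

```python
from langchain_aws import ChatBedrock  # assumes the langchain-aws package is installed

prompt = "Token efficiency is the secret to scalable AI."

# Illustrative Bedrock model IDs; use whatever models your application targets
models = {
    "claude": ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0"),
    "llama": ChatBedrock(model_id="meta.llama3-8b-instruct-v1:0"),
}

for name, model in models.items():
    # get_num_tokens / get_token_ids come from LangChain's base language model interface
    print(f"{name}: {model.get_num_tokens(prompt)} tokens")
```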
AWS Bedrock Example (Python SDK)
```python
import boto3
import json

# Initialize the Bedrock runtime client
bedrock = boto3.client(service_name="bedrock-runtime")

def invoke_efficiently(prompt):
    # Note: Bedrock doesn't always return token counts in the
    # immediate response body for all models. Standardizing this
    # in your middleware is essential.

    # Claude's text-completions API expects the Human/Assistant framing
    body = json.dumps({
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": 200,
        "temperature": 0.5,
    })

    response = bedrock.invoke_model(
        body=body,
        modelId="anthropic.claude-v2"
    )

    # Post-processing: Calculate tokens manually or use
    # Amazon CloudWatch metrics for precise accounting.
    return response
```
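One way to standardize that accounting in your middleware: at the time of writing, Bedrock's InvokeModel responses report token counts in HTTP response headers (`x-amzn-bedrock-input-token-count` and `x-amzn-bedrock-output-token-count`). Verify the header names against the current AWS documentation before relying on them; the helper below reads them defensively.

```python
def extract_token_usage(response) -> dict:
    """Pull token counts from Bedrock's InvokeModel response headers, if present."""
    headers = response["ResponseMetadata"]["HTTPHeaders"]
    return {
        "input_tokens": int(headers.get("x-amzn-bedrock-input-token-count", -1)),
        "output_tokens": int(headers.get("x-amzn-bedrock-output-token-count", -1)),
    }

# Example usage with the invoke_efficiently helper above:
# usage = extract_token_usage(invoke_efficiently("Summarize our Q3 report."))
```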
7. The Performance Trade-off
Counting tokens is not free: it adds a small amount of latency to every request, typically microseconds to a few milliseconds for a local tokenizer. However, in the context of an LLM call that might take 2-10 seconds, that overhead is negligible.
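If you want to measure the overhead yourself, a quick timing check like the one below (results vary by machine and prompt length) usually shows local tokenization finishing in well under a millisecond:

```python
import time
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
prompt = "Token efficiency is the secret to scalable AI. " * 50  # a few hundred tokens

start = time.perf_counter()
token_count = len(encoding.encode(prompt))
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"Counted {token_count} tokens in {elapsed_ms:.3f} ms")
```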
Senior Engineer Advice: Always count tokens locally before sending a request. It is the only way to build a reliable, cost-aware system.
Summary and Key Takeaways
- Tokens are math, not language: Models don't see words; they see numerical IDs.
- 1,000 tokens ≈ 750 words: Use this for quick estimations, but code for precision.
- Sub-word splits: Rare words and code consume more tokens than common English.
- Validation is key: Use libraries like `tiktoken` to validate prompt size before making expensive API calls.
In the next lesson, we will dive into the economics of Input vs. Output Tokens and why the "direction" of the data changes the price you pay.
Exercise: The Tokenizer Test
- Predict how many tokens are in the phrase: `"LLM orchestration is complex."`
- Run the Python `tiktoken` script provided above to verify.
- Add three spaces between `"is"` and `"complex"` and see how the token count changes.