
Tokenization and Context Length: How Gemini Reads
Demystifying tokens. Learn how Gemini counts words, images, and videos, and what the massive 1M+ token context window enables.
Tokenization and Context Length
When you pay for Gemini, you pay per token. But what exactly is a token?
What is a Token?
A token is the fundamental unit of text for an LLM. It is not exactly a word; it is a chunk of characters.
- Rule of Thumb: 1,000 tokens ≈ 750 words (in English).
- Example: The word "hamburger" might be 1 token. The word "antidisestablishmentarianism" might be 4 or 5 tokens.
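You can check counts yourself; here is a minimal sketch using the count_tokens call in the google-generativeai Python SDK (exact counts vary by model and tokenizer, and the API key is a placeholder):
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder: supply your own key
model = genai.GenerativeModel("gemini-1.5-pro")

# Ask the API how many tokens each string costs, without generating anything
print(model.count_tokens("hamburger").total_tokens)
print(model.count_tokens("antidisestablishmentarianism").total_tokens)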
Multimodal Tokens
How do you count the tokens of an image or video?
With Gemini:
- Images: An image costs a roughly fixed number of tokens (e.g., ~258 tokens), largely independent of resolution, because it is downscaled or tiled internally.
- Video: The model samples frames (e.g., 1 frame per second) and tokenizes each frame like an image, so cost scales with duration rather than file size. A 1-hour video might be ~700k to 1M tokens depending on the sampling rate.
- Audio: Audio is tokenized based on duration, at a fixed number of tokens per second of audio.
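The same count_tokens call accepts multimodal content, so you can check the cost of an image before sending it (a sketch; manual_page.png is a hypothetical local file and the key is a placeholder):
import PIL.Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")   # placeholder: supply your own key
model = genai.GenerativeModel("gemini-1.5-pro")

img = PIL.Image.open("manual_page.png")   # hypothetical local image
# The image adds a roughly fixed token cost on top of the text prompt
print(model.count_tokens([img, "What does this diagram show?"]).total_tokens)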
The Context Window
The Context Window is the "short-term memory" of the model. It is the amount of text/data you can paste into the prompt right now.
- GPT-4: ~128k tokens.
- Gemini 1.5 Pro: 1 Million to 2 Million tokens.
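You don't have to memorize these limits; they are exposed programmatically. A sketch of looking them up with the google-generativeai SDK (model name and key are placeholders):
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")      # placeholder: supply your own key
info = genai.get_model("models/gemini-1.5-pro")
print(info.input_token_limit)                # size of the context window
print(info.output_token_limit)               # maximum tokens per generated response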
What fits in 1 Million Tokens?
- ~700,000 words of text (most of the Harry Potter series).
- ~1 hour of video.
- ~11 hours of audio.
- ~30,000 lines of code.
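These figures follow roughly from the per-modality rates above; a back-of-the-envelope check (approximate rates, actual counts vary by model and content):
# Rough arithmetic for a 1,000,000-token window
TOKENS_PER_WORD = 1000 / 750        # from the rule of thumb above
TOKENS_PER_FRAME = 258              # approximate cost of one sampled video frame
print(f"{700_000 * TOKENS_PER_WORD:,.0f} tokens for ~700k words")
print(f"{3600 * TOKENS_PER_FRAME:,.0f} tokens for 1 hour of video at 1 frame/second")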
Why Long Context is a Game Changer
In the past, to analyze a huge codebase, you had to use RAG (Retrieval-Augmented Generation):
- Chop code into snippets.
- Guess which 5 snippets are relevant.
- Send only those 5 snippets to the AI.
Problem: If a bug involves a variable defined in File A, modified in File B, and crashing in File C, RAG often misses it because it never sees the connection between the files.
With Long Context: You dump all the code into Gemini. It sees everything simultaneously. It works like a human who has read the whole book, not just a few paragraphs.
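A minimal sketch of the "dump everything" approach (the src/ directory, file pattern, and key are assumptions; in practice you would check the total token count against the model's limit first):
import pathlib
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")   # placeholder: supply your own key
model = genai.GenerativeModel("gemini-1.5-pro")

# Concatenate every Python file under a hypothetical src/ directory into one prompt
code = "\n\n".join(f"# FILE: {p}\n{p.read_text()}" for p in pathlib.Path("src").rglob("*.py"))
response = model.generate_content(
    ["Trace this bug: a variable defined in one file is modified in another and crashes a third.", code])
print(response.text)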
Caching Context (Cost Optimization)
Sending 1 Million tokens with every API call is expensive and slow. Google introduced Context Caching.
- Scenario: You have a 500-page user manual. Users ask 1,000 different questions about it.
- Without Cache: You upload 500 pages (costing $$$) for every single question.
- With Cache: You upload the 500 pages once and receive a cache_id. For each subsequent question, you pass just the cache_id plus the new question. You pay a cheap "storage" fee but avoid the massive "input" fee.
# Context caching with the google-generativeai SDK (huge_document assumed defined above)
import datetime
import google.generativeai as genai
cache = genai.caching.CachedContent.create(model="models/gemini-1.5-flash-001",
                                           contents=[huge_document], ttl=datetime.timedelta(minutes=60))
# Fast, cheap calls that reference the cache instead of re-sending the document
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("How do I reset?")
Summary
- Tokens are the currency of LLMs.
- Gemini's massive context window enables "whole-problem" reasoning.
- Use Context Caching to make large-context apps economically viable.
In the final lesson of this module, we will learn how to handle the outputs: streaming, JSON, and safety blocks.