
Tokenization and Context Length: How Gemini Reads
Demystifying tokens. Learn how Gemini counts words, images, and videos, and what the massive 1M+ token context window enables.
Tokenization and Context Length
When you pay for Gemini, you pay per token. But what exactly is a token?
What is a Token?
A token is the fundamental unit of text for an LLM. It is not exactly a word; it is a chunk of characters.
- Rule of Thumb: 1,000 tokens ≈ 750 words (in English).
- Example: The word "hamburger" might be 1 token. The word "antidisestablishmentarianism" might be 4 or 5 tokens.
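You can check counts yourself; here is a minimal sketch using the count_tokens call in the google-generativeai Python SDK (exact counts vary by model and tokenizer, and the API key is a placeholder):
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder: supply your own key
model = genai.GenerativeModel("gemini-1.5-pro")

# Ask the API how many tokens each string costs, without generating anything
print(model.count_tokens("hamburger").total_tokens)
print(model.count_tokens("antidisestablishmentarianism").total_tokens)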
Multimodal Tokens
How do you count the tokens of an image or video?
With Gemini:
- Images: An image costs a roughly fixed number of tokens (e.g., ~258 tokens), largely independent of resolution, because it is downscaled or tiled internally.
- Video: The model samples frames (e.g., 1 frame per second) and tokenizes each frame like an image, so cost scales with duration rather than file size. A 1-hour video might be ~700k to 1M tokens depending on the sampling rate.
- Audio: Audio is tokenized based on duration, at a fixed number of tokens per second of audio.
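The same count_tokens call accepts multimodal content, so you can check the cost of an image before sending it (a sketch; manual_page.png is a hypothetical local file and the key is a placeholder):
import PIL.Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")   # placeholder: supply your own key
model = genai.GenerativeModel("gemini-1.5-pro")

img = PIL.Image.open("manual_page.png")   # hypothetical local image
# The image adds a roughly fixed token cost on top of the text prompt
print(model.count_tokens([img, "What does this diagram show?"]).total_tokens)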
The Context Window
The Context Window is the "short-term memory" of the model. It is the amount of text/data you can paste into the prompt right now.
- GPT-4: ~128k tokens.
- Gemini 1.5 Pro: 1 Million to 2 Million tokens.
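You don't have to memorize these limits; they are exposed programmatically. A sketch of looking them up with the google-generativeai SDK (model name and key are placeholders):
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")      # placeholder: supply your own key
info = genai.get_model("models/gemini-1.5-pro")
print(info.input_token_limit)                # size of the context window
print(info.output_token_limit)               # maximum tokens per generated response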
What fits in 1 Million Tokens?
- ~700,000 words of text (most of the Harry Potter series).
- ~1 hour of video.
- ~11 hours of audio.
- ~30,000 lines of code.
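These figures follow roughly from the per-modality rates above; a back-of-the-envelope check (approximate rates, actual counts vary by model and content):
# Rough arithmetic for a 1,000,000-token window
TOKENS_PER_WORD = 1000 / 750        # from the rule of thumb above
TOKENS_PER_FRAME = 258              # approximate cost of one sampled video frame
print(f"{700_000 * TOKENS_PER_WORD:,.0f} tokens for ~700k words")
print(f"{3600 * TOKENS_PER_FRAME:,.0f} tokens for 1 hour of video at 1 frame/second")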
Why Long Context is a Game Changer
In the past, to analyze a huge codebase, you had to use RAG (Retrieval-Augmented Generation):
- Chop code into snippets.
- Guess which 5 snippets are relevant.
- Send only those 5 snippets to the AI.
Problem: If a bug involves a variable defined in File A, modified in File B, and crashing in File C, RAG often misses it because it never sees the connection between the files.
With Long Context: You dump all the code into Gemini. It sees everything simultaneously. It works like a human who has read the whole book, not just a few paragraphs.
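A minimal sketch of the "dump everything" approach (the src/ directory, file pattern, and key are assumptions; in practice you would check the total token count against the model's limit first):
import pathlib
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")   # placeholder: supply your own key
model = genai.GenerativeModel("gemini-1.5-pro")

# Concatenate every Python file under a hypothetical src/ directory into one prompt
code = "\n\n".join(f"# FILE: {p}\n{p.read_text()}" for p in pathlib.Path("src").rglob("*.py"))
response = model.generate_content(
    ["Trace this bug: a variable defined in one file is modified in another and crashes a third.", code])
print(response.text)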
Caching Context (Cost Optimization)
Sending 1 Million tokens with every API call is expensive and slow. Google introduced Context Caching.
- Scenario: You have a 500-page user manual. Users ask 1,000 different questions about it.
- Without Cache: You upload 500 pages (costing $$$) for every single question.
- With Cache: You upload the 500 pages once and receive a cache_id. For each subsequent question, you pass just the cache_id plus the new question. You pay a cheap "storage" fee but avoid the massive "input" fee.
# Context caching with the google-generativeai SDK (huge_document assumed defined above)
import datetime
import google.generativeai as genai
cache = genai.caching.CachedContent.create(model="models/gemini-1.5-flash-001",
                                           contents=[huge_document], ttl=datetime.timedelta(minutes=60))
# Fast, cheap calls that reference the cache instead of re-sending the document
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("How do I reset?")
Summary
- Tokens are the currency of LLMs.
- Gemini's massive context window enables "whole-problem" reasoning.
- Use Context Caching to make large-context apps economically viable.
In the final lesson of this module, we will learn how to handle the outputs: streaming, JSON, and safety blocks.