Tokenization and Context Length: How Gemini Reads

Demystifying tokens. Learn how Gemini counts words, images, and videos, and what the massive 1M+ token context window enables.

When you pay for Gemini, you pay per token. But what exactly is a token?

What is a Token?

A token is the fundamental unit of text for an LLM. It is not exactly a word; it is a chunk of characters.

  • Rule of Thumb: 1,000 tokens ≈ 750 words (in English).
  • Example: The word "hamburger" might be 1 token. The word "antidisestablishmentarianism" might be 4 or 5 tokens.
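
If you want exact numbers rather than a rule of thumb, the SDK can count tokens for you before you send anything. A minimal sketch using the google-generativeai Python package (the API key, model name, and sample sentence are placeholders):

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# count_tokens runs the tokenizer without generating a response
resp = model.count_tokens("The quick brown fox jumps over the lazy dog.")
print(resp.total_tokens)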

Multimodal Tokens

How do you count the tokens of an image or video?

With Gemini:

  • Images: A small image costs a fixed number of tokens (~258 per image). Larger images are downscaled and split into tiles internally, with each tile costing roughly the same flat amount.
  • Video: Gemini samples the video (e.g., 1 frame per second), tokenizes each frame like an image, and tokenizes the audio track separately. A 1-hour video can run roughly 700k to 1M tokens depending on the sampling rate.
  • Audio: Audio is tokenized based on duration, at roughly 32 tokens per second (about 1,900 tokens per minute) — see the sketch below for checking these counts yourself.
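
These rates vary by model version, so it helps to measure rather than memorize: count_tokens also accepts images and uploaded media files. A minimal sketch, assuming a local diagram.png and lecture.mp3 exist (both file names are hypothetical):

import PIL.Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# Small images come out around the flat per-image rate
image = PIL.Image.open("diagram.png")
print(model.count_tokens([image]).total_tokens)

# Audio and video go through the File API; their token count scales with duration
audio = genai.upload_file(path="lecture.mp3")
print(model.count_tokens([audio]).total_tokens)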

The Context Window

The Context Window is the "short-term memory" of the model. It is the amount of text/data you can paste into the prompt right now.

  • GPT-4 Turbo / GPT-4o: ~128k tokens.
  • Gemini 1.5 Pro: 1 million to 2 million tokens.
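
The exact limits differ between model versions, so it is safer to query them than to hard-code them. A small sketch using the google-generativeai SDK (the printed values depend on which model you request):

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
info = genai.get_model("models/gemini-1.5-pro")

print(info.input_token_limit)   # maximum tokens you can send in one request
print(info.output_token_limit)  # maximum tokens the model can return in one reply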

What fits in 1 Million Tokens?

  • ~700,000 words of text (most of the Harry Potter series).
  • ~1 hour of video.
  • ~11 hours of audio.
  • ~30,000 lines of code.
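
These figures follow from the rule of thumb above: 1,000,000 tokens × ~0.7–0.75 words per token ≈ 700,000–750,000 words of English text.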

Why Long Context is a Game Changer

In the past, to analyze a huge codebase, you had to use RAG (Retrieval-Augmented Generation):

  1. Chop code into snippets.
  2. Guess which 5 snippets are relevant.
  3. Send only those 5 snippets to the AI.

Problem: If the bug involves a variable defined in File A, modified in File B, and crashing in File C, RAG often misses it because it never sees the connection between the snippets.

With Long Context: You dump all the code into Gemini. It sees everything simultaneously. It works like a human who has read the whole book, not just a few paragraphs.
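
As a rough illustration of the "dump everything" approach, here is a sketch that concatenates a hypothetical my_project/ Python codebase into one prompt (in practice, check the total with count_tokens first and mind the cost):

from pathlib import Path
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

# Tag every file with its path so the model can reason across file boundaries
sources = [f"### FILE: {p}\n{p.read_text()}" for p in Path("my_project").rglob("*.py")]

prompt = "\n\n".join(sources) + "\n\nTrace the crash: where is the offending variable defined and modified?"
response = model.generate_content(prompt)
print(response.text)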

Caching Context (Cost Optimization)

Sending 1 Million tokens with every API call is expensive and slow. Google introduced Context Caching.

  • Scenario: You have a 500-page user manual. Users ask 1,000 different questions about it.
  • Without Cache: You upload 500 pages (costing $$$) for every single question.
  • With Cache: You upload the 500 pages once and get back a cache reference. For subsequent questions, you just pass that reference plus the new question. You pay a cheap "storage" fee, but avoid the massive "input" fee.
# Conceptual context caching with the google-generativeai SDK
import datetime
from google.generativeai import GenerativeModel, caching

# Upload the huge document once; cached tokens are billed at a cheaper storage rate
cache = caching.CachedContent.create(model="models/gemini-1.5-flash-001",
                                     contents=[huge_document],
                                     ttl=datetime.timedelta(minutes=60))

# Fast, cheap calls referencing the cache (no need to re-send the document)
model = GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("How do I reset?")
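
The ttl controls how long the cache lives: cached tokens are billed at a per-hour storage rate instead of the full input price on every call, so pick a lifetime that matches how long users will actually be asking about the document.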

Summary

  • Tokens are the currency of LLMs.
  • Gemini's massive context window enables "whole-problem" reasoning.
  • Use Context Caching to make large-context apps economically viable.

In the final lesson of this module, we will learn how to handle the Outputs—streaming, JSON, and safety blocks.
