
Building Agents that Watch: Video and Temporal Reasoning
Unlock the fourth dimension of AI agency. Master Gemini's video capabilities to build agents that understand events over time, detect specific actions, and perform complex temporal reasoning on hour-long recordings.
Video is the richest communication medium we have. It contains visual data, audio data, and—most importantly—Temporal Data (the change of state over time). Monitoring a security camera, editing a highlight reel, or analyzing a surgical procedure all require an agent that doesn't just see a "frame," but understands a "sequence."
With the Gemini ADK, video analysis is a first-class citizen. Because Gemini 1.5 Pro has a context window of up to 2 million tokens, it doesn't need to "summarize" a video into text first; it can "watch" a 1-hour video natively. In this lesson, we will explore the technical mechanics of video ingestion, the science of temporal reasoning, and how to build agents that generate value from moving images.
1. How Gemini "Watches" Video
Technically, Gemini treats a video as a sequence of Images (Frames) paired with a synchronized Audio stream.
Sampling Rates
To manage the massive amount of data in a video, Gemini doesn't look at every single frame (e.g., all 30 frames per second of typical footage). Instead, it samples the video; the sketch after this list mimics that behavior locally.
- Default Sampling: Typically 1 frame per second (1fps).
- Impact: If an event happens extremely quickly (less than 1 second), the model might miss the visual but will likely "hear" it via the audio track.
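To build intuition for what 1fps sampling means, here is a minimal local sketch using OpenCV. This is our own illustration, not part of the Gemini API; Gemini performs its sampling server-side, and the file name 'clip.mp4' is a placeholder.

import cv2  # pip install opencv-python

def sample_frames(path: str, target_fps: float = 1.0):
    """Extract roughly one frame per second, mimicking Gemini's default sampling."""
    cap = cv2.VideoCapture(path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, round(native_fps / target_fps))  # keep every Nth frame
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

# A 30fps clip yields about 1 sampled frame per second of footage:
print(len(sample_frames('clip.mp4')))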
2. The Power of Temporal Reasoning
Temporal reasoning is the ability to connect an event at Time A with a result at Time B.
A. Causality
"The cup broke at 02:15. What caused it?" Gemini can scan back to 02:14, see a cat jump on the table, and correctly identify the cause-and-effect relationship.
B. Action Recognition
Standard vision models can see a "person" and a "box." Gemini can see "a person lifting a box." It understands the Verb, not just the Noun.
C. Large-Scale Activity Retrieval
You can ask: "In this 45-minute workshop, show me every time the speaker wrote on the whiteboard." Gemini will provide a list of exact timestamps.
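A minimal prompt sketch for this kind of retrieval. The file name and JSON shape are our own illustrative choices, not an API requirement; the full upload-and-wait loop appears in Section 3.

import google.generativeai as genai

# 'workshop.mp4' is a placeholder; in practice, wait for the file to reach
# the ACTIVE state before prompting (see Section 3).
video_file = genai.upload_file(path='workshop.mp4')
model = genai.GenerativeModel('gemini-1.5-pro')
prompt = [
    "List every moment the speaker writes on the whiteboard.",
    'Return JSON: [{"timestamp": "MM:SS", "event": "short description"}]',
    video_file,
]
print(model.generate_content(prompt).text)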
3. Implementation: The "Event Logger" Agent
Video MUST be uploaded through the Google File API; it is too large to send as inline bytes.
import time

import google.generativeai as genai

# Assumes GOOGLE_API_KEY is set in the environment (otherwise call genai.configure()).

# 1. Upload the video via the File API
video_file = genai.upload_file(path='factory_floor_safety.mp4')

# 2. Wait for processing (video takes longer than images/audio)
while video_file.state.name == "PROCESSING":
    print("Waiting for video processing...")
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

if video_file.state.name == "FAILED":
    raise RuntimeError("Video processing failed.")

# 3. Reason over the whole video
model = genai.GenerativeModel('gemini-1.5-pro')
prompt = [
    "You are a Safety Auditor. Watch this footage of the warehouse floor.",
    "1. Provide a log of every time a worker enters the 'Hazard Zone' (marked in yellow).",
    "2. Include the timestamp (MM:SS) for each event.",
    "3. Flag any instances where a worker is NOT wearing a helmet.",
    video_file,
]
response = model.generate_content(prompt)
print(response.text)
sequenceDiagram
    participant V as Factory Video (30 mins)
    participant G as Gemini Reasoner
    participant S as Safety Log
    V->>G: Frames 1-1800 (sampled at 1fps)
    V->>G: Continuous audio track
    Note over G: Cross-modal synthesis
    G->>G: Detects red helmet (visual) at 05:10
    G->>G: Detects alarm sound (audio) at 05:12
    G->>S: "Worker entered zone at 05:10; alarm triggered at 05:12"
4. Video Use Cases for Agents
- Autonomous Video Editing: "Find the most exciting 30 seconds of this football match and describe the goals."
- Instructional Support: "Watch this IKEA assembly video. I am stuck at Step 4. What am I doing wrong?" (Requires sending a photo of your current progress alongside the video).
- Surveillance & Security: "Alert me if anyone leaves a package unattended for more than 5 minutes on the sidewalk."
5. Token Management and Costs
Video is the most "expensive" modality because of the number of frames.
- Calculation: Each sampled frame costs roughly 258 tokens.
- A 1-minute video at 1fps = 60 frames * 258 = 15,480 tokens.
- A 1-hour video = 928,800 tokens (nearly half of Gemini 1.5 Pro's 2-million-token context window!). The helper below reproduces this arithmetic.
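A quick sanity check on those numbers; the 258 tokens-per-frame figure is the approximation quoted above.

TOKENS_PER_FRAME = 258  # approximate cost per sampled frame

def estimate_video_tokens(duration_seconds: int, fps: float = 1.0) -> int:
    """Back-of-envelope token estimate for a video input."""
    return int(duration_seconds * fps * TOKENS_PER_FRAME)

print(estimate_video_tokens(60))    # 15480  (1 minute)
print(estimate_video_tokens(3600))  # 928800 (1 hour)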
Strategy: If you are analyzing a long video for a simple fact, it is often cheaper to use Gemini Flash for the initial scan and reserve Gemini Pro for the final deep reasoning, as sketched below.
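One way that two-pass strategy could look. The prompts are illustrative, and video_file is the processed File API handle from Section 3.

import google.generativeai as genai

# Pass 1: cheap scan of the full video with Flash.
flash = genai.GenerativeModel('gemini-1.5-flash')
scan = flash.generate_content(
    ["Log every notable event with an MM:SS timestamp.", video_file]
)

# Pass 2: Pro reasons over Flash's compact text log rather than the raw
# video, so it costs a few hundred tokens instead of ~900k.
pro = genai.GenerativeModel('gemini-1.5-pro')
report = pro.generate_content(
    "From this event log, identify likely root causes and risks:\n" + scan.text
)
print(report.text)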
6. Pro Tip: Prompt Caching for Video
If you are building an agent that needs to answer multiple questions about the same video (e.g., a "Chat with your Video" UI), use Prompt Caching (exposed as "context caching" in the Gemini API); a sketch follows the list below.
- You cache the massive video tokens (900k tokens).
- Every subsequent question from the user only costs a few hundred tokens.
- Result: Lightning-fast, hyper-intelligent video interactions at a 90% discount.
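A minimal sketch using the caching module in the same SDK as above. The pinned model version string and the TTL are assumptions (caching requires an explicit model version), and video_file is the processed File API handle from Section 3.

import datetime

import google.generativeai as genai
from google.generativeai import caching

# Cache the processed video once; the TTL controls how long it stays warm.
cache = caching.CachedContent.create(
    model='models/gemini-1.5-pro-001',
    display_name='chat-with-video',
    contents=[video_file],
    ttl=datetime.timedelta(hours=1),
)

# Every follow-up question reuses the cached video tokens.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("At what timestamp does the alarm first sound?")
print(response.text)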
7. Limitations of Video Agents
- Fine Temporal Detail: High-speed events (like a bird flapping its wings or a camera flash) might be missed between samples.
- Optical Quality: Grainy security footage or dark lighting will reduce accuracy.
- Strict Content Filtering: Gemini has strong guardrails against interpreting videos of people in private or sensitive situations.
8. Summary and Exercises
Video agents have the power of Observational Intelligence.
- Temporal reasoning connects events in time.
- Action recognition identifies verbs and behaviors.
- File API handles the large-scale data ingestion.
- Prompt Caching makes long-video analysis affordable.
Exercises
- Timestamp Retrieval: Upload a short movie trailer to Gemini. Ask: "At what exact second does the title of the movie first appear on screen?"
- Activity Comparison: Send a video of someone making coffee and a video of someone making tea. Ask: "What are the 3 biggest differences in the process shown in these two clips?"
- Causality Logic: Watch a video of a science experiment (e.g., a baking soda volcano). Ask Gemini: "Explain what triggered the reaction and describe the sequence of the eruption in 10-second intervals."
In the next module, we leave the "Inputs" behind and dive into Tool Integration and RAG, learning how to connect our agents to the broader internet and private databases.