
Moving Pictures: Video Analysis and Interaction
Master the temporal dimension of AI. Learn how to build agents that understand events over time, detect anomalies in security footage, and summarize long-form video content.
Video is the most information-dense medium an agent can consume. It is a sequence of images (the Spatial dimension) combined with a sequence of events (the Temporal dimension). For an agent to understand video, it is not enough to "See" a single frame; it must also "Remember" what happened five seconds ago to understand the context.
In this lesson, we will learn how to build Video-Aware agents—from high-level summarization to real-time anomaly detection.
1. The Challenge: Temporal Reasoning
If a model sees a frame of a "Falling Glass," it doesn't know if the glass is about to break or if it has already fallen. It needs to see the sequence.
Pattern: The "Temporal Window"
Instead of sending 1 image, we send a Summary Buffer.
- Frame 1: Glass on table.
- Frame 5: Hand touches glass.
- Frame 10: Glass in mid-air.
By grouping these three frames into a single "Vision Request," the model can deduce Action and Intent.
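As a minimal sketch, the three frames above can be packed into one multimodal request. The message layout below mimics the OpenAI-style "content parts" schema; the exact field names depend on your provider, so treat the structure as an assumption, not a fixed API.

```python
import base64

def build_temporal_request(frames, question):
    """Pack several timestamped frames into a single vision request.

    frames: list of (timestamp_seconds, jpeg_bytes) tuples.
    Returns a messages list in an OpenAI-style "content parts" layout
    (an assumption here -- adapt to your provider's schema).
    """
    content = [{"type": "text", "text": question}]
    for ts, jpeg in frames:
        # Label each frame so the model can reason about ordering
        content.append({"type": "text", "text": f"Frame at t={ts}s:"})
        content.append({
            "type": "image_url",
            "image_url": {
                "url": "data:image/jpeg;base64,"
                       + base64.b64encode(jpeg).decode()
            },
        })
    return [{"role": "user", "content": content}]
```

Because all frames arrive in one request, the model sees the sequence "glass on table, hand touches glass, glass in mid-air" as a single narrative rather than three unrelated snapshots.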
2. Technical Implementation: "Keyframe Sampling"
Sending 30 frames per second (FPS) to an LLM like Gemini 1.5 Pro is impractical and unnecessary. Instead, we use Keyframe Sampling.
The Sampling Strategy
- Low Motion: Capture 1 frame every 5 seconds.
- High Motion (Detected via OpenCV): Capture 5 frames per second.
- The Buffer: Only send the last 10 "Important" frames to the LLM.
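The strategy above can be sketched with simple frame differencing standing in for "motion detection" (in production you would use OpenCV's background subtractors; the thresholds and strides below are illustrative assumptions, not tuned values).

```python
import numpy as np

def select_keyframes(frames, motion_threshold=12.0, buffer_size=10):
    """Pick 'important' frame indices from 30 FPS grayscale frames.

    Motion is estimated as the mean absolute pixel difference between
    consecutive frames (a crude stand-in for a real motion detector).
    High motion -> sample every 6th frame (~5 FPS);
    low motion  -> sample every 150th frame (1 frame per 5 s).
    Only the last `buffer_size` indices are kept for the LLM.
    """
    keep, prev = [], None
    for i, frame in enumerate(frames):
        if prev is None:
            motion = 0.0
        else:
            motion = float(np.mean(np.abs(
                frame.astype(np.int16) - prev.astype(np.int16))))
        stride = 6 if motion > motion_threshold else 150
        if i % stride == 0:
            keep.append(i)
        prev = frame
    return keep[-buffer_size:]
```

On a static scene this yields a handful of frames per minute; on a busy scene it fills the 10-frame buffer almost immediately, which is exactly the behavior the sampling strategy calls for.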
3. Tool: Gemini 1.5 Pro and Large Context Video
Gemini 1.5 Pro changed what is possible for video agents because of its 1M+ token context window. You can upload an hour-long video file, and the model ingests sampled frames directly as tokens.
- Usage: "Find the exact minute when the person in the red hat entered the room."
- Response: "00:45:12. They entered through the side door."
4. Video-as-a-Tool (Surveillance Agency)
A production agent can act as a Virtual Security Guard.
- Node 1 (Fast Detector): A local YOLO (You Only Look Once) model scans a video stream for "Persons." (Low cost, low intelligence.)
- Trigger: If a person is detected, the agent is "Woken up."
- Node 2 (High Intelligence): The Agent (Claude 3.5 / GPT-4o) analyzes the video: "Is this person authorized to be here at 3 AM?"
- Action: If not, call sound_alarm() or notify_police().
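The two-tier gating logic can be sketched as a single step function. The detector and judge are injected callables here (assumptions standing in for a local YOLO model and an LLM call), which keeps the control flow visible: the expensive model only runs when the cheap one fires.

```python
def surveillance_step(frame, fast_detector, llm_judge, hour):
    """One tick of the Virtual Security Guard pipeline.

    fast_detector(frame) -> bool        # cheap local model (e.g. YOLO)
    llm_judge(frame, hour) -> str       # "authorized" / "unauthorized"
    Both callables are placeholders for the real components.
    """
    if not fast_detector(frame):
        return "idle"            # Node 1 found nothing; LLM stays asleep
    verdict = llm_judge(frame, hour)   # Node 2: woken up on detection
    if verdict == "unauthorized":
        return "alarm"           # here you would call sound_alarm()
    return "logged"              # person present but authorized
```

The design choice worth noting: the fast detector runs on every frame, so its cost dominates; the LLM's per-call cost is amortized over the (hopefully rare) detections.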
5. Privacy and Ethical Video Scrubbing
Video is highly sensitive. If you are analyzing a video of an office, you might capture computer screens with passwords or private faces.
The "Blur-on-Ingest" Pattern
Before the video hits the AI:
- Use local models to detect faces and on-screen text (e.g., MediaPipe for face detection, an OCR library for text).
- Apply a Gaussian blur.
- Only then send the "Anonymized" video to the reasoning engine.
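A minimal sketch of the scrubbing step, assuming the bounding boxes have already been produced by a local detector such as MediaPipe. Crude pixelation (downscale, then upscale) stands in for a Gaussian blur so the example stays dependency-free; with OpenCV available you would use cv2.GaussianBlur on each region instead.

```python
import numpy as np

def scrub_regions(frame, boxes, k=8):
    """Anonymize detected regions before the frame leaves the machine.

    frame: 2-D grayscale np.uint8 array.
    boxes: list of (x, y, w, h) rectangles from a local face/text
           detector (assumed to exist upstream -- not implemented here).
    k:     pixelation block size; larger destroys more detail.
    """
    out = frame.copy()
    for x, y, w, h in boxes:
        region = out[y:y + h, x:x + w]
        small = region[::k, ::k]          # downscale by striding
        big = np.repeat(np.repeat(small, k, axis=0), k, axis=1)
        out[y:y + h, x:x + w] = big[:h, :w]  # upscale back into place
    return out
```

Everything outside the boxes is untouched, so the reasoning engine still gets full scene context, just without identifiable faces or readable screens.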
6. Implementation Example: Extracting Frames with OpenCV
import base64
import cv2

def to_base64(image):
    # Encode the BGR frame as JPEG, then base64 for transport to the LLM
    ok, buffer = cv2.imencode(".jpg", image)
    return base64.b64encode(buffer).decode() if ok else None

def extract_key_frames(video_path, every_n_seconds=2):
    vidcap = cv2.VideoCapture(video_path)
    fps = vidcap.get(cv2.CAP_PROP_FPS) or 30  # some containers report 0
    step = max(1, int(fps * every_n_seconds))  # frames between samples
    frames = []
    count = 0
    while True:
        success, image = vidcap.read()
        if not success:
            break
        if count % step == 0:
            frames.append(to_base64(image))
        count += 1
    vidcap.release()
    return frames
Summary and Mental Model
Think of Video Analysis like Flipping through a Book.
- If you read every single page, it takes too long (30 FPS).
- If you look at the Pictures on every 10th page, you can understand the whole story in 1 minute.
The "Art" of video agency is knowing which pages to flip to.
Exercise: Video Reasoning
- Anomaly Detection: You are building an agent to monitor a Parking Lot.
- What are the "Key Frames" you would send to the model to detect a "Car Accident"?
- (Hint: Do you need the 5 minutes of footage before the crash?)
- Compression: A 1-minute video is 100MB.
- How do you reduce it to a size that an LLM can accept?
- (Hint: Focus on "Downscaling" and "Frame Rate" reduction).
- Multi-Modal: How would you combine Video and Audio for an agent that is "Auditing a Sales Call"?
- Which is more important for detecting "User Frustration": the words they say, or the expression on their face?
Ready to put it all together? Next lesson: Combining Modalities in a Single Graph.