
Storing Image and Video Vectors: The Frame-by-Frame Pipeline
Master the ingestion of visual data. Learn how to convert images and long-form video into searchable vectors without overwhelming your infrastructure.
As we learned in the previous lesson (CLIP), you can represent an image as a single vector. But how do you handle a 10-minute video? Or a thousand-page PDF of diagrams? Unlike text, which can be easily chunked by sentences, visual data requires a "Temporal" or "Spatial" strategy.
In this lesson, we will build a production pipeline for visual ingestion. We will explore how to sample frames from a video, how to handle "Scene Change Detection," and how to manage the metadata that links a vector back to a specific timestamp in a video file.
1. The Video Ingestion Pipeline
You cannot generate one vector for a whole movie; it would be "meaningless mush." Instead, we treat video as a Sequence of Images.
The Workflow:
- Sampling: Extract one frame every 1-5 seconds.
- Feature Extraction: Run each frame through a model like CLIP.
- Temporal Chunking: Cluster similar frames together into "Scenes."
- Indexing: Store the scene vectors in your database (Pinecone/Chroma) with a timestamp in the metadata.
graph LR
V[Video File] --> S[Sampler: 1fps]
S --> F1[Frame 1s]
S --> F5[Frame 5s]
F1 --> E[CLIP Encoder]
F5 --> E
E --> V1[Vector 1]
E --> V5[Vector 5]
V1 & V5 --> DB[(Vector DB)]
2. Spatial Chunking (Image Cropping)
A single high-resolution photo can contain multiple objects (a dog, a car, a tree). If you embed the whole photo, the vector represents the "whole scene."
Alternative: Region-based Ingestion
- Use an Object Detector (like YOLO) to identify specific objects.
- Crop those objects into sub-images.
- Embed each object separately.
- Store them in the same collection with a parent_image_id.
Why? This allows a user to search for a "Red car" and find a photo that contains a tiny red car in the background, which a global CLIP vector might have missed.
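Below is a minimal sketch of region-based ingestion, assuming the ultralytics YOLO package for detection and the same clip_model placeholder used later in this lesson; adapt the names to your own stack.

import chromadb
from ultralytics import YOLO
from PIL import Image

detector = YOLO("yolov8n.pt")  # small pretrained COCO detector

def ingest_image_regions(image_path, collection, clip_model):
    image = Image.open(image_path).convert("RGB")
    results = detector(image)[0]  # one Results object per input image

    for i, box in enumerate(results.boxes):
        x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
        label = results.names[int(box.cls[0])]
        crop = image.crop((x1, y1, x2, y2))

        # Embed the crop, not the whole photo
        vector = clip_model.encode(crop)
        collection.add(
            ids=[f"{image_path}_region_{i}"],
            embeddings=[vector.tolist()],
            metadatas=[{
                "parent_image_id": image_path,  # link back to the full photo
                "label": label,
                "bbox": f"{x1},{y1},{x2},{y2}",
            }],
        )

Each crop becomes its own searchable vector, while parent_image_id lets your application resolve every hit back to the original photo.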
3. Dealing with Storage: The "Preview" Pattern
Vector databases are terrible at storing raw pixel data. Pinecone, for example, caps metadata at roughly 40 KB per vector, so a 5 MB JPG will not even fit there, and Base64-encoding images into metadata elsewhere inflates both your bill and your query latency.
The Production Strategy:
- Store the Vector in the Vector DB (Pinecone).
- Store the Image File in a Cloud Storage Bucket (AWS S3).
- Store the S3 URL in the Pinecone metadata.
- (Optional) Store a 64x64 Base64 Thumbnail in the metadata for instant UI display.
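Here is a sketch of that pattern using boto3 and the Pinecone client. The bucket name, index name, and clip_model helper are placeholders; only the vector, the S3 URL, and a tiny thumbnail go into the index.

import base64
import io

import boto3
from PIL import Image
from pinecone import Pinecone

s3 = boto3.client("s3")
index = Pinecone(api_key="YOUR_KEY").Index("image-index")

def ingest_image(image_path, image_id, clip_model, bucket="my-image-bucket"):
    # 1. Upload the full-resolution file to object storage
    key = f"images/{image_id}.jpg"
    s3.upload_file(image_path, bucket, key)

    # 2. Build a 64x64 Base64 thumbnail for instant UI previews
    img = Image.open(image_path).convert("RGB")
    thumb = img.copy()
    thumb.thumbnail((64, 64))
    buf = io.BytesIO()
    thumb.save(buf, format="JPEG", quality=60)
    thumb_b64 = base64.b64encode(buf.getvalue()).decode()

    # 3. Upsert only the vector plus lightweight metadata
    vector = clip_model.encode(img)
    index.upsert(vectors=[{
        "id": image_id,
        "values": vector.tolist(),
        "metadata": {
            "s3_url": f"s3://{bucket}/{key}",
            "thumbnail_b64": thumb_b64,
        },
    }])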
4. Scene Change Detection: Smarter Sampling
Sampling every 1 second is wasteful if the camera is still for 5 minutes. To save money, we use Scene Change Detection.
We only generate a new vector when the visual content of the frame changes significantly from the previous one. This can reduce your vector count by 90% while maintaining the same search quality for videos.
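A simple way to approximate this is to compare colour histograms between consecutive frames, as in the sketch below; dedicated tools such as PySceneDetect offer more robust detectors, and the threshold here is an assumption you would tune per camera.

import cv2

def is_scene_change(prev_frame, frame, threshold=0.6):
    """Return True when this frame differs enough from the previous one."""
    def hist(f):
        hsv = cv2.cvtColor(f, cv2.COLOR_BGR2HSV)
        h = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        return cv2.normalize(h, h).flatten()

    # Correlation is 1.0 for identical frames; low values indicate a cut
    similarity = cv2.compareHist(hist(prev_frame), hist(frame), cv2.HISTCMP_CORREL)
    return similarity < threshold

Inside the sampling loop from the next section, you would only embed a frame when is_scene_change(...) returns True.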
5. Python Example: Video Frame Ingestion with OpenCV
Here is how you can build a basic sampler in Python.
import cv2
import chromadb
from PIL import Image

# 1. Setup Chroma
client = chromadb.Client()
collection = client.get_or_create_collection("video_index")

def ingest_video(video_path):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_count = 0

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        # Sample one frame every second
        if frame_count % int(fps) == 0:
            timestamp = frame_count / fps

            # Convert OpenCV (BGR) to PIL (RGB) for CLIP
            img_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            pil_img = Image.fromarray(img_rgb)

            # 2. Generate Vector (abstracting the CLIP call)
            # vector = clip_model.encode(pil_img)

            # 3. Add to Database. In production you would pass the CLIP vector
            # via embeddings=[vector.tolist()]; here the placeholder document
            # lets Chroma's default embedder keep the example runnable.
            collection.add(
                ids=[f"{video_path}_{timestamp}"],
                documents=[f"Frame at {timestamp}s"],  # Optional
                metadatas=[{
                    "src": video_path,
                    "timestamp": timestamp,
                    "type": "video_frame",
                }],
            )

        frame_count += 1

    cap.release()

ingest_video("vacation_video.mp4")
6. Multi-Modal Metadata Design
When storing visual vectors, your metadata needs to be more complex than text:
- resolution: [1920, 1080]
- frame_type: "keyframe" or "sample"
- detected_objects: ["dog", "human"] (if using YOLO)
- color_palette: ["#FF0000", "#FFFFFF"]
This allows you to filter your vector search by logic: "Find a video of a dog [Vector] that was filmed in 4K [Metadata Filter]."
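As a sketch, such a query against the Chroma collection built earlier could look like this; the clip_model helper and the "height" metadata field are assumptions. The vector carries the semantic "dog" query, while the filter enforces the "4K" constraint.

def search_4k_dogs(collection, clip_model):
    # CLIP maps text and images into the same space, so a text query works
    query_vector = clip_model.encode("a dog")
    results = collection.query(
        query_embeddings=[query_vector.tolist()],
        n_results=5,
        where={"height": 2160},  # only frames tagged as 4K at ingestion time
    )
    for frame_id, meta in zip(results["ids"][0], results["metadatas"][0]):
        print(frame_id, meta["timestamp"])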
Summary and Key Takeaways
Visual ingestion is about managing the Sampling Trade-off.
- Don't embed every frame: Use 1fps sampling or Scene Change Detection.
- Spatial Chunking: Use object detection to index specific details in a large image.
- Reference, Don't Store: Keep the image files on S3 and store only URL references in your Vector DB's metadata.
- Timestamps are Critical: Without them, your video search is just a "Movie Search," not a "Scene Search."
In the next lesson, we will look at Text-to-Image and Image-to-Image search, exploring the user-facing side of these visual indices.
Exercise: Video Ingestion Strategy
- You are building a "CCTV Search" for a shopping mall.
- You have 100 cameras running 24/7.
- If you sample 1 frame per second, how many vectors will you generate per day?
- How much will that cost in Pinecone if each vector is 512D?
- How would you use Motion Detection to reduce the number of vectors you store? (Hint: Should you ingest footage where nothing is moving?)