Storing Image and Video Vectors: The Frame-by-Frame Pipeline

Master the ingestion of visual data. Learn how to convert images and long-form video into searchable vectors without overwhelming your infrastructure.

As we learned in the previous lesson (CLIP), you can represent an image as a single vector. But how do you handle a 10-minute video? Or a thousand-page PDF of diagrams? Unlike text, which can be easily chunked by sentences, visual data requires a "Temporal" or "Spatial" strategy.

In this lesson, we will build a production pipeline for visual ingestion. We will explore how to sample frames from a video, how to handle "Scene Change Detection," and how to manage the metadata that links a vector back to a specific timestamp in a video file.


1. The Video Ingestion Pipeline

You cannot generate one meaningful vector for a whole movie; averaging thousands of distinct scenes produces meaningless mush. Instead, we treat video as a Sequence of Images.

The Workflow:

  1. Sampling: Extract one frame every 1-5 seconds.
  2. Feature Extraction: Run each frame through a model like CLIP.
  3. Temporal Chunking: Cluster similar frames together into "Scenes."
  4. Indexing: Store the scene vectors in your database (Pinecone/Chroma) with a timestamp in the metadata.

graph LR
    V[Video File] --> S[Sampler: 1fps]
    S --> F1[Frame 1s]
    S --> F5[Frame 5s]
    F1 --> E[CLIP Encoder]
    F5 --> E
    E --> V1[Vector 1]
    E --> V5[Vector 5]
    V1 & V5 --> DB[(Vector DB)]

2. Spatial Chunking (Image Cropping)

A single high-resolution photo can contain multiple objects (a dog, a car, a tree). If you embed the whole photo, the vector represents the "whole scene."

Alternative: Region-based Ingestion

  1. Use an Object Detector (like YOLO) to identify specific objects.
  2. Crop those objects into sub-images.
  3. Embed each object separately.
  4. Store them in the same collection with a parent_image_id.

Why? This allows a user to search for a "Red car" and find a photo that contains a tiny red car in the background, which a global CLIP vector might have missed.
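
Below is a minimal sketch of region-based ingestion. It assumes the Ultralytics YOLO package for detection and uses clip_model.encode() as a stand-in for whichever CLIP wrapper you use; the Chroma collection follows the same pattern as the video example later in this lesson.

import chromadb
from PIL import Image
from ultralytics import YOLO  # assumes the `ultralytics` package is installed

client = chromadb.Client()
collection = client.get_or_create_collection("image_regions")
detector = YOLO("yolov8n.pt")  # small pretrained detector

def ingest_image_regions(image_path, clip_model):
    """Detect objects, crop them, and index each crop with a parent_image_id."""
    image = Image.open(image_path).convert("RGB")
    result = detector(image)[0]  # one result object per input image

    for i, box in enumerate(result.boxes):
        x1, y1, x2, y2 = box.xyxy[0].tolist()   # bounding box in pixels
        label = result.names[int(box.cls[0])]   # e.g. "car", "dog", "person"
        crop = image.crop((int(x1), int(y1), int(x2), int(y2)))

        # clip_model.encode is a placeholder for your CLIP wrapper
        vector = clip_model.encode(crop)

        collection.add(
            ids=[f"{image_path}_region_{i}"],
            embeddings=[vector.tolist()],  # assuming a NumPy array or tensor
            metadatas=[{
                "parent_image_id": image_path,
                "label": label,
                # Chroma metadata values must be scalars, so store the box as a string
                "bbox": f"{x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f}",
            }]
        )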


3. Dealing with Storage: The "Preview" Pattern

Vector databases are terrible at storing raw pixel data. Most enforce tight per-vector metadata limits (Pinecone's is on the order of 40 KB), so a 5MB JPG will not even fit in the metadata, and pushing image bytes through the vector layer quickly inflates your bill.

The Production Strategy:

  1. Store the Vector in the Vector DB (Pinecone).
  2. Store the Image File in a Cloud Storage Bucket (AWS S3).
  3. Store the S3 URL in the Pinecone metadata.
  4. (Optional) Store a 64x64 Base64 Thumbnail in the metadata for instant UI display.
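
Here is a minimal sketch of that strategy, assuming boto3, an existing S3 bucket, and the modern Pinecone client; the bucket name, index name, API key, and the vector argument are all placeholders.

import base64
import io
import boto3
from PIL import Image
from pinecone import Pinecone

s3 = boto3.client("s3")
pc = Pinecone(api_key="YOUR_API_KEY")   # placeholder credentials
index = pc.Index("visual-search")       # placeholder index name

def index_image(image_path, image_id, vector, bucket="my-image-bucket"):
    # 1. Upload the full-resolution file to cheap blob storage
    s3.upload_file(image_path, bucket, image_id)
    s3_url = f"s3://{bucket}/{image_id}"

    # 2. Build a tiny Base64 thumbnail for instant UI previews
    img = Image.open(image_path).convert("RGB")
    img.thumbnail((64, 64))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=60)
    thumb_b64 = base64.b64encode(buf.getvalue()).decode("ascii")

    # 3. Store only the vector plus lightweight metadata in the vector DB
    index.upsert(vectors=[{
        "id": image_id,
        "values": vector,
        "metadata": {"s3_url": s3_url, "thumbnail_b64": thumb_b64},
    }])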

4. Scene Change Detection: Smarter Sampling

Sampling every 1 second is wasteful if the camera is still for 5 minutes. To save money, we use Scene Change Detection.

We only generate a new vector when the visual content of a frame changes significantly from the previously embedded one. For footage with long static stretches, this can cut your vector count by 90% or more while maintaining the same search quality.
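
Here is a minimal sketch of scene change detection using a simple color-histogram comparison; the 0.9 similarity threshold is an arbitrary starting point that you would tune per camera or content type.

import cv2

def is_scene_change(prev_frame, curr_frame, threshold=0.9):
    """Return True when the current frame differs enough to deserve a new vector."""
    hists = []
    for frame in (prev_frame, curr_frame):
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        hists.append(hist)

    # Correlation of 1.0 means visually identical; lower means more change
    similarity = cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_CORREL)
    return similarity < threshold

In the ingestion loop below, you would call this on every sampled frame and only run the CLIP encoder (and write a vector) when it returns True.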


5. Python Example: Video Frame Ingestion with OpenCV

Here is how you can build a basic sampler in Python.

import cv2
import chromadb
from PIL import Image

# 1. Setup Chroma
client = chromadb.Client()
collection = client.get_or_create_collection("video_index")

def ingest_video(video_path):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if FPS metadata is missing
    frame_interval = max(int(fps), 1)         # roughly one frame per second
    frame_count = 0
    
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
            
        # Sample one frame every second
        if frame_count % frame_interval == 0:
            timestamp = frame_count / fps
            
            # Convert OpenCV (BGR) to PIL (RGB) for CLIP
            img_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            pil_img = Image.fromarray(img_rgb)
            
            # 2. Generate Vector (abstracting the CLIP call)
            # vector = clip_model.encode(pil_img)
            
            # 3. Add to Database
            # Without explicit embeddings, Chroma embeds the `documents` text with its
            # default model; in production, pass embeddings=[vector.tolist()] as well.
            collection.add(
                ids=[f"{video_path}_{timestamp:.1f}"],
                documents=[f"Frame at {timestamp:.1f}s"],  # Optional
                metadatas=[{
                    "src": video_path,
                    "timestamp": timestamp,
                    "type": "video_frame"
                }]
            )
            
        frame_count += 1
    cap.release()

ingest_video("vacation_video.mp4")
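
Once frames are indexed, retrieving the matching timestamps is a single query. A sketch against the video_index collection above (Chroma embeds the query text with its default model unless you pass a CLIP-encoded query vector instead):

# Text query against the indexed frames
results = collection.query(
    query_texts=["a person walking a dog on the beach"],
    n_results=5,
    where={"type": "video_frame"},   # metadata filter
)

# Map each hit back to its source video and timestamp
for frame_id, meta in zip(results["ids"][0], results["metadatas"][0]):
    print(f"{meta['src']} @ {meta['timestamp']:.1f}s  ({frame_id})")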

6. Multi-Modal Metadata Design

When storing visual vectors, your metadata needs to be richer than it is for text:

  • resolution: [1920, 1080]
  • frame_type: "keyframe" or "sample"
  • detected_objects: ["dog", "human"] (if using YOLO)
  • color_palette: ["#FF0000", "#FFFFFF"]

This allows you to combine semantic search with hard filters: "Find a video of a dog [Vector] that was filmed in 4K [Metadata Filter]."
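
As a sketch: Chroma metadata values must be scalars, so a [1920, 1080] resolution would typically be flattened into separate width and height keys, and the combined "dog filmed in 4K" query would look like this (assuming those keys were added at ingestion time):

# Semantic match + hard metadata filter
results = collection.query(
    query_texts=["a dog running"],
    n_results=10,
    where={"$and": [
        {"type": {"$eq": "video_frame"}},
        {"width": {"$gte": 3840}},   # 4K UHD or wider
    ]},
)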


Summary and Key Takeaways

Visual ingestion is about managing the Sampling Trade-off.

  1. Don't embed every frame: Use 1fps sampling or Scene Change Detection.
  2. Spatial Chunking: Use object detection to index specific details in a large image.
  3. Reference, Don't Store: Keep image files in S3 and store only their URLs (plus an optional thumbnail) in your Vector DB metadata.
  4. Timestamps are Critical: Without them, your video search is just a "Movie Search," not a "Scene Search."

In the next lesson, we will look at Text-to-Image and Image-to-Image search, exploring the user-facing side of these visual indices.


Exercise: Video Ingestion Strategy

  1. You are building a "CCTV Search" for a shopping mall.
  2. You have 100 cameras running 24/7.
  • If you sample 1 frame per second, how many vectors will you generate per day?
  • How much will that cost in Pinecone if each vector is 512D?
  • How would you use Motion Detection to reduce the number of vectors you store? (Hint: Should you ingest footage where nothing is moving?)

Congratulations! You are now an AI Video Infrastructure engineer.
