
Video Preprocessing and Scene Segmentation
Learn how to break video files into meaningful scenes and keyframes for efficient indexing.
Video is the most complex modality in RAG because it combines spatial (visual), temporal (motion), and audio data. To index video, we must first "deconstruct" it.
Keyframe Extraction
We cannot index every single frame (typically 24-60 frames per second). Instead, we extract Keyframes—representative images that capture a significant change.
```python
import cv2

def extract_frames(video_path, gap_seconds=5):
    cap = cv2.VideoCapture(video_path)
    fps = int(cap.get(cv2.CAP_PROP_FPS))  # frames per second, rounded to an int for the modulo below
    count = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        # Save one frame every `gap_seconds` seconds of video
        if count % (fps * gap_seconds) == 0:
            cv2.imwrite(f"frame_{count}.jpg", frame)
        count += 1
    cap.release()
```
Scene Segmentation
A Scene is a continuous sequence of shots that share a setting or topic. Breaking video into scenes helps keep chunks semantically meaningful.
Libraries like PySceneDetect can automatically detect "cuts" or "fades" in a video.
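Under the hood, content-based detectors like PySceneDetect's `ContentDetector` compare consecutive frames and flag a cut when the change exceeds a threshold. A minimal sketch of that idea, using mean absolute pixel difference (the threshold value here is an illustrative assumption, not PySceneDetect's default):

```python
import numpy as np

def find_cuts(frames, threshold=40.0):
    """Return frame indices where a hard cut likely occurs.

    frames: list of HxWx3 uint8 arrays. threshold is an assumed
    mean-absolute-difference value for this sketch.
    """
    cuts = []
    for i in range(1, len(frames)):
        # Mean absolute pixel difference between consecutive frames
        diff = np.abs(frames[i].astype(np.int16) - frames[i - 1].astype(np.int16)).mean()
        if diff > threshold:
            cuts.append(i)
    return cuts

# Two static "shots": three dark frames, then three bright frames
shot_a = [np.zeros((4, 4, 3), dtype=np.uint8)] * 3
shot_b = [np.full((4, 4, 3), 200, dtype=np.uint8)] * 3
print(find_cuts(shot_a + shot_b))  # → [3]  (the cut sits at the shot boundary)
```

Real detectors are more robust (they work in HSV space, handle fades, and enforce a minimum scene length), but the core signal is the same frame-to-frame difference.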
Combining Visual and Audio Chunks
A video chunk for RAG usually consists of:
- The Segmented Audio Transcript for that time period.
- Representative Keyframes from that time period.
- Motion Metadata (e.g., "fast-paced action" vs. "static talking head").
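The three components above can travel together as one retrieval unit. A minimal sketch of such a structure (the field names and `to_index_text` helper are illustrative, not a standard schema):

```python
from dataclasses import dataclass, field

@dataclass
class VideoChunk:
    start_sec: float                  # chunk start within the video
    end_sec: float                    # chunk end within the video
    transcript: str                   # segmented audio transcript for this span
    keyframe_paths: list[str] = field(default_factory=list)  # representative frames
    motion_label: str = "static"      # e.g. "fast-paced action" vs "static talking head"

    def to_index_text(self) -> str:
        """Flatten the chunk into text that can be embedded for retrieval."""
        return f"[{self.start_sec:.0f}s-{self.end_sec:.0f}s, {self.motion_label}] {self.transcript}"

chunk = VideoChunk(0.0, 12.5, "Welcome to the lecture on video RAG.",
                   ["frame_0.jpg", "frame_125.jpg"], "static talking head")
print(chunk.to_index_text())
```

At query time, the flattened text is embedded and retrieved like any other chunk, while the keyframe paths let the application surface the matching visuals alongside the transcript.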
Reducing Dimensionality
Video files are massive. Preprocessing often involves:
- Downsampling: Reducing resolution from 4K to 720p or 480p.
- Cropping: Removing black bars (letterboxing).
- Temporal Slicing: Indexing only the relevant portion, e.g., the first 10 minutes of a lecture if that is all that's needed.
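The first two steps can be sketched with plain array operations (NumPy only, so the example stays self-contained; in practice you would use `cv2.resize` and FFmpeg's crop filter, and the brightness cutoff for detecting black bars is an assumed value):

```python
import numpy as np

def remove_letterbox(frame, dark_thresh=10):
    """Trim near-black rows from the top and bottom (letterbox bars)."""
    row_brightness = frame.mean(axis=(1, 2))            # mean brightness per row
    content = np.where(row_brightness > dark_thresh)[0]
    if len(content) == 0:
        return frame
    return frame[content[0]:content[-1] + 1]

def downsample(frame, target_h=720):
    """Naive strided downsample toward target height (cv2.resize in practice)."""
    h = frame.shape[0]
    if h <= target_h:
        return frame
    step = h // target_h  # integer stride keeps this dependency-free
    return frame[::step, ::step]

# Simulated 2160p frame with 280-pixel black bars top and bottom
frame = np.full((2160, 3840, 3), 128, dtype=np.uint8)
frame[:280] = 0
frame[-280:] = 0
cropped = remove_letterbox(frame)   # (1600, 3840, 3)
small = downsample(cropped)         # (800, 1920, 3)
print(cropped.shape, small.shape)
```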
Tools for Video Preprocessing
- FFmpeg: The Swiss Army knife for audio/video manipulation.
- OpenCV: For frame analysis and edge detection.
- VideoLLMs: (e.g., Video-LLaVA) for summarizing what happened in a clip.
Exercises
- Use FFmpeg to extract the audio from a short video clip.
- Use OpenCV to extract a frame every 1 second.
- Observe how much disk space is saved by keeping only one frame per second vs. the full video.