
Video Preprocessing and Scene Segmentation
Learn how to break video files into meaningful scenes and keyframes for efficient indexing.
Video is the most complex modality in RAG because it combines spatial (visual), temporal (motion), and audio data. To index video, we must first "deconstruct" it.
Keyframe Extraction
We cannot index every single frame (typically 24-60 frames per second). Instead, we extract Keyframes—representative images that capture a significant change.
```python
import cv2

def extract_frames(video_path, gap_seconds=5):
    cap = cv2.VideoCapture(video_path)
    fps = int(cap.get(cv2.CAP_PROP_FPS))  # frames per second, rounded to an int for the modulo below
    count = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        # Save one frame every `gap_seconds` seconds of video
        if count % (fps * gap_seconds) == 0:
            cv2.imwrite(f"frame_{count}.jpg", frame)
        count += 1
    cap.release()
```
Scene Segmentation
A Scene is a continuous sequence of shots that share a setting or topic. Breaking video into scenes helps keep chunks semantically meaningful.
Libraries like PySceneDetect can automatically detect "cuts" or "fades" in a video.
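Under the hood, content-based detectors like PySceneDetect's `ContentDetector` compare consecutive frames and flag a cut when the change exceeds a threshold. A minimal sketch of that idea, using mean absolute pixel difference (the threshold value here is an illustrative assumption, not PySceneDetect's default):

```python
import numpy as np

def find_cuts(frames, threshold=40.0):
    """Return frame indices where a hard cut likely occurs.

    frames: list of HxWx3 uint8 arrays. threshold is an assumed
    mean-absolute-difference value for this sketch.
    """
    cuts = []
    for i in range(1, len(frames)):
        # Mean absolute pixel difference between consecutive frames
        diff = np.abs(frames[i].astype(np.int16) - frames[i - 1].astype(np.int16)).mean()
        if diff > threshold:
            cuts.append(i)
    return cuts

# Two static "shots": three dark frames, then three bright frames
shot_a = [np.zeros((4, 4, 3), dtype=np.uint8)] * 3
shot_b = [np.full((4, 4, 3), 200, dtype=np.uint8)] * 3
print(find_cuts(shot_a + shot_b))  # → [3]  (the cut sits at the shot boundary)
```

Real detectors are more robust (they work in HSV space, handle fades, and enforce a minimum scene length), but the core signal is the same frame-to-frame difference.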
Combining Visual and Audio Chunks
A video chunk for RAG usually consists of:
- The Segmented Audio Transcript for that time period.
- Representative Keyframes from that time period.
- Motion Metadata (e.g., "fast-paced action" vs. "static talking head").
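The three components above can travel together as one retrieval unit. A minimal sketch of such a structure (the field names and `to_index_text` helper are illustrative, not a standard schema):

```python
from dataclasses import dataclass, field

@dataclass
class VideoChunk:
    start_sec: float                  # chunk start within the video
    end_sec: float                    # chunk end within the video
    transcript: str                   # segmented audio transcript for this span
    keyframe_paths: list[str] = field(default_factory=list)  # representative frames
    motion_label: str = "static"      # e.g. "fast-paced action" vs "static talking head"

    def to_index_text(self) -> str:
        """Flatten the chunk into text that can be embedded for retrieval."""
        return f"[{self.start_sec:.0f}s-{self.end_sec:.0f}s, {self.motion_label}] {self.transcript}"

chunk = VideoChunk(0.0, 12.5, "Welcome to the lecture on video RAG.",
                   ["frame_0.jpg", "frame_125.jpg"], "static talking head")
print(chunk.to_index_text())
```

At query time, the flattened text is embedded and retrieved like any other chunk, while the keyframe paths let the application surface the matching visuals alongside the transcript.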
Reducing Dimensionality
Video files are massive. Preprocessing often involves:
- Downsampling: Reducing resolution from 4K to 720p or 480p.
- Cropping: Removing black bars (letterboxing).
- Temporal Slicing: Indexing only the relevant portion, e.g., the first 10 minutes of a lecture if that is all that's needed.
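The first two steps can be sketched with plain array operations (NumPy only, so the example stays self-contained; in practice you would use `cv2.resize` and FFmpeg's crop filter, and the brightness cutoff for detecting black bars is an assumed value):

```python
import numpy as np

def remove_letterbox(frame, dark_thresh=10):
    """Trim near-black rows from the top and bottom (letterbox bars)."""
    row_brightness = frame.mean(axis=(1, 2))            # mean brightness per row
    content = np.where(row_brightness > dark_thresh)[0]
    if len(content) == 0:
        return frame
    return frame[content[0]:content[-1] + 1]

def downsample(frame, target_h=720):
    """Naive strided downsample toward target height (cv2.resize in practice)."""
    h = frame.shape[0]
    if h <= target_h:
        return frame
    step = h // target_h  # integer stride keeps this dependency-free
    return frame[::step, ::step]

# Simulated 2160p frame with 280-pixel black bars top and bottom
frame = np.full((2160, 3840, 3), 128, dtype=np.uint8)
frame[:280] = 0
frame[-280:] = 0
cropped = remove_letterbox(frame)   # (1600, 3840, 3)
small = downsample(cropped)         # (800, 1920, 3)
print(cropped.shape, small.shape)
```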
Tools for Video Preprocessing
- FFmpeg: The Swiss Army knife for audio/video manipulation.
- OpenCV: For frame analysis and edge detection.
- VideoLLMs: (e.g., Video-LLaVA) for summarizing what happened in a clip.
Exercises
- Use FFmpeg to extract the audio from a short video clip.
- Use OpenCV to extract a frame every 1 second.
- Observe how much disk space is saved by keeping only one frame per second vs. the full video.