
Handling Large Multimodal Assets
Strategies for processing and storing multi-gigabyte files efficiently in a RAG ingestion pipeline.
A typical text file is a few kilobytes. A 4K video can be several gigabytes. Handling these large files in a RAG pipeline requires specialized infrastructure and strategies.
Storage vs. Indexing
Never store large raw files in your vector database. Instead, split the responsibilities (see the sketch after this list):
- Store the Embeddings and Metadata in the vector DB (e.g., Chroma).
- Store the Raw Files in an object store (e.g., AWS S3 or a local NAS).
- Store a Reference (URL or Path) to the raw file in the metadata.
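A minimal sketch of this split, assuming the chromadb client; the bucket, key, and embedding values are placeholders:

import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("multimodal_assets")

video_embedding = [0.0] * 512  # stand-in for a real precomputed embedding

# Only the embedding and a pointer go in the vector DB;
# the multi-gigabyte file itself stays in S3.
collection.add(
    ids=["video-0001"],
    embeddings=[video_embedding],
    metadatas=[{
        "source_uri": "s3://my-bucket/raw/video-0001.mp4",  # placeholder bucket/key
        "media_type": "video/mp4",
    }],
)

At query time, retrieval returns the metadata, and the application fetches the raw asset from the object store only if it is actually needed.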
Streaming Processing
For very large videos or multi-thousand-page PDFs, don't load the whole file into RAM. Use Streaming Readers.
import cv2

def stream_video(video_path):
    """Yield frames one at a time so the full video never sits in RAM."""
    cap = cv2.VideoCapture(video_path)
    try:
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            yield frame  # hand each frame to the caller, then discard it
    finally:
        cap.release()
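One way to consume the generator, assuming roughly 30 fps and a hypothetical embed_frame function, is to sample sparsely so most frames are decoded but never embedded:

for i, frame in enumerate(stream_video("talk.mp4")):
    if i % 150 == 0:  # ~1 frame every 5 seconds at 30 fps
        embed_frame(frame)  # hypothetical downstream embedding call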
Parallel Ingestion
A single thread processing a 1 GB video will be slow. Split the file and process segments in parallel (a PDF sketch follows this list).
- Video: Split into 5-minute segments using FFmpeg.
- PDF: Process pages 1-100 on Worker A, 101-200 on Worker B.
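A minimal sketch of the PDF case using a process pool; ocr_page_range is a hypothetical worker, and the page ranges and worker count are illustrative:

from concurrent.futures import ProcessPoolExecutor

def ocr_page_range(pdf_path, start, end):
    # Hypothetical worker: OCR only pages [start, end) of the PDF.
    ...

page_ranges = [(1, 101), (101, 201)]  # Worker A: pages 1-100, Worker B: 101-200
with ProcessPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(ocr_page_range, "manual.pdf", s, e) for s, e in page_ranges]
    results = [f.result() for f in futures]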
Deduplication at Scale
If you have 10,000 images, many might be duplicates or near-duplicates (see the pHash sketch after this list).
- Hash-based: MD5/SHA256 for exact duplicates.
- Perceptual Hashing (pHash): For "visually similar" images (e.g., same image with different resolution).
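A sketch using the imagehash package (an assumption, along with Pillow); the Hamming-distance threshold of 5 is a tunable guess:

import imagehash
from PIL import Image

seen_hashes = {}
for path in ["img_001.jpg", "img_002.jpg"]:  # placeholder paths
    h = imagehash.phash(Image.open(path))
    # A small Hamming distance between pHashes means "visually similar".
    if any(h - prev <= 5 for prev in seen_hashes.values()):
        continue  # skip near-duplicate
    seen_hashes[path] = h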
Caching Intermediate Outputs
OCR and Audio Transcription are expensive. If the ingestion pipeline fails halfway, you don't want to re-transcribe the same audio file (a caching sketch follows this list).
- Cache the transcript JSON in S3/Redis.
- Use a State Machine (e.g., AWS Step Functions or Airflow) to track progress.
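A minimal sketch of a content-hash-keyed cache; transcribe_audio is hypothetical, and a local directory stands in for S3/Redis:

import hashlib
import json
import pathlib

CACHE_DIR = pathlib.Path("transcript_cache")  # stand-in for S3 or Redis
CACHE_DIR.mkdir(exist_ok=True)

def file_sha256(path, chunk_size=1 << 20):
    # Hash in chunks so a multi-gigabyte file never sits fully in RAM.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            digest.update(block)
    return digest.hexdigest()

def cached_transcribe(audio_path):
    cache_file = CACHE_DIR / f"{file_sha256(audio_path)}.json"
    if cache_file.exists():  # pipeline restarted: reuse the earlier transcript
        return json.loads(cache_file.read_text())
    transcript = transcribe_audio(audio_path)  # hypothetical, expensive call
    cache_file.write_text(json.dumps(transcript))
    return transcript

Keying on the content hash rather than the filename means a renamed or re-uploaded copy of the same audio still hits the cache.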
Exercises
- Calculate the storage required to index 1,000 one-hour videos at 1080p.
- If you only store 1 frame every 5 seconds, how much "visual" storage do you save compared to the raw video?
- What is the RAM footprint of loading a 500MB PDF into memory vs. processing it page-by-page?