
Handling Large Multimodal Assets
Strategies for processing and storing multi-gigabyte files efficiently in a RAG ingestion pipeline.
A typical text file is a few kilobytes. A 4K video can be several gigabytes. Handling these large files in a RAG pipeline requires specialized infrastructure and strategies.
Storage vs. Indexing
Never store large raw files in your vector database. Instead, split the responsibilities (see the sketch after this list):
- Store the Embeddings and Metadata in the vector DB (e.g., Chroma).
- Store the Raw Files in an object store (e.g., AWS S3 or a local NAS).
- Store a Reference (URL or Path) to the raw file in the metadata.
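A minimal sketch of this split, assuming the chromadb client; the bucket, key, and embedding values are placeholders:

import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("multimodal_assets")

video_embedding = [0.0] * 512  # stand-in for a real precomputed embedding

# Only the embedding and a pointer go in the vector DB;
# the multi-gigabyte file itself stays in S3.
collection.add(
    ids=["video-0001"],
    embeddings=[video_embedding],
    metadatas=[{
        "source_uri": "s3://my-bucket/raw/video-0001.mp4",  # placeholder bucket/key
        "media_type": "video/mp4",
    }],
)

At query time, retrieval returns the metadata, and the application fetches the raw asset from the object store only if it is actually needed.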
Streaming Processing
For very large videos or multi-thousand-page PDFs, don't load the whole file into RAM. Use Streaming Readers.
import cv2

def stream_video(video_path):
    """Yield frames one at a time so the full video never sits in RAM."""
    cap = cv2.VideoCapture(video_path)
    try:
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            yield frame  # hand each frame to the caller, then discard it
    finally:
        cap.release()
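One way to consume the generator, assuming roughly 30 fps and a hypothetical embed_frame function, is to sample sparsely so most frames are decoded but never embedded:

for i, frame in enumerate(stream_video("talk.mp4")):
    if i % 150 == 0:  # ~1 frame every 5 seconds at 30 fps
        embed_frame(frame)  # hypothetical downstream embedding call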
Parallel Ingestion
A single thread processing a 1 GB video will be slow. Split the file and process segments in parallel (a PDF sketch follows this list).
- Video: Split into 5-minute segments using FFmpeg.
- PDF: Process pages 1-100 on Worker A, 101-200 on Worker B.
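A minimal sketch of the PDF case using a process pool; ocr_page_range is a hypothetical worker, and the page ranges and worker count are illustrative:

from concurrent.futures import ProcessPoolExecutor

def ocr_page_range(pdf_path, start, end):
    # Hypothetical worker: OCR only pages [start, end) of the PDF.
    ...

page_ranges = [(1, 101), (101, 201)]  # Worker A: pages 1-100, Worker B: 101-200
with ProcessPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(ocr_page_range, "manual.pdf", s, e) for s, e in page_ranges]
    results = [f.result() for f in futures]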
Deduplication at Scale
If you have 10,000 images, many might be duplicates or near-duplicates (see the pHash sketch after this list).
- Hash-based: MD5/SHA256 for exact duplicates.
- Perceptual Hashing (pHash): For "visually similar" images (e.g., same image with different resolution).
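A sketch using the imagehash package (an assumption, along with Pillow); the Hamming-distance threshold of 5 is a tunable guess:

import imagehash
from PIL import Image

seen_hashes = {}
for path in ["img_001.jpg", "img_002.jpg"]:  # placeholder paths
    h = imagehash.phash(Image.open(path))
    # A small Hamming distance between pHashes means "visually similar".
    if any(h - prev <= 5 for prev in seen_hashes.values()):
        continue  # skip near-duplicate
    seen_hashes[path] = h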
Caching Intermediate Outputs
OCR and Audio Transcription are expensive. If the ingestion pipeline fails halfway, you don't want to re-transcribe the same audio file (a caching sketch follows this list).
- Cache the transcript JSON in S3/Redis.
- Use a State Machine (e.g., AWS Step Functions or Airflow) to track progress.
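A minimal sketch of a content-hash-keyed cache; transcribe_audio is hypothetical, and a local directory stands in for S3/Redis:

import hashlib
import json
import pathlib

CACHE_DIR = pathlib.Path("transcript_cache")  # stand-in for S3 or Redis
CACHE_DIR.mkdir(exist_ok=True)

def file_sha256(path, chunk_size=1 << 20):
    # Hash in chunks so a multi-gigabyte file never sits fully in RAM.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            digest.update(block)
    return digest.hexdigest()

def cached_transcribe(audio_path):
    cache_file = CACHE_DIR / f"{file_sha256(audio_path)}.json"
    if cache_file.exists():  # pipeline restarted: reuse the earlier transcript
        return json.loads(cache_file.read_text())
    transcript = transcribe_audio(audio_path)  # hypothetical, expensive call
    cache_file.write_text(json.dumps(transcript))
    return transcript

Keying on the content hash rather than the filename means a renamed or re-uploaded copy of the same audio still hits the cache.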
Exercises
- Calculate the storage required to index 1,000 one-hour videos at 1080p.
- If you only store 1 frame every 5 seconds, how much "visual" storage do you save compared to the raw video?
- What is the RAM footprint of loading a 500MB PDF into memory vs. processing it page-by-page?