
Video and Document Embeddings: Temporal and Spatial Logic
Master the complexity of embedding video sequences and multi-page documents. Learn to handle temporal sequences and spatial layouts in vector search.
While images and audio are single snapshots, video and complex documents (like PDFs with tables) add extra dimensions. A video is a sequence of images (time), and a document is a sequence of blocks (space).
In this lesson, we learn how to compress these complex structures into vectors without losing the critical context of "what happened when" or "what was next to what."
1. Video Embeddings: The Temporal Challenge
You cannot just embed every frame of a video. A 10-minute video at 30 fps would produce 18,000 frame vectors, which is costly to store and noisy to search.
The Strategies:
- Keyframe Extraction: Embed only significant frames (scene changes).
- Temporal Pooling: Use a model like VideoMAE or TimeSformer that analyzes a sequence of frames and produces a single vector representing the "Action" (e.g., "A person opening a door").
- Audio-Visual Fusion: Combine the audio vector (Module 13.2) with the video vector for a more robust "Event" representation.
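The first strategy above, keyframe extraction, can be sketched with simple frame differencing: keep a frame only when it differs enough from the last kept frame. The `threshold` value and the toy frames below are illustrative assumptions, not tuned defaults.

```python
import numpy as np

def extract_keyframes(frames, threshold=30.0):
    """Keep a frame only when it differs enough from the last kept one.

    frames: iterable of np.uint8 arrays of shape (H, W, 3).
    threshold: mean absolute pixel difference that counts as a scene
    change (an assumed value; tune it for real footage).
    """
    keyframes = []
    last = None
    for i, frame in enumerate(frames):
        if last is None:
            keyframes.append((i, frame))  # always keep the first frame
            last = frame
            continue
        diff = np.abs(frame.astype(np.int16) - last.astype(np.int16)).mean()
        if diff > threshold:
            keyframes.append((i, frame))
            last = frame
    return keyframes

# Two identical dark frames, then a bright frame simulating a scene change
dark = np.zeros((4, 4, 3), dtype=np.uint8)
bright = np.full((4, 4, 3), 255, dtype=np.uint8)
kept = extract_keyframes([dark, dark, bright])
print([i for i, _ in kept])  # → [0, 2]
```

Only the kept frames would then be passed to the image embedding model, turning 18,000 candidate frames into a handful of representative vectors.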
2. Document Embeddings: Beyond Plain Text
A PDF is not just a string of words; the layout matters. A number inside a "Total Cost" table row means something different from the same number in a footer.
Models for Document Search:
- LayoutLM: An embedding model that considers the (x, y) coordinates of words, embedding the visual structure of the page alongside its text.
- ColPali: A recent approach that treats document pages as images while still allowing text-based retrieval, preserving tables and charts that plain-text extraction would mangle.
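To make the "(x, y) coordinates" idea concrete, here is a minimal sketch of how layout-aware models are typically fed: each word is paired with its bounding box, normalized to a 0–1000 grid (the coordinate scheme LayoutLM-family models use). The sample words and the 612×792 (US Letter, in points) page size are illustrative assumptions.

```python
def prepare_layout_input(words, page_width, page_height, scale=1000):
    """Pair each word with its bounding box normalized to a 0-1000 grid.

    words: list of (text, (x0, y0, x1, y1)) tuples in page coordinates,
    e.g. from a PDF parser. Returns parallel lists of tokens and boxes.
    """
    tokens, boxes = [], []
    for text, (x0, y0, x1, y1) in words:
        tokens.append(text)
        boxes.append([
            int(scale * x0 / page_width),
            int(scale * y0 / page_height),
            int(scale * x1 / page_width),
            int(scale * y1 / page_height),
        ])
    return tokens, boxes

# A "Total" label and its value, sitting side by side in a table row
tokens, boxes = prepare_layout_input(
    [("Total", (50, 700, 120, 720)), ("$99", (130, 700, 170, 720))],
    page_width=612, page_height=792,
)
print(tokens)  # → ['Total', '$99']
```

Because both boxes share the same y-range, the model can learn that "$99" belongs to the "Total" row rather than to a footer further down the page.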
3. Implementation: Video Search Logic
```python
def process_video_segments(video_path, segment_seconds=5):
    # 1. Chunk the video into smaller fixed-length clips
    clips = chunk_video(video_path, segment_seconds)

    vectors = []
    for clip in clips:
        # 2. Use a model like Video-CLIP to get a single vector per clip
        v = video_clip_model.encode(clip)
        vectors.append(v)

    # Stored in the vector DB with 'video_id' and 'timestamp' metadata
    return vectors
```
4. Metadata Strategy: The "Parent-Child" Relationship
For both video and documents, you must store parent metadata alongside each vector. For example:
- Vector 1: "Action: Person surfing" (Timestamp: 0:10)
- Metadata: `{"parent_video": "surf_vlog.mp4", "start_time": 10, "end_time": 15}`
This allows the user to search for "Surfing" and be taken to the exact timestamp in the source video.
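The retrieval side of this parent-child pattern can be sketched with an in-memory stand-in for a vector DB: rank segment vectors by cosine similarity, then return each hit's parent metadata so the UI can jump straight to the timestamp. The `index` structure and 2-dimensional vectors are illustrative assumptions.

```python
import numpy as np

def find_moments(query_vector, index, top_k=3):
    """Return (score, metadata) pairs for the top_k most similar segments.

    index: list of {"vector": [...], "meta": {...}} dicts; a stand-in
    for a real vector DB collection with payload metadata.
    """
    q = np.asarray(query_vector, dtype=float)
    q = q / np.linalg.norm(q)
    scored = []
    for item in index:
        v = np.asarray(item["vector"], dtype=float)
        score = float(v @ q / np.linalg.norm(v))  # cosine similarity
        scored.append((score, item["meta"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

index = [
    {"vector": [0.9, 0.1],
     "meta": {"parent_video": "surf_vlog.mp4", "start_time": 10, "end_time": 15}},
    {"vector": [0.1, 0.9],
     "meta": {"parent_video": "surf_vlog.mp4", "start_time": 40, "end_time": 45}},
]
hits = find_moments([1.0, 0.0], index, top_k=1)
print(hits[0][1]["start_time"])  # → 10
```

A real system would do the same thing with a vector DB's metadata filter and payload fields, but the shape of the result, a score plus the parent pointer, is identical.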
5. Summary and Key Takeaways
- Dimensionality: Video adds time; documents add layout.
- Pruning is Essential: Don't embed every pixel or word; embed sections or keyframes.
- Layout Models: Use models like LayoutLM for professional document retrieval.
- Link to Source: Metadata must include the exact timestamp or page number for the search results to be useful.
In the next lesson, we’ll see the ultimate payoff: Cross-Modal Retrieval.