
Chunking Transcripts and Videos
Strategies for breaking down temporal data into semantically cohesive and searchable units.
Transcripts are inherently linear and temporal. Unlike a book, which has chapters and subheaders, a transcript is often just a long stream of text with time markers. Chunking these effectively requires balancing the "Time" context with the "Topic" context.
Temporal Chunking
The simplest method is to chunk by a fixed duration (e.g., every 60 seconds). This works well for UI (jumping to a timestamp) but poorly for semantics: a sentence that starts at 0:59 and ends at 1:01 gets split across two chunks.
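A minimal sketch of fixed-duration chunking, assuming (as in the examples below) that transcript segments are dicts with `start`, `end`, and `text` keys. Note how a segment is assigned to the window its start time falls in, even when the sentence spills past the boundary:

```python
def chunk_by_duration(segments, window=60.0):
    """Group transcript segments into fixed-duration windows.

    `segments` is a list of dicts with 'start', 'end', and 'text' keys
    (times in seconds). Each segment is assigned to the window
    containing its start time, even if it ends in the next window.
    """
    buckets = {}
    for seg in segments:
        bucket = int(seg['start'] // window)
        buckets.setdefault(bucket, []).append(seg['text'])
    return [
        {"start": b * window, "end": (b + 1) * window, "text": " ".join(texts)}
        for b, texts in sorted(buckets.items())
    ]
```

This reproduces the boundary problem described above: a segment starting at 0:59 lands entirely in the first window, dragging its tail out of context.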
Semantic Transcript Chunking
A better approach is to use sentence boundaries (full stops) or speaker changes as the chunk boundary, capped at a maximum word count.
```python
def chunk_transcript(segments, max_words=100):
    chunks = []
    current_chunk = []
    current_word_count = 0
    chunk_start = None
    for seg in segments:
        if chunk_start is None:
            chunk_start = seg['start']
        current_chunk.append(seg['text'])
        current_word_count += len(seg['text'].split())
        if current_word_count >= max_words:
            chunks.append({
                "start": chunk_start,  # start of this chunk, not of the whole transcript
                "end": seg['end'],
                "text": " ".join(current_chunk),
            })
            current_chunk = []
            current_word_count = 0
            chunk_start = None
    if current_chunk:  # flush the final partial chunk
        chunks.append({
            "start": chunk_start,
            "end": segments[-1]['end'],
            "text": " ".join(current_chunk),
        })
    return chunks
```
Scene-Based Video Chunking
If you are indexing video based on visual content, use Scene Detection. A scene change (a cut from a person talking to a slide) is a natural "semantic break".
- Detect Scene Boundaries: use a tool such as PySceneDetect.
- Collect Text: gather all transcript text spoken during that scene.
- Capture Keyframes: select 1-3 images that represent the scene.
- Create the Chunk: bind the text, images, and time range into a single object.
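The text-collection step above can be sketched as follows. This assumes scene boundaries are already available as `(start, end)` pairs in seconds (e.g., converted from PySceneDetect's frame timecodes) and that segments use the same `{'start', 'end', 'text'}` shape as earlier; the midpoint test for assigning a segment to a scene is one simple choice, not the only one:

```python
def build_scene_chunks(scene_boundaries, segments):
    """Bind transcript text to detected scenes.

    scene_boundaries: list of (start, end) pairs in seconds.
    segments: list of {'start', 'end', 'text'} dicts.
    """
    chunks = []
    for scene_start, scene_end in scene_boundaries:
        # Assign each segment to the scene containing its midpoint,
        # so a sentence spanning a cut is not duplicated.
        texts = [
            seg['text'] for seg in segments
            if scene_start <= (seg['start'] + seg['end']) / 2 < scene_end
        ]
        chunks.append({
            "start": scene_start,
            "end": scene_end,
            "text": " ".join(texts),
            "keyframes": [],  # to be filled with 1-3 representative images
        })
    return chunks
```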
Metadata Requirements
Every video/audio chunk must include:
- video_id
- start_timestamp
- end_timestamp
- speaker_labels (if available)
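One way to enforce this contract is a small dataclass; the field names below mirror the list above, and `VideoChunk` itself is an illustrative name, not a library type:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VideoChunk:
    """Minimal metadata contract for a video/audio chunk."""
    video_id: str
    start_timestamp: float  # seconds from the start of the video
    end_timestamp: float
    text: str
    speaker_labels: Optional[List[str]] = None  # e.g. ["SPEAKER_00"], if diarized
```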
Exercises
- Look at a 10-minute "Talking Head" video. How many semantic topic changes can you identify?
- If you chunked it every 30 seconds, would the "Answer" to a question be split across two chunks?
- What is the benefit of adding an "Overlapping" 5-second window to audio chunks?