
Chunking Transcripts and Videos
Strategies for breaking down temporal data into semantically cohesive and searchable units.
Transcripts are inherently linear and temporal. Unlike a book, which has chapters and subheaders, a transcript is often just a long stream of text with time markers. Chunking these effectively requires balancing the "Time" context with the "Topic" context.
Temporal Chunking
The simplest method is to chunk by a fixed duration (e.g., every 60 seconds). This works well for UI (jumping to a timestamp) but poorly for semantics: a sentence that starts at 0:59 and ends at 1:01 gets split across two chunks.
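A minimal sketch of fixed-duration chunking, assuming (as in the examples below) that transcript segments are dicts with `start`, `end`, and `text` keys. Note how a segment is assigned to the window its start time falls in, even when the sentence spills past the boundary:

```python
def chunk_by_duration(segments, window=60.0):
    """Group transcript segments into fixed-duration windows.

    `segments` is a list of dicts with 'start', 'end', and 'text' keys
    (times in seconds). Each segment is assigned to the window
    containing its start time, even if it ends in the next window.
    """
    buckets = {}
    for seg in segments:
        bucket = int(seg['start'] // window)
        buckets.setdefault(bucket, []).append(seg['text'])
    return [
        {"start": b * window, "end": (b + 1) * window, "text": " ".join(texts)}
        for b, texts in sorted(buckets.items())
    ]
```

This reproduces the boundary problem described above: a segment starting at 0:59 lands entirely in the first window, dragging its tail out of context.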
Semantic Transcript Chunking
A better approach is to use sentence boundaries (full stops) or speaker changes as the chunk boundary, capped at a maximum word count.
```python
def chunk_transcript(segments, max_words=100):
    chunks = []
    current_chunk = []
    current_word_count = 0
    chunk_start = None
    for seg in segments:
        if chunk_start is None:
            chunk_start = seg['start']
        current_chunk.append(seg['text'])
        current_word_count += len(seg['text'].split())
        if current_word_count >= max_words:
            chunks.append({
                "start": chunk_start,  # start of this chunk, not of the whole transcript
                "end": seg['end'],
                "text": " ".join(current_chunk),
            })
            current_chunk = []
            current_word_count = 0
            chunk_start = None
    if current_chunk:  # flush the final partial chunk
        chunks.append({
            "start": chunk_start,
            "end": segments[-1]['end'],
            "text": " ".join(current_chunk),
        })
    return chunks
```
Scene-Based Video Chunking
If you are indexing video based on visual content, use Scene Detection. A scene change (a cut from a person talking to a slide) is a natural "semantic break".
- Detect Scene Boundaries: use a tool such as PySceneDetect.
- Collect Text: gather all transcript text spoken during that scene.
- Capture Keyframes: select 1-3 images that represent the scene.
- Create the Chunk: bind the text, images, and time range into a single object.
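The text-collection step above can be sketched as follows. This assumes scene boundaries are already available as `(start, end)` pairs in seconds (e.g., converted from PySceneDetect's frame timecodes) and that segments use the same `{'start', 'end', 'text'}` shape as earlier; the midpoint test for assigning a segment to a scene is one simple choice, not the only one:

```python
def build_scene_chunks(scene_boundaries, segments):
    """Bind transcript text to detected scenes.

    scene_boundaries: list of (start, end) pairs in seconds.
    segments: list of {'start', 'end', 'text'} dicts.
    """
    chunks = []
    for scene_start, scene_end in scene_boundaries:
        # Assign each segment to the scene containing its midpoint,
        # so a sentence spanning a cut is not duplicated.
        texts = [
            seg['text'] for seg in segments
            if scene_start <= (seg['start'] + seg['end']) / 2 < scene_end
        ]
        chunks.append({
            "start": scene_start,
            "end": scene_end,
            "text": " ".join(texts),
            "keyframes": [],  # to be filled with 1-3 representative images
        })
    return chunks
```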
Metadata Requirements
Every video/audio chunk must include:
- video_id
- start_timestamp
- end_timestamp
- speaker_labels (if available)
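One way to enforce this contract is a small dataclass; the field names below mirror the list above, and `VideoChunk` itself is an illustrative name, not a library type:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VideoChunk:
    """Minimal metadata contract for a video/audio chunk."""
    video_id: str
    start_timestamp: float  # seconds from the start of the video
    end_timestamp: float
    text: str
    speaker_labels: Optional[List[str]] = None  # e.g. ["SPEAKER_00"], if diarized
```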
Exercises
- Look at a 10-minute "Talking Head" video. How many semantic topic changes can you identify?
- If you chunked it every 30 seconds, would the "Answer" to a question be split across two chunks?
- What is the benefit of adding an "Overlapping" 5-second window to audio chunks?