Chunking Transcripts and Videos

Strategies for breaking down temporal data into semantically cohesive and searchable units.

Transcripts are inherently linear and temporal. Unlike a book, which has chapters and subheaders, a transcript is often just a long stream of text with time markers. Chunking these effectively requires balancing the "Time" context with the "Topic" context.

Temporal Chunking

The simplest approach is to chunk by a fixed duration (e.g., every 60 seconds). This works well for the UI, since each chunk maps directly to a timestamp a user can jump to, but it is poor for semantics: a sentence that starts at 0:59 and ends at 1:01 is split across two chunks.
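As a minimal sketch of fixed-duration chunking, assume the transcript arrives as a list of segments, each a dict with `start` and `end` in seconds and a `text` field (the same shape used later in this article). The function name `chunk_by_duration` and the `window_seconds` parameter are illustrative. This variant keeps a straddling segment whole and assigns it to the window in which it starts; a strict cut at the boundary would need word-level timestamps.

```python
def chunk_by_duration(segments, window_seconds=60.0):
    """Assign each segment to the fixed window that contains its start time."""
    buckets = {}
    for seg in segments:
        # Window 0 covers [0, 60), window 1 covers [60, 120), and so on.
        index = int(seg["start"] // window_seconds)
        buckets.setdefault(index, []).append(seg["text"])

    return [
        {
            "start": index * window_seconds,
            "end": (index + 1) * window_seconds,
            "text": " ".join(texts),
        }
        for index, texts in sorted(buckets.items())
    ]


segments = [
    {"start": 0.0, "end": 58.0, "text": "Welcome to the show."},
    {"start": 59.0, "end": 61.0, "text": "Today we cover chunking."},
    {"start": 61.5, "end": 70.0, "text": "Let's begin."},
]
fixed_chunks = chunk_by_duration(segments)
```

Note that the second segment runs past the 60-second mark but still lands in the first window, so the reported `end` of a chunk is the window edge rather than the end of its last sentence.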

Semantic Transcript Chunking

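A better approach is to place boundaries where the speech naturally pauses: at a full stop (the end of a sentence) or at a speaker change. The function below groups whole transcript segments until a word budget is reached, so a chunk never ends in the middle of a segment.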

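def chunk_transcript(segments, max_words=100):
    """Group consecutive segments into chunks of roughly max_words words."""
    chunks = []
    current_chunk = []
    current_word_count = 0
    chunk_start = None  # start time of the first segment in the current chunk

    for seg in segments:
        if chunk_start is None:
            chunk_start = seg['start']
        current_chunk.append(seg['text'])
        current_word_count += len(seg['text'].split())

        # Close the chunk once the word budget is reached
        if current_word_count >= max_words:
            chunks.append({
                "start": chunk_start,
                "end": seg['end'],
                "text": " ".join(current_chunk)
            })
            current_chunk = []
            current_word_count = 0
            chunk_start = None

    # Flush trailing segments that never reached max_words
    if current_chunk:
        chunks.append({
            "start": chunk_start,
            "end": segments[-1]['end'],
            "text": " ".join(current_chunk)
        })

    return chunks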
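The function above cuts on a word budget. To cut on a speaker change instead, one possible sketch is shown below. It assumes each segment carries a `speaker` field, which is not part of the segment shape used above and would typically come from a diarization step; the name `chunk_by_speaker` is illustrative.

```python
def chunk_by_speaker(segments):
    """Merge consecutive segments from the same speaker into one chunk."""
    chunks = []
    for seg in segments:
        speaker = seg.get("speaker")  # "speaker" is an assumed field
        if chunks and chunks[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the open chunk.
            chunks[-1]["end"] = seg["end"]
            chunks[-1]["text"] += " " + seg["text"]
        else:
            # Speaker changed (or first segment): start a new chunk.
            chunks.append({
                "speaker": speaker,
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"],
            })
    return chunks


dialogue = [
    {"speaker": "A", "start": 0.0, "end": 2.0, "text": "What is chunking?"},
    {"speaker": "B", "start": 2.0, "end": 5.0, "text": "Splitting text into units."},
    {"speaker": "B", "start": 5.0, "end": 8.0, "text": "Each unit is indexed."},
]
speaker_chunks = chunk_by_speaker(dialogue)
```

A long monologue would produce one very large chunk with this rule alone, so in practice it probably needs to be combined with a size cap such as `max_words`.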

Scene-Based Video Chunking

If you are indexing video by its visual content, use scene detection. A scene change (for example, a cut from a person talking to a slide) is a natural "semantic break".

  1. Detect Scene Boundaries (using PySceneDetect).
  2. Collect Text: Gather all transcript text said during that scene.
  3. Capture Keyframes: Select 1-3 images that represent the scene.
  4. Create the Chunk: Bind the text, images, and time-range into a single object.
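Steps 2 to 4 can be sketched as follows, assuming step 1 has already produced scene boundaries as `(start, end)` pairs in seconds. The function name `build_scene_chunks` is illustrative, and picking evenly spaced timestamps as keyframe candidates is an assumption of this sketch rather than a recommendation from any particular tool. Actual frame extraction is left out.

```python
def build_scene_chunks(scenes, segments, keyframes_per_scene=3):
    """Bind transcript text and keyframe timestamps to each scene's time range."""
    chunks = []
    for scene_start, scene_end in scenes:
        # A segment belongs to the scene in which it starts.
        texts = [
            seg["text"] for seg in segments
            if scene_start <= seg["start"] < scene_end
        ]
        # Evenly spaced timestamps at which a frame could be extracted.
        step = (scene_end - scene_start) / (keyframes_per_scene + 1)
        keyframe_times = [
            scene_start + step * (i + 1) for i in range(keyframes_per_scene)
        ]
        chunks.append({
            "start": scene_start,
            "end": scene_end,
            "text": " ".join(texts),
            "keyframe_times": keyframe_times,
        })
    return chunks


scenes = [(0.0, 10.0), (10.0, 30.0)]  # assumed output of a scene detector
segments = [
    {"start": 1.0, "end": 4.0, "text": "Hello everyone."},
    {"start": 9.5, "end": 11.0, "text": "Here is the first slide."},
    {"start": 12.0, "end": 20.0, "text": "It shows the pipeline."},
]
scene_chunks = build_scene_chunks(scenes, segments, keyframes_per_scene=1)
```

Assigning a segment by its start time keeps every sentence in exactly one chunk, at the cost of some text spilling slightly past the scene cut.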

Metadata Requirements

Every video/audio chunk must include:

  • video_id
  • start_timestamp
  • end_timestamp
  • speaker_labels (if available)
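One way to enforce these fields is a small record type. The sketch below uses the field names from the list above; the class name `MediaChunk`, the `text` field, and the choice of seconds as the timestamp unit are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MediaChunk:
    video_id: str
    start_timestamp: float  # seconds from the start of the media (assumed unit)
    end_timestamp: float
    text: str = ""  # not in the required list; added for illustration
    speaker_labels: List[str] = field(default_factory=list)  # empty if unavailable

    def __post_init__(self):
        # Reject chunks whose time range runs backwards.
        if self.end_timestamp < self.start_timestamp:
            raise ValueError("end_timestamp must not precede start_timestamp")


chunk = MediaChunk(
    video_id="demo-video",  # hypothetical identifier
    start_timestamp=59.0,
    end_timestamp=61.0,
    text="Today we cover chunking.",
    speaker_labels=["host"],
)
```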

Exercises

  1. Look at a 10-minute "Talking Head" video. How many semantic topic changes can you identify?
  2. If you chunked it every 30 seconds, would the "Answer" to a question be split across two chunks?
  3. What is the benefit of adding an "Overlapping" 5-second window to audio chunks?
