
Audio Preprocessing and Transcription
Techniques for cleaning audio, segmenting speech, and generating high-accuracy transcripts for RAG.
Audio is an invaluable but messy source of data. From podcast episodes to recorded meetings, the goal of audio RAG is to make the "spoken word" searchable and linkable back to the exact second it was said.
Audio Cleaning and Normalization
Raw audio often contains hiss, background noise, or inconsistent volume levels.
Key Steps
- Denoising: Use libraries like noisereduce to strip background static.
- Normalizing: Adjust the volume to a consistent level across files.
- Resampling: Convert to a standard sample rate (e.g., 16 kHz) required by models like OpenAI Whisper.
import librosa
import noisereduce as nr
import soundfile as sf

def clean_audio(input_file, output_file):
    # Load and resample to the 16 kHz rate Whisper expects
    y, sr = librosa.load(input_file, sr=16000)
    # Strip stationary background noise, then peak-normalize the volume
    y = nr.reduce_noise(y=y, sr=sr)
    y = y / max(abs(y).max(), 1e-9)
    sf.write(output_file, y, sr)
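For example (the file names here are just placeholders):

clean_audio("raw_episode.mp3", "clean_episode.wav")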
Transcription with Whisper
Whisper (by OpenAI) is the gold standard for open-source transcription.
import whisper

# "base" balances speed and accuracy; larger checkpoints
# ("small", "medium", "large") are slower but more accurate
model = whisper.load_model("base")
result = model.transcribe("meeting.mp3")
print(result["text"])
Advanced Whisper Features
- Timestamps: Essential for linking search hits to the video/audio player.
- Language Detection: Automatically identifying the spoken language.
- Translation: Transcribing from Spanish to English in a single step.
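All three features surface through the same transcribe call. A minimal sketch (the Spanish-language file reunion_es.mp3 is a hypothetical example):

import whisper

model = whisper.load_model("base")
result = model.transcribe("meeting.mp3")
# Timestamps: each entry in result["segments"] carries start/end times
for seg in result["segments"]:
    print(f'[{seg["start"]:.1f}s-{seg["end"]:.1f}s] {seg["text"]}')
# Language detection: the detected language code rides along in the result
print(result["language"])
# Translation: task="translate" outputs English regardless of source language
spanish = model.transcribe("reunion_es.mp3", task="translate")
print(spanish["text"])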
Diarization: "Who Spoke When?"
In a meeting, it's not enough to know what was said. You need to know who said it. Speaker Diarization labels text with speaker IDs (e.g., "Speaker A", "Speaker B").
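Whisper does not label speakers on its own; a common pairing is the pyannote.audio library. A hedged sketch, assuming you have accepted the model's terms on Hugging Face and have an access token (the hf_... string is a placeholder):

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # placeholder; supply your own token
)
diarization = pipeline("meeting.wav")
# Each turn carries start/end times plus a label like "SPEAKER_00"
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s-{turn.end:.1f}s: {speaker}")

Merging these speaker turns with Whisper's timestamped segments yields a transcript labeled by both time and speaker.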
Creating Searchable Audio Chunks
Unlike text sections, audio chunks are defined by time. Two common strategies, the first of which is sketched after this list:
- Fixed-Time Window: Every 30 seconds.
- Semantic Window: Break at pauses or speaker changes.
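A minimal sketch of the fixed-time strategy, assuming Whisper-style segments with start, end, and text keys (chunk_segments is a hypothetical helper, not part of any library):

def chunk_segments(segments, window=30.0):
    """Group timestamped segments into ~30-second chunks for indexing."""
    chunks, texts, start = [], [], None
    for seg in segments:
        if start is None:
            start = seg["start"]
        texts.append(seg["text"])
        if seg["end"] - start >= window:
            chunks.append({"start": start, "end": seg["end"], "text": " ".join(texts)})
            texts, start = [], None
    if texts:  # flush the trailing partial chunk
        chunks.append({"start": start, "end": segments[-1]["end"], "text": " ".join(texts)})
    return chunks

chunks = chunk_segments(result["segments"])  # result from the Whisper example above

Because each chunk keeps its start time, a search hit can deep-link the player to the exact second it was spoken.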
Exercises
- Record 10 seconds of audio with background noise (like a fan).
- Use the Python whisper library to transcribe it.
- How much did the noise affect the accuracy?