
The Conversational Agent: Voice and Audio
Build agents that listen and speak. Master the integration of Whisper (STT) and ElevenLabs (TTS) to create low-latency, empathetic voice experiences.
Voice and Audio Agents
Adding "Voice" to an agent changes the relationship with the user. It moves from "Tool" to "Companion." However, voice systems introduce a new, brutal constraint: Latency. In text, a 2-second delay is fine. In voice, a 2-second delay (the "Uncomfortable Silence") makes the interaction feel broken.
In this lesson, we will learn how to build the "Voice Pipeline" for real-time agents.
1. The Three-Step Pipeline
A voice agent is actually three different models working together (a minimal sketch follows this list):
- Speech-to-Text (STT): Converting audio bytes into text. (e.g., OpenAI Whisper or Deepgram).
- The LLM (Reasoning): Deciding what to say back.
- Text-to-Speech (TTS): Converting the text response into a human-sounding voice. (e.g., ElevenLabs or PlayHT).
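Here is a minimal sketch of that chain, assuming each provider is wrapped behind a simple callable. The stt, llm, and tts parameters are hypothetical stand-ins for whichever services you choose, not a specific SDK:

from typing import Callable

def handle_turn(
    audio_bytes: bytes,
    stt: Callable[[bytes], str],   # e.g. a Whisper or Deepgram wrapper: audio -> text
    llm: Callable[[str], str],     # the reasoning model: user text -> reply text
    tts: Callable[[str], None],    # e.g. an ElevenLabs or PlayHT wrapper: text -> spoken audio
) -> None:
    user_text = stt(audio_bytes)   # 1. Speech-to-Text
    reply_text = llm(user_text)    # 2. Reasoning
    tts(reply_text)                # 3. Text-to-Speech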
2. Managing Latency: The "Streaming" Voice
To avoid the silence, we stream at every stage.
- STT Streaming: The system starts transcribing as soon as the user says the first word. It doesn't wait for them to finish speaking.
- LLM Streaming: The LLM starts outputting words one-by-one.
- TTS Streaming: The voice model takes the first 5 words from the LLM and starts "Synthesizing" audio immediately, before the 6th word is even generated.
Result: You can achieve a "Turn-around time" of < 800ms, which feels like a natural human conversation.
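A minimal sketch of the LLM-to-TTS hand-off. It assumes llm_tokens is a token generator from a streaming LLM call and synthesize_chunk is a hypothetical function that pushes a text fragment to a streaming TTS engine; both names are placeholders:

def stream_reply(llm_tokens, synthesize_chunk, chunk_size: int = 5):
    # Buffer a handful of tokens (roughly the "first 5 words" above),
    # then push them to TTS immediately instead of waiting for the full reply.
    buffer = []
    for token in llm_tokens:
        buffer.append(token)
        if len(buffer) >= chunk_size:
            synthesize_chunk("".join(buffer))  # audio starts playing while the LLM keeps writing
            buffer = []
    if buffer:
        synthesize_chunk("".join(buffer))      # flush the tail of the reply

In practice you would cut on punctuation rather than a fixed token count, so the TTS engine receives natural-sounding phrases.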
3. Handling "Interruption" (The Barge-In)
What happens if the agent is speaking, and the user says "Stop!"?
- The Problem: Traditional audio players will keep playing the 10-second MP3 until it's done.
- The Solution: Full Duplex Communication (see the sketch after this list).
- The microphone is always "Listening." If it detects "Human Speech" while the agent is "Speaking," the orchestrator must send a Kill Signal to the audio playback immediately.
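A minimal orchestrator sketch, assuming a playback object that exposes is_playing() and stop(), and a hypothetical vad_detects_speech() check over the live microphone stream (any voice-activity-detection library could sit behind it):

import threading
import time

def monitor_barge_in(playback, vad_detects_speech, poll_interval: float = 0.05):
    # Runs alongside the agent's speech: if the user starts talking,
    # send the "Kill Signal" by stopping playback immediately.
    def _watch():
        while playback.is_playing():
            if vad_detects_speech():
                playback.stop()          # cut the agent off mid-sentence
                break
            time.sleep(poll_interval)
    threading.Thread(target=_watch, daemon=True).start()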
4. Emotional Tone and Prosody
A voice agent should not sound like a robot. Modern TTS models allow you to inject Emotion Tags (see the sketch after the example below).
- User says: "I lost my credit card."
- Agent Prompt: "Talk in a sympathetic, urgent tone. Do not be cheerful."
- TTS: Adjusts the pitch and speed of the voice to match the "Sympathetic" instruction.
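One simple pattern is to have the LLM pick a tone label and map it to delivery settings. The preset values and the tts_synthesize wrapper below are illustrative, not a real API; actual TTS services expose different knobs (style parameters, stability, SSML prosody tags, and so on):

# Hypothetical tone presets mapped to generic delivery settings.
TONE_PRESETS = {
    "sympathetic": {"speed": 0.9, "pitch": -2},  # slower and lower for a worried user
    "cheerful":    {"speed": 1.1, "pitch": 2},
    "neutral":     {"speed": 1.0, "pitch": 0},
}

def speak_with_tone(text: str, tone: str, tts_synthesize) -> None:
    # tts_synthesize is an assumed wrapper around your TTS provider.
    settings = TONE_PRESETS.get(tone, TONE_PRESETS["neutral"])
    tts_synthesize(text, **settings)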
5. Multi-Lingual Audio
One of the greatest use cases for agents is Real-time Translation (a code sketch of the chain follows this list).
- Input: User speaks in Spanish.
- Node 1: Translate to English Text.
- Node 2: LLM Reasons in English.
- Node 3: Translate back to Spanish Text.
- Node 4: TTS speaks in a Spanish voice that has the same biometric characteristics as the original user's voice (Voice Cloning).
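A sketch of those four nodes as one function. The stt, translate, llm, and tts_cloned parameters are hypothetical callables for whichever STT, translation, reasoning, and voice-cloning TTS services you plug in:

def multilingual_turn(spanish_audio: bytes, stt, translate, llm, tts_cloned) -> None:
    spanish_text = stt(spanish_audio)                       # transcribe the user's Spanish speech
    english_text = translate(spanish_text, target="en")     # Node 1: Spanish -> English text
    english_reply = llm(english_text)                       # Node 2: reason in English
    spanish_reply = translate(english_reply, target="es")   # Node 3: English -> Spanish text
    tts_cloned(spanish_reply)                                # Node 4: speak in a cloned Spanish voice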
6. Implementation Example: The ElevenLabs Hook
import elevenlabs  # legacy 0.x module-level API; newer SDK versions use an ElevenLabs client object

def speak(text, voice_id="premade_voice"):
    # Generate an audio stream
    audio = elevenlabs.generate(
        text=text,
        voice=voice_id,
        model="eleven_multilingual_v2",
        stream=True,  # CRITICAL for latency: chunks arrive as they are synthesized
    )
    # Play the chunks as they arrive (sent directly to the user's browser/speaker)
    # instead of waiting for the full clip to finish rendering
    elevenlabs.stream(audio)
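Calling it is then a one-liner (the voice name below is only an example; use any voice ID available in your ElevenLabs account):

speak("I'm sorry to hear that. Let me check your account right away.", voice_id="Rachel")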
Summary and Mental Model
Think of a Voice Agent like a Radio News Broadcaster.
- They have the Script (LLM).
- They have the Voice (TTS).
- But they also have an Engineer (The Orchestrator) in their ear telling them when to pause, when to speak up, and when to listen to the caller.
Voice is 20% Intelligence and 80% Orchestration.
Exercise: Voice Design
- The Silence: A user asks "What is my bank balance?" (Requires a 2-second tool call).
- How do you "Hide" the silence in a voice app?
- (Hint: "Let me check that for you right now..." - Filler talk).
- Privacy: Should a voice agent always be "Listening" (Always-on mic)?
- What are the Privacy Implications?
- How would you use a "Wake Word" (like 'Hey Agent') to solve this?
- Technical: Why is Deepgram often preferred over Whisper for real-time applications?
- (Hint: Look up "Endpointing" and "Streaming Latency").
Ready for movement? Next lesson: Video Analysis and Interaction.