
The Conversational Agent: Voice and Audio
Build agents that listen and speak. Master the integration of Whisper (STT) and ElevenLabs (TTS) to create low-latency, empathetic voice experiences.
Voice and Audio Agents
Adding "Voice" to an agent changes the relationship with the user. It moves from "Tool" to "Companion." However, voice systems introduce a new, brutal constraint: Latency. In text, a 2-second delay is fine. In voice, a 2-second delay (the "Uncomfortable Silence") makes the interaction feel broken.
In this lesson, we will learn how to build the "Voice Pipeline" for real-time agents.
1. The Three-Step Pipeline
A voice agent is actually three different models working together (a minimal sketch follows this list):
- Speech-to-Text (STT): Converting audio bytes into text. (e.g., OpenAI Whisper or Deepgram).
- The LLM (Reasoning): Deciding what to say back.
- Text-to-Speech (TTS): Converting the text response into a human-sounding voice. (e.g., ElevenLabs or PlayHT).
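Here is a minimal sketch of that chain, assuming each provider is wrapped behind a simple callable. The stt, llm, and tts parameters are hypothetical stand-ins for whichever services you choose, not a specific SDK:

from typing import Callable

def handle_turn(
    audio_bytes: bytes,
    stt: Callable[[bytes], str],   # e.g. a Whisper or Deepgram wrapper: audio -> text
    llm: Callable[[str], str],     # the reasoning model: user text -> reply text
    tts: Callable[[str], None],    # e.g. an ElevenLabs or PlayHT wrapper: text -> spoken audio
) -> None:
    user_text = stt(audio_bytes)   # 1. Speech-to-Text
    reply_text = llm(user_text)    # 2. Reasoning
    tts(reply_text)                # 3. Text-to-Speech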
2. Managing Latency: The "Streaming" Voice
To avoid the silence, we stream at every stage.
- STT Streaming: The system starts transcribing as soon as the user says the first word. It doesn't wait for them to finish speaking.
- LLM Streaming: The LLM starts outputting words one-by-one.
- TTS Streaming: The voice model takes the first 5 words from the LLM and starts "Synthesizing" audio immediately, before the 6th word is even generated.
Result: You can achieve a "Turn-around time" of < 800ms, which feels like a natural human conversation.
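A minimal sketch of the LLM-to-TTS hand-off. It assumes llm_tokens is a token generator from a streaming LLM call and synthesize_chunk is a hypothetical function that pushes a text fragment to a streaming TTS engine; both names are placeholders:

def stream_reply(llm_tokens, synthesize_chunk, chunk_size: int = 5):
    # Buffer a handful of tokens (roughly the "first 5 words" above),
    # then push them to TTS immediately instead of waiting for the full reply.
    buffer = []
    for token in llm_tokens:
        buffer.append(token)
        if len(buffer) >= chunk_size:
            synthesize_chunk("".join(buffer))  # audio starts playing while the LLM keeps writing
            buffer = []
    if buffer:
        synthesize_chunk("".join(buffer))      # flush the tail of the reply

In practice you would cut on punctuation rather than a fixed token count, so the TTS engine receives natural-sounding phrases.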
3. Handling "Interruption" (The Barge-In)
What happens if the agent is speaking, and the user says "Stop!"?
- The Problem: Traditional audio players will keep playing the 10-second MP3 until it's done.
- The Solution: Full Duplex Communication (see the sketch after this list).
- The microphone is always "Listening." If it detects "Human Speech" while the agent is "Speaking," the orchestrator must send a Kill Signal to the audio playback immediately.
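A minimal orchestrator sketch, assuming a playback object that exposes is_playing() and stop(), and a hypothetical vad_detects_speech() check over the live microphone stream (any voice-activity-detection library could sit behind it):

import threading
import time

def monitor_barge_in(playback, vad_detects_speech, poll_interval: float = 0.05):
    # Runs alongside the agent's speech: if the user starts talking,
    # send the "Kill Signal" by stopping playback immediately.
    def _watch():
        while playback.is_playing():
            if vad_detects_speech():
                playback.stop()          # cut the agent off mid-sentence
                break
            time.sleep(poll_interval)
    threading.Thread(target=_watch, daemon=True).start()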
4. Emotional Tone and Prosody
A voice agent should not sound like a robot. Modern TTS models allow you to inject Emotion Tags (see the sketch after the example below).
- User says: "I lost my credit card."
- Agent Prompt: "Talk in a sympathetic, urgent tone. Do not be cheerful."
- TTS: Adjusts the pitch and speed of the voice to match the "Sympathetic" instruction.
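One simple pattern is to have the LLM pick a tone label and map it to delivery settings. The preset values and the tts_synthesize wrapper below are illustrative, not a real API; actual TTS services expose different knobs (style parameters, stability, SSML prosody tags, and so on):

# Hypothetical tone presets mapped to generic delivery settings.
TONE_PRESETS = {
    "sympathetic": {"speed": 0.9, "pitch": -2},  # slower and lower for a worried user
    "cheerful":    {"speed": 1.1, "pitch": 2},
    "neutral":     {"speed": 1.0, "pitch": 0},
}

def speak_with_tone(text: str, tone: str, tts_synthesize) -> None:
    # tts_synthesize is an assumed wrapper around your TTS provider.
    settings = TONE_PRESETS.get(tone, TONE_PRESETS["neutral"])
    tts_synthesize(text, **settings)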
5. Multi-Lingual Audio
One of the greatest use cases for agents is Real-time Translation (a code sketch of the chain follows this list).
- Input: User speaks in Spanish.
- Node 1: Translate to English Text.
- Node 2: LLM Reasons in English.
- Node 3: Translate back to Spanish Text.
- Node 4: TTS speaks in a Spanish voice that has the same biometric characteristics as the original user's voice (Voice Cloning).
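A sketch of those four nodes as one function. The stt, translate, llm, and tts_cloned parameters are hypothetical callables for whichever STT, translation, reasoning, and voice-cloning TTS services you plug in:

def multilingual_turn(spanish_audio: bytes, stt, translate, llm, tts_cloned) -> None:
    spanish_text = stt(spanish_audio)                       # transcribe the user's Spanish speech
    english_text = translate(spanish_text, target="en")     # Node 1: Spanish -> English text
    english_reply = llm(english_text)                       # Node 2: reason in English
    spanish_reply = translate(english_reply, target="es")   # Node 3: English -> Spanish text
    tts_cloned(spanish_reply)                                # Node 4: speak in a cloned Spanish voice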
6. Implementation Example: The ElevenLabs Hook
import elevenlabs  # legacy 0.x module-level API; newer SDK versions use an ElevenLabs client object

def speak(text, voice_id="premade_voice"):
    # Generate an audio stream
    audio = elevenlabs.generate(
        text=text,
        voice=voice_id,
        model="eleven_multilingual_v2",
        stream=True,  # CRITICAL for latency: chunks arrive as they are synthesized
    )
    # Play the chunks as they arrive (sent directly to the user's browser/speaker)
    # instead of waiting for the full clip to finish rendering
    elevenlabs.stream(audio)
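Calling it is then a one-liner (the voice name below is only an example; use any voice ID available in your ElevenLabs account):

speak("I'm sorry to hear that. Let me check your account right away.", voice_id="Rachel")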
Summary and Mental Model
Think of a Voice Agent like a Radio News Broadcaster.
- They have the Script (LLM).
- They have the Voice (TTS).
- But they also have an Engineer (The Orchestrator) in their ear telling them when to pause, when to speak up, and when to listen to the caller.
Voice is 20% Intelligence and 80% Orchestration.
Exercise: Voice Design
- The Silence: A user asks "What is my bank balance?" (Requires a 2-second tool call).
- How do you "Hide" the silence in a voice app?
- (Hint: "Let me check that for you right now..." - Filler talk).
- Privacy: Should a voice agent always be "Listening" (Always-on mic)?
- What are the Privacy Implications?
- How would you use a "Wake Word" (like 'Hey Agent') to solve this?
- Technical: Why is Deepgram often preferred over Whisper for real-time applications?
- (Hint: Look up "Endpointing" and "Streaming Latency").
Ready for movement? Next lesson: Video Analysis and Interaction.