The Conversational Agent: Voice and Audio

Build agents that listen and speak. Master the integration of Whisper (STT) and ElevenLabs (TTS) to create low-latency, empathetic voice experiences.

Voice and Audio Agents

Adding "Voice" to an agent changes its relationship with the user: it moves from "Tool" to "Companion." However, voice systems introduce a new, brutal constraint: Latency. In a text chat, a 2-second delay is fine. In a voice conversation, the same 2-second delay (the "Uncomfortable Silence") makes the interaction feel broken.

In this lesson, we will learn how to build the "Voice Pipeline" for real-time agents.


1. The Three-Step Pipeline

A voice agent is actually three different models working together:

  1. Speech-to-Text (STT): Converting audio bytes into text (e.g., OpenAI Whisper or Deepgram).
  2. The LLM (Reasoning): Deciding what to say back.
  3. Text-to-Speech (TTS): Converting the text response into a human-sounding voice (e.g., ElevenLabs or PlayHT).
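
The three stages above can be sketched as one function that chains them. This is only an illustration: the stage signatures and the `fake_*` stand-ins are assumptions made up for this example, and real STT, LLM, and TTS clients would be wrapped to match them.

```python
from typing import Callable

# Hypothetical stage signatures (not from any specific SDK).
SpeechToText = Callable[[bytes], str]
Reasoner = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


def run_turn(audio_in: bytes, stt: SpeechToText, llm: Reasoner, tts: TextToSpeech) -> bytes:
    """Run one conversational turn: user audio in, agent audio out."""
    transcript = stt(audio_in)    # 1. Speech-to-Text
    reply_text = llm(transcript)  # 2. Reasoning
    return tts(reply_text)        # 3. Text-to-Speech


# Stand-in stages so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = run_turn(b"hello", fake_stt, fake_llm, fake_tts)
```

Note that this version is fully sequential: each stage waits for the previous one to finish, so the delays of all three stages add up.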

2. Managing Latency: The "Streaming" Voice

To avoid the silence, we stream at every stage.

  • STT Streaming: The system starts transcribing as soon as the user says the first word. It doesn't wait for them to finish the sentence.
  • LLM Streaming: The LLM emits its response word by word instead of returning the full answer at once.
  • TTS Streaming: The voice model takes the first 5 words from the LLM and starts "Synthesizing" audio immediately, before the 6th word is even generated.

Result: Because the stages overlap instead of running back to back, you can achieve a "Turn-around time" of < 800 ms, which feels like a natural human conversation.
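
The hand-off from LLM streaming to TTS streaming can be sketched as a small buffer that groups streamed words into short phrases. This is a minimal sketch, assuming each token is a whole word; the function name `chunk_for_tts` and the `min_words` parameter are made up for this example, and real LLM tokens are often sub-word pieces that would need to be joined first.

```python
from typing import Iterable, Iterator, List


def chunk_for_tts(tokens: Iterable[str], min_words: int = 5) -> Iterator[str]:
    """Group streamed LLM words into short phrases for a TTS engine.

    A phrase is emitted once it holds `min_words` words, or earlier
    if a word ends a sentence.
    """
    buffer: List[str] = []
    for token in tokens:
        word = token.strip()
        if not word:
            continue  # skip whitespace-only tokens
        buffer.append(word)
        ends_sentence = word.endswith((".", "!", "?"))
        if len(buffer) >= min_words or ends_sentence:
            yield " ".join(buffer)
            buffer = []
    if buffer:
        yield " ".join(buffer)  # flush whatever is left at end of stream


llm_tokens = ["Your", "balance", "is", "forty", "two", "dollars", "today.", "Anything", "else?"]
phrases = list(chunk_for_tts(llm_tokens))
# phrases == ["Your balance is forty two", "dollars today.", "Anything else?"]
```

Because `chunk_for_tts` is a generator, the first phrase is available as soon as five words have arrived, so synthesis can start while the LLM is still generating. A smaller `min_words` lowers latency but gives the TTS model less context for natural intonation.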


3. Handling "Interruption" (The Barge-In)

What happens if the agent is speaking, and the user says "Stop!"?

  • The Problem: A traditional audio player will keep playing the 10-second MP3 until it's done.
  • The Solution: Full Duplex Communication. The microphone is always "Listening," even while the agent is "Speaking."
  • The Mechanism: If the system detects "Human Speech" while the agent is "Speaking," the orchestrator must send a Kill Signal to the audio playback immediately.
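
One way to sketch the Kill Signal is a `threading.Event` that the playback loop checks between audio chunks. The `Playback` class and `on_user_speech_detected` callback are hypothetical names for this illustration; `time.sleep` stands in for writing a chunk to the audio device, and a real system would call the callback from a voice-activity detector.

```python
import threading
import time
from typing import List


class Playback:
    """Plays audio chunks until finished or until a kill signal arrives."""

    def __init__(self) -> None:
        self.kill_signal = threading.Event()
        self.chunks_played = 0

    def play(self, chunks: List[bytes], chunk_seconds: float = 0.01) -> None:
        for _chunk in chunks:
            if self.kill_signal.is_set():
                return  # barge-in: stop mid-utterance
            time.sleep(chunk_seconds)  # stand-in for writing to the audio device
            self.chunks_played += 1


def on_user_speech_detected(playback: Playback) -> None:
    """Called by a (hypothetical) voice-activity detector on the microphone."""
    playback.kill_signal.set()


playback = Playback()
speaker = threading.Thread(target=playback.play, args=([b"\x00"] * 100,))
speaker.start()
time.sleep(0.05)                   # the agent talks for a moment...
on_user_speech_detected(playback)  # ...then the user says "Stop!"
speaker.join()
```

The key design choice is that audio is played in small chunks rather than as one long file, so the loop gets a chance to notice the signal within one chunk's duration.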

4. Emotional Tone and Prosody

A voice agent should not sound like a robot. Some modern TTS models let you steer the delivery with Emotion Tags or a plain-language style instruction.

  • User says: "I lost my credit card."
  • Agent Prompt: "Talk in a sympathetic, urgent tone. Do not be cheerful."
  • TTS: Adjusts the pitch and speed of the voice to match the "Sympathetic" instruction.
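
The flow above can be sketched as a lookup from the user's situation to a tone instruction plus prosody settings. Everything here is an assumption for illustration: the `Prosody` fields, their scales, and the keyword list are invented, and keyword matching is only a stand-in for a real intent or sentiment classifier. The settings a given TTS provider actually exposes will differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prosody:
    tone_instruction: str  # text added to the agent prompt
    speaking_rate: float   # 1.0 = neutral speed (hypothetical scale)
    pitch_shift: float     # semitones relative to neutral (hypothetical scale)


TONES = {
    "distress": Prosody("Talk in a sympathetic, urgent tone. Do not be cheerful.", 1.1, -1.0),
    "neutral": Prosody("Talk in a calm, friendly tone.", 1.0, 0.0),
}

# Crude stand-in for a classifier: words that suggest the user is in trouble.
DISTRESS_KEYWORDS = ("lost", "stolen", "fraud", "emergency")


def choose_prosody(user_text: str) -> Prosody:
    """Pick tone and prosody settings based on what the user said."""
    lowered = user_text.lower()
    if any(keyword in lowered for keyword in DISTRESS_KEYWORDS):
        return TONES["distress"]
    return TONES["neutral"]


prosody = choose_prosody("I lost my credit card.")
```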

5. Multi-Lingual Audio

One of the greatest use cases for agents is Real-time Translation.

  • Input: User speaks in Spanish.
  • Node 1: Transcribe the speech and translate it to English Text.
  • Node 2: LLM Reasons in English.
  • Node 3: Translate back to Spanish Text.
  • Node 4: TTS speaks in a Spanish voice that has the same vocal characteristics as the original user's voice (Voice Cloning).
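
The text nodes of this graph (Nodes 1 to 3) can be sketched as a list of functions applied in order. The tiny dictionaries below are toy stand-ins for real translation and reasoning models, and all names are hypothetical. Audio input and Node 4 (the cloned-voice TTS) are left out to keep the example self-contained.

```python
from typing import Callable, List

Node = Callable[[str], str]


def run_nodes(text: str, nodes: List[Node]) -> str:
    """Pass the text through each node in order."""
    for node in nodes:
        text = node(text)
    return text


# Toy stand-ins for real translation models.
ES_TO_EN = {"hola": "hello"}
EN_TO_ES = {"hello to you too": "hola a ti también"}


def translate_to_english(text: str) -> str:   # Node 1
    return ES_TO_EN.get(text.lower(), text)


def reason_in_english(text: str) -> str:      # Node 2
    return f"{text} to you too"


def translate_to_spanish(text: str) -> str:   # Node 3
    return EN_TO_ES.get(text.lower(), text)


spanish_reply = run_nodes(
    "Hola",
    [translate_to_english, reason_in_english, translate_to_spanish],
)
```

Keeping each node as a plain function makes it easy to swap one stage (for example, a different translation model) without touching the others.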

6. Implementation Example: The ElevenLabs Hook

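# Note: top-level generate() comes from older releases of the elevenlabs SDK;
# newer releases expose generation through a client object instead.
from elevenlabs import generate, stream

def speak(text, voice_id="premade_voice"):
    """Synthesize `text` and play it while audio is still being generated."""
    # With stream=True, generate() returns an iterator of audio chunks
    # rather than one complete bytes object.
    audio_stream = generate(
        text=text,
        voice=voice_id,  # placeholder: replace with a real voice name or ID
        model="eleven_multilingual_v2",
        stream=True,  # CRITICAL for latency
    )
    # stream() plays chunks as they arrive on the local speaker (it relies on
    # the mpv player being installed); play() expects the complete audio.
    # A browser client would need the chunks forwarded over the network.
    stream(audio_stream)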

Summary and Mental Model

Think of a Voice Agent like a Radio News Broadcaster.

  • They have the Script (LLM).
  • They have the Voice (TTS).
  • But they also have an Engineer (The Orchestrator) in their ear telling them when to pause, when to speak up, and when to listen to the caller.

Voice is 20% Intelligence and 80% Orchestration.


Exercise: Voice Design

  1. The Silence: A user asks "What is my bank balance?" (Requires a 2-second tool call).
    • How do you "Hide" the silence in a voice app?
    • (Hint: "Let me check that for you right now..." - Filler talk).
  2. Privacy: Should a voice agent always be "Listening" (Always-on mic)?
    • What are the Privacy Implications?
    • How would you use a "Wake Word" (like 'Hey Agent') to solve this?
  3. Technical: Why is Deepgram often preferred over Whisper for real-time applications?
    • (Hint: Look up "Endpointing" and "Streaming Latency").

Ready for movement? Next lesson: Video Analysis and Interaction.
