Building Agents that Hear: Audio Processing and Emotion Detection

Give your agents ears. Explore Gemini's native audio capabilities to transcribe speech, identify multiple speakers, and detect emotional nuances in spoken language without external STT services.

Traditionally, building an "Audio Agent" required a complex chain of distinct services:

  1. STT (Speech-to-Text): Convert audio to text (e.g., Whisper, Amazon Transcribe).
  2. LLM: Process the text.
  3. Sentiment Analysis: A separate model to analyze tone.

With the Gemini ADK, this complexity evaporates. Gemini is natively audio-modal: it "listens" to the raw audio waveform directly. This allows the model to perceive things that text-only pipelines miss, such as sarcasm in a voice, the urgency in a customer's tone, or the ambient noise of a busy street in the background.

In this lesson, we will learn how to feed audio to our agents, perform speaker identification, and use "Sonic Reasoning" for advanced agentic tasks.


1. Supported Audio Formats and Limits

Gemini accepts the common professional and consumer audio formats directly:

  • WAV, MP3, AIFF, AAC, OGG, FLAC.

The "Hour-Long" Capacity

Thanks to Gemini's large context window, you can send hours of audio (the documentation cites roughly 9.5 hours per request) in a single prompt. This is perfect for transcribing long board meetings, podcasts, or legal depositions.


2. Beyond Transcription: The "Soul" of Sound

The most revolutionary part of Gemini's audio capability is Sonic Reasoning.

A. Emotional Intelligence (Sentiment)

A customer service agent can "hear" frustration.

  • Example: In a recording of a call, the user says "Fine, whatever." A text-only model takes "Fine" at face value; Gemini hears the clipped, aggressive tone and realizes the user is unhappy.

B. Speaker Diarization (Who Said What?)

Gemini can natively track different voices. You can ask: "List every time the person with the deep voice mentioned the word 'Budget'."

C. Non-Speech Recognition

Gemini understands the world around the words.

  • Example: "Listen to this recording. Besides the speaking, what else can you hear?"
  • Response: "I hear a siren in the distance and the sound of someone typing on a mechanical keyboard."
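For short clips you do not even need the File API. The minimal sketch below (assuming a local recording named support_call.mp3 and an already-configured google-generativeai client) passes the audio bytes inline and exercises all three capabilities at once:

import pathlib

import google.generativeai as genai

# Assumes genai.configure(api_key=...) has already been called
model = genai.GenerativeModel('gemini-1.5-pro')

# Inline audio is fine for short clips; use the File API for long recordings
audio_part = {
    "mime_type": "audio/mp3",
    "data": pathlib.Path('support_call.mp3').read_bytes(),
}

response = model.generate_content([
    "Listen to this clip and answer:",
    "1. What is the speaker's emotional state, and what in the delivery tells you that?",
    "2. How many distinct voices do you hear?",
    "3. What non-speech sounds are in the background?",
    audio_part,
])
print(response.text)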

3. Implementation: The "Meeting Analyst" Agent

To process audio, we typically use the File API (as audio files can be large).

import os
import time

import google.generativeai as genai

# 0. Authenticate (assumes your key is in the GOOGLE_API_KEY environment variable)
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# 1. Upload the audio via the File API
audio_file = genai.upload_file(path='board_meeting.mp3')

# 2. Wait for the file to finish server-side processing
while audio_file.state.name == "PROCESSING":
    time.sleep(2)
    audio_file = genai.get_file(audio_file.name)

if audio_file.state.name == "FAILED":
    raise RuntimeError("The audio file failed to process.")

# 3. Analyze with Gemini
model = genai.GenerativeModel('gemini-1.5-pro')

prompt = [
    "You are a meeting assistant. Listen to this recording and:",
    "1. Provide a verbatim transcript with speaker labels (Speaker A, Speaker B).",
    "2. Identify the main conflict that arose during the discussion.",
    "3. Summarize the tone of the CEO's closing remarks.",
    audio_file
]

response = model.generate_content(prompt)
print(response.text)
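Files uploaded through the File API are only retained temporarily (the documentation cites roughly 48 hours), but if you batch-process many recordings it is good hygiene to delete them explicitly once the analysis is done:

# Optional cleanup: remove the uploaded recording once we no longer need it
genai.delete_file(audio_file.name)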

4. Architectural Pattern: The "Sonic Guardrail"

In an autonomous phone support agent, you can use audio perception as a Safety Gate.

  1. Listen: The model hears the user is getting extremely angry.
  2. Reason: Agent decides: "The user's emotional state is critical. I should stop autonomous processing."
  3. Act: The agent calls the handoff_to_human_manager tool. A minimal code sketch of this pattern follows the diagram below.

graph TD
    A[Raw Audio Input] --> B[Gemini Reasoner]
    B --> C{Detect Emotion}
    C -->|Angry| D[Escalate to Human]
    C -->|Calm| E[Continue Autonomous Task]
    B --> F[Transcribe Speech]
    F --> G[Extract Task Details]
    G --> E
    
    style C fill:#EA4335,color:#fff
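A minimal sketch of this gate, assuming a hypothetical handoff_to_human_manager function and an audio_file already uploaded via the File API, might look like this:

import json

import google.generativeai as genai

def handoff_to_human_manager(summary: str) -> None:
    """Hypothetical escalation tool: page a human supervisor with context."""
    print(f"Escalating to a human manager: {summary}")

model = genai.GenerativeModel('gemini-1.5-pro')

# Listen + Reason: ask the model to classify the caller's emotional state from the audio
triage = model.generate_content(
    [
        "Classify the caller's emotional state as ANGRY or CALM and summarize their issue. "
        'Respond as JSON: {"state": "...", "summary": "..."}',
        audio_file,  # uploaded earlier with genai.upload_file
    ],
    generation_config={"response_mime_type": "application/json"},
)
result = json.loads(triage.text)

# Act: escalate if the emotional state trips the safety gate
if result["state"] == "ANGRY":
    handoff_to_human_manager(result["summary"])
else:
    print("Continuing autonomous handling:", result["summary"])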

5. Use Case: Disaster Response Agent

Imagine an agent tasked with monitoring emergency radio channels.

  • Input: Raw radio static and voice.
  • Task: "Listen for any mention of 'Help' or 'Fire'."
  • Native Edge: Gemini can filter out the static and sirens natively, focusing on the human voice with much higher accuracy than a standard STT model. A sketch of this monitoring prompt follows below.
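The sketch assumes a recorded radio segment saved as radio_segment.wav and asks for timestamped keyword hits as JSON so a downstream system can act on them:

import google.generativeai as genai

model = genai.GenerativeModel('gemini-1.5-pro')

# For longer clips, poll until the file state is ACTIVE, as in the Meeting Analyst example
radio_clip = genai.upload_file(path='radio_segment.wav')

alert = model.generate_content(
    [
        "Listen to this emergency-radio recording. List every mention of 'Help' or 'Fire' as JSON: "
        '[{"timestamp": "MM:SS", "keyword": "...", "quote": "..."}]. '
        "Ignore static, sirens, and other non-speech noise.",
        radio_clip,
    ],
    generation_config={"response_mime_type": "application/json"},
)
print(alert.text)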

6. Performance Considerations (Tokens and Cost)

In Gemini, audio is converted into tokens.

  • Rule of Thumb: audio is tokenized at roughly 32 tokens per second (about 1,920 tokens per minute).
  • An hour of audio is therefore roughly 115,000 tokens, which still fits comfortably inside the long context window (see the quick check below).
  • Latency: Transcribing an hour-long meeting can take 30-40 seconds of processing time.
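A quick back-of-the-envelope check of that math, assuming the roughly 32-tokens-per-second rate:

TOKENS_PER_SECOND = 32          # approximate audio tokenization rate
SECONDS_PER_HOUR = 60 * 60

print(TOKENS_PER_SECOND * SECONDS_PER_HOUR)  # 115200 tokens for one hour of audio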

7. Limitations of Audio

  • Overlapping Speech: If 5 people are all shouting over each other, Gemini (like humans) will struggle to differentiate them.
  • Audio Quality: Low-bitrate, highly compressed audio (like an old telephone line) will reduce the accuracy of emotion detection.
  • No "In-Library" Search: You cannot yet "index" audio for milli-second retrieval without transcribing it first (though this is changing rapidly).

8. Summary and Exercises

Audio agents are the Ears of your system.

  • Native Processing preserves emotional and ambient context.
  • Speaker Diarization allows for structured transcription of group talks.
  • Emotion Detection enables safety-first escalations.
  • File API is the primary delivery mechanism for audio data.

Exercises

  1. Emotional Analysis: Record yourself speaking a sentence (e.g., "I love this product") in three different ways: Sarcastic, Excited, and Bored. Send the audio to Gemini and ask it to identify the "Mood" of each.
  2. Ambient Intelligence: Record a 30-second clip of a park or a busy street. Ask Gemini to "List all the distinct sounds you hear (birds, wind, cars, footsteps)."
  3. Meeting Transformation: Take an old recording of a call or meeting. Ask Gemini to: "Turn this meeting into a set of Jira tickets in JSON format."

In the next lesson, we combine sight and sound as we explore Building Agents that Watch through video analysis.
