Module 16 Lesson 5: Multi-Modal Agents
See and hear. Designing interfaces that allow agents to process images, diagrams, and voice commands.
Multi-Modal UX: Beyond the Keyboard
The future of agents is not just text. It’s Vision (seeing your screen or a photo) and Audio (interpreting your voice and tone). This expands the UX from a "Chat" to a "Sensory Experience."
1. Vision: The "Look at This" Interaction
Users should be able to "Drag and Drop" any visual into the agent.
- UI Element: A screen-capture button or a photo uploader.
- Agent Action: "Analyze this chart from my electricity bill."
- UX Challenge: The agent must explain where in the image it saw the data (using coordinates or bounding boxes), as in the sketch below.
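To make this concrete, here is a minimal browser-side sketch in TypeScript. It assumes a hypothetical callVisionModel endpoint that returns an answer plus the bounding boxes it used as evidence; the UI then draws those boxes on a canvas overlay so the user can verify where the agent looked.

```typescript
// Sketch: ask a vision model about an uploaded image, then overlay its
// "citations" (bounding boxes) on the original image in the browser.
// `callVisionModel` is a hypothetical placeholder for your vision API.

interface BoundingBox {
  label: string;  // e.g. "kWh usage chart"
  x: number;      // top-left corner, in pixels of the original image
  y: number;
  width: number;
  height: number;
}

interface VisionAnswer {
  answer: string;         // the agent's textual explanation
  regions: BoundingBox[]; // where in the image it found the evidence
}

declare function callVisionModel(imageDataUrl: string, question: string): Promise<VisionAnswer>;

async function analyzeUpload(file: File, question: string, overlay: HTMLCanvasElement): Promise<string> {
  // Read the dropped / uploaded file as a data URL.
  const imageDataUrl = await new Promise<string>((resolve) => {
    const reader = new FileReader();
    reader.onload = () => resolve(reader.result as string);
    reader.readAsDataURL(file);
  });

  const result = await callVisionModel(imageDataUrl, question);

  // Draw each bounding box so the user can see *where* the agent looked.
  const ctx = overlay.getContext("2d")!;
  ctx.strokeStyle = "red";
  ctx.lineWidth = 2;
  for (const box of result.regions) {
    ctx.strokeRect(box.x, box.y, box.width, box.height);
    ctx.fillText(box.label, box.x, box.y - 4);
  }
  return result.answer;
}
```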
2. Audio: The "Voice Primary" Interface
Voice agents (like those powered by OpenAI's Realtime API) are built for near-zero latency.
- UX Requirement: You need a Visual Waveform to show the AI is listening.
- Interruption Support: The UX must handle the user "Talking over" the agent (often called barge-in). In a chat box this situation doesn't arise; in voice, it's natural, as in the sketch below.
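Below is a minimal sketch of that voice loop over a WebSocket. The endpoint and event names (user.speech.started, agent.response.cancel, and so on) are illustrative placeholders rather than any provider's actual protocol; the point is the barge-in logic: the moment the user starts speaking, playback stops and the in-flight response is cancelled.

```typescript
// Sketch: a voice-primary loop over a realtime WebSocket.
// Event names and the endpoint are illustrative placeholders, not a specific
// provider's protocol -- adapt them to the API you actually use.

const socket = new WebSocket("wss://example.com/realtime"); // hypothetical endpoint

let agentIsSpeaking = false;

socket.addEventListener("message", (event) => {
  const msg = JSON.parse(event.data);

  switch (msg.type) {
    case "agent.audio.chunk":
      // Stream synthesized speech to the speakers and animate the waveform
      // so the user can see the agent is "talking".
      agentIsSpeaking = true;
      playAudioChunk(msg.audio);
      break;

    case "user.speech.started":
      // Barge-in: the user talked over the agent. Stop playback immediately
      // and tell the server to cancel the in-flight response.
      if (agentIsSpeaking) {
        stopPlayback();
        socket.send(JSON.stringify({ type: "agent.response.cancel" }));
        agentIsSpeaking = false;
      }
      break;

    case "agent.response.done":
      agentIsSpeaking = false;
      break;
  }
});

// Placeholders for the audio layer (Web Audio API, native player, etc.).
declare function playAudioChunk(base64Audio: string): void;
declare function stopPlayback(): void;
```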
3. Visualizing the Sensory Input
```mermaid
graph TD
    User[Human] -->|Voice| A[Audio Component]
    User -->|Screenshot| V[Vision Component]
    A --> Master[Master Agent Brain]
    V --> Master
    Master -->|Action| Tool[Tool Execution]
    Master -->|Response| Voice[Speech Synthesis]
    Master -->|Response| UI[Visual Bounding Box on Image]
```
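The fan-in above can also be expressed as a small routing layer: each modality component normalizes its input into a common envelope, and the response is rendered back in the modality the user used. A minimal TypeScript sketch, where runMasterAgent, speak, and renderAnswer are hypothetical placeholders:

```typescript
// Sketch of the fan-in shown in the diagram: each modality component
// normalizes its input into a common envelope before it reaches the
// master agent. All names here are illustrative.

type SensoryInput =
  | { kind: "voice"; transcript: string }
  | { kind: "vision"; imageDataUrl: string; question: string };

interface AgentOutput {
  text: string;                                                       // spoken or rendered reply
  boxes?: { x: number; y: number; width: number; height: number }[];  // vision "citations"
}

// Placeholders for the agent loop and the two output channels.
declare function runMasterAgent(input: SensoryInput): Promise<AgentOutput>;
declare function speak(text: string): void;
declare function renderAnswer(text: string, boxes: NonNullable<AgentOutput["boxes"]>): void;

async function handleSensoryInput(input: SensoryInput): Promise<void> {
  const output = await runMasterAgent(input);

  if (input.kind === "voice") {
    // Match the modality of the request: the user spoke, so speak back.
    speak(output.text);
  } else {
    // Vision input: render the answer plus bounding boxes over the image.
    renderAnswer(output.text, output.boxes ?? []);
  }
}
```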
4. Multi-Modal Accessibility
Multi-modal agents are a superpower for accessibility.
- Blind Users: The agent "sees" the screen and describes the UI.
- Deaf Users: The agent "listens" to a video call and provides a real-time, context-aware summary.
5. Engineering Tip: Vision Tokens are Expensive
Processing a high-resolution image can cost more tokens than processing an entire page of text.
- The Optimization: Use a small, low-resolution model for a quick "Glance" (is there anything relevant in this image?) and only wake up the expensive "Vision" model when the user asks a specific question about the image details, as sketched below.
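A minimal sketch of that two-tier routing, where cheapGlanceModel, detailedVisionModel, and downscale are hypothetical placeholders for a small triage model, a full vision model, and an image resizer:

```typescript
// Sketch of the two-tier vision idea: a cheap, low-resolution pass decides
// whether the expensive high-detail model is needed at all.
// All three functions below are hypothetical placeholders.

declare function cheapGlanceModel(thumbnailDataUrl: string): Promise<{ containsRelevantVisual: boolean }>;
declare function detailedVisionModel(fullResDataUrl: string, question: string): Promise<string>;
declare function downscale(imageDataUrl: string, maxEdgePixels: number): Promise<string>;

async function answerAboutImage(imageDataUrl: string, question: string): Promise<string> {
  // Tier 1: a tiny thumbnail is enough to ask "is there anything worth reading here?"
  const thumbnail = await downscale(imageDataUrl, 256);
  const glance = await cheapGlanceModel(thumbnail);

  if (!glance.containsRelevantVisual) {
    return "I couldn't find anything in the image related to your question.";
  }

  // Tier 2: only now pay for high-resolution vision tokens.
  return detailedVisionModel(imageDataUrl, question);
}
```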
Key Takeaways
- Vision allows agents to interact with the world of "Unstructured Visuals."
- Audio (Real-time) requires low-latency transport such as WebSockets or WebRTC.
- Feedback must be multi-modal (if the user speaks, the agent should speak).
- Bounding boxes are the "Citations" for vision-based accuracy.