Module 16 Lesson 5: Multi-Modal Agents
See and hear. Designing interfaces that allow agents to process images, diagrams, and voice commands.
Multi-Modal UX: Beyond the Keyboard
The future of agents is not just text. It’s Vision (seeing your screen or a photo) and Audio (interpreting your voice and tone). This expands the UX from a "Chat" to a "Sensory Experience."
1. Vision: The "Look at This" Interaction
Users should be able to "Drag and Drop" any visual into the agent.
- UI Element: A screen-capture button or a photo uploader.
- Agent Action: "Analyze this chart from my electricity bill."
- UX Challenge: The agent must explain where in the image it saw the data (using coordinates or bounding boxes), as in the sketch below.
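To make this concrete, here is a minimal browser-side sketch in TypeScript. It assumes a hypothetical callVisionModel endpoint that returns an answer plus the bounding boxes it used as evidence; the UI then draws those boxes on a canvas overlay so the user can verify where the agent looked.

```typescript
// Sketch: ask a vision model about an uploaded image, then overlay its
// "citations" (bounding boxes) on the original image in the browser.
// `callVisionModel` is a hypothetical placeholder for your vision API.

interface BoundingBox {
  label: string;  // e.g. "kWh usage chart"
  x: number;      // top-left corner, in pixels of the original image
  y: number;
  width: number;
  height: number;
}

interface VisionAnswer {
  answer: string;         // the agent's textual explanation
  regions: BoundingBox[]; // where in the image it found the evidence
}

declare function callVisionModel(imageDataUrl: string, question: string): Promise<VisionAnswer>;

async function analyzeUpload(file: File, question: string, overlay: HTMLCanvasElement): Promise<string> {
  // Read the dropped / uploaded file as a data URL.
  const imageDataUrl = await new Promise<string>((resolve) => {
    const reader = new FileReader();
    reader.onload = () => resolve(reader.result as string);
    reader.readAsDataURL(file);
  });

  const result = await callVisionModel(imageDataUrl, question);

  // Draw each bounding box so the user can see *where* the agent looked.
  const ctx = overlay.getContext("2d")!;
  ctx.strokeStyle = "red";
  ctx.lineWidth = 2;
  for (const box of result.regions) {
    ctx.strokeRect(box.x, box.y, box.width, box.height);
    ctx.fillText(box.label, box.x, box.y - 4);
  }
  return result.answer;
}
```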
2. Audio: The "Voice Primary" Interface
Voice agents (like those powered by OpenAI's Realtime API) are built for near-zero latency.
- UX Requirement: You need a Visual Waveform to show the AI is listening.
- Interruption Support: The UX must handle the user "Talking over" the agent (often called barge-in). In a chat box this situation doesn't arise; in voice, it's natural, as in the sketch below.
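Below is a minimal sketch of that voice loop over a WebSocket. The endpoint and event names (user.speech.started, agent.response.cancel, and so on) are illustrative placeholders rather than any provider's actual protocol; the point is the barge-in logic: the moment the user starts speaking, playback stops and the in-flight response is cancelled.

```typescript
// Sketch: a voice-primary loop over a realtime WebSocket.
// Event names and the endpoint are illustrative placeholders, not a specific
// provider's protocol -- adapt them to the API you actually use.

const socket = new WebSocket("wss://example.com/realtime"); // hypothetical endpoint

let agentIsSpeaking = false;

socket.addEventListener("message", (event) => {
  const msg = JSON.parse(event.data);

  switch (msg.type) {
    case "agent.audio.chunk":
      // Stream synthesized speech to the speakers and animate the waveform
      // so the user can see the agent is "talking".
      agentIsSpeaking = true;
      playAudioChunk(msg.audio);
      break;

    case "user.speech.started":
      // Barge-in: the user talked over the agent. Stop playback immediately
      // and tell the server to cancel the in-flight response.
      if (agentIsSpeaking) {
        stopPlayback();
        socket.send(JSON.stringify({ type: "agent.response.cancel" }));
        agentIsSpeaking = false;
      }
      break;

    case "agent.response.done":
      agentIsSpeaking = false;
      break;
  }
});

// Placeholders for the audio layer (Web Audio API, native player, etc.).
declare function playAudioChunk(base64Audio: string): void;
declare function stopPlayback(): void;
```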
3. Visualizing the Sensory Input
```mermaid
graph TD
    User[Human] -->|Voice| A[Audio Component]
    User -->|Screenshot| V[Vision Component]
    A --> Master[Master Agent Brain]
    V --> Master
    Master -->|Action| Tool[Tool Execution]
    Master -->|Response| Voice[Speech Synthesis]
    Master -->|Response| UI[Visual Bounding Box on Image]
```
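The fan-in above can also be expressed as a small routing layer: each modality component normalizes its input into a common envelope, and the response is rendered back in the modality the user used. A minimal TypeScript sketch, where runMasterAgent, speak, and renderAnswer are hypothetical placeholders:

```typescript
// Sketch of the fan-in shown in the diagram: each modality component
// normalizes its input into a common envelope before it reaches the
// master agent. All names here are illustrative.

type SensoryInput =
  | { kind: "voice"; transcript: string }
  | { kind: "vision"; imageDataUrl: string; question: string };

interface AgentOutput {
  text: string;                                                       // spoken or rendered reply
  boxes?: { x: number; y: number; width: number; height: number }[];  // vision "citations"
}

// Placeholders for the agent loop and the two output channels.
declare function runMasterAgent(input: SensoryInput): Promise<AgentOutput>;
declare function speak(text: string): void;
declare function renderAnswer(text: string, boxes: NonNullable<AgentOutput["boxes"]>): void;

async function handleSensoryInput(input: SensoryInput): Promise<void> {
  const output = await runMasterAgent(input);

  if (input.kind === "voice") {
    // Match the modality of the request: the user spoke, so speak back.
    speak(output.text);
  } else {
    // Vision input: render the answer plus bounding boxes over the image.
    renderAnswer(output.text, output.boxes ?? []);
  }
}
```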
4. Multi-Modal Accessibility
Multi-modal agents are a superpower for accessibility.
- Blind Users: The agent "sees" the screen and describes the UI.
- Deaf Users: The agent "listens" to a video call and provides a real-time, context-aware summary.
5. Engineering Tip: Vision Tokens are Expensive
Processing a high-resolution image can cost more tokens than processing an entire page of text.
- The Optimization: Use a small, low-resolution model for a quick "Glance" (is there anything relevant in this image?) and only wake up the expensive "Vision" model when the user asks a specific question about the image details, as sketched below.
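A minimal sketch of that two-tier routing, where cheapGlanceModel, detailedVisionModel, and downscale are hypothetical placeholders for a small triage model, a full vision model, and an image resizer:

```typescript
// Sketch of the two-tier vision idea: a cheap, low-resolution pass decides
// whether the expensive high-detail model is needed at all.
// All three functions below are hypothetical placeholders.

declare function cheapGlanceModel(thumbnailDataUrl: string): Promise<{ containsRelevantVisual: boolean }>;
declare function detailedVisionModel(fullResDataUrl: string, question: string): Promise<string>;
declare function downscale(imageDataUrl: string, maxEdgePixels: number): Promise<string>;

async function answerAboutImage(imageDataUrl: string, question: string): Promise<string> {
  // Tier 1: a tiny thumbnail is enough to ask "is there anything worth reading here?"
  const thumbnail = await downscale(imageDataUrl, 256);
  const glance = await cheapGlanceModel(thumbnail);

  if (!glance.containsRelevantVisual) {
    return "I couldn't find anything in the image related to your question.";
  }

  // Tier 2: only now pay for high-resolution vision tokens.
  return detailedVisionModel(imageDataUrl, question);
}
```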
Key Takeaways
- Vision allows agents to interact with the world of "Unstructured Visuals."
- Audio (Real-time) requires low-latency transport such as WebSockets or WebRTC.
- Feedback must be multi-modal (if the user speaks, the agent should speak).
- Bounding boxes are the "Citations" for vision-based accuracy.