
The Multi-Sensory Agent: Unified Modalities
Engineer the ultimate autonomous system. Learn how to design LangGraph workflows that process text, images, and audio in a single, synchronized reasoning loop.
Combining Modalities in a Single Graph
We have looked at sight, sound, and text in isolation. But true intelligence is synesthetic: it combines inputs from all "senses" to reach a single, high-fidelity conclusion.
- A "Customer Service Agent" reads the message (Text), looks at the screenshot (Vision), and listens to the voicemail (Audio) simultaneously.
In this lesson, we will learn how to build a Unified State in LangGraph that acts as a central hub for all modalities.
1. The Unified Multi-Modal State
In a multi-modal agent, your TypedDict state must be expanded to handle binary and structured data.
```python
import operator
from typing import Annotated, List, TypedDict

class UnifiedState(TypedDict):
    # Text history (the operator.add reducer appends new messages on each update)
    messages: Annotated[list, operator.add]

    # Visual context
    last_screenshot: str            # Base64-encoded image
    object_coordinates: List[dict]  # e.g., [{"item": "button", "x": 10, "y": 20}]

    # Audio context
    audio_transcript: str
    audio_emotion: str              # e.g., "Frustrated", "Happy"

    # Reasoning
    current_intent: str
```
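To see why the `operator.add` reducer on `messages` matters, here is a minimal, library-free sketch of the merge semantics: the `merge_update` helper below is our own illustration of what LangGraph does internally, not part of its API.

```python
import operator

def merge_update(state: dict, update: dict) -> dict:
    """Illustrative merge: 'messages' is appended (operator.add reducer);
    every other key is simply overwritten by the node's update."""
    merged = dict(state)
    for key, value in update.items():
        if key == "messages":
            merged[key] = operator.add(merged.get(key, []), value)
        else:
            merged[key] = value
    return merged

state = {"messages": [{"role": "user", "content": "It's broken!"}],
         "audio_emotion": ""}
update = {"messages": [{"role": "assistant", "content": "Can you share a screenshot?"}],
          "audio_emotion": "Frustrated"}

state = merge_update(state, update)
print(len(state["messages"]))   # 2: messages accumulate
print(state["audio_emotion"])   # Frustrated: plain keys are replaced
```

The takeaway: chat history accumulates across nodes, while sensor fields like `audio_emotion` hold only the latest reading.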
2. The "Sensor Fusion" Node
Before the LLM "Brain" acts, you should have Pre-processing Nodes that act as the agent's "Sensors."
- Audio Node: Transcribes the user's voice and detects emotion.
- Vision Node: Identifies objects in the latest screenshot.
- Synthesis Node: The LLM reads the transcript + the object list + the chat history and decides: "The user is frustrated (Audio) because they can't find the 'Submit' button (Vision/Text). I should highlight it for them."
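The three nodes above can be sketched as plain functions. The transcription, emotion-detection, and object-detection calls are stubbed placeholders for real models, and the sequential loop at the bottom stands in for a compiled LangGraph graph (where each function would be registered with `add_node`):

```python
def audio_node(state: dict) -> dict:
    # Stub: a real node would call an STT model plus an emotion classifier.
    return {"audio_transcript": "I can't find the submit button!",
            "audio_emotion": "Frustrated"}

def vision_node(state: dict) -> dict:
    # Stub: a real node would run object detection on state["last_screenshot"].
    return {"object_coordinates": [{"item": "Submit button", "x": 840, "y": 610}]}

def synthesis_node(state: dict) -> dict:
    # The "brain": fuses transcript, emotion, and detected objects
    # into a single intent the downstream LLM can act on.
    if state["audio_emotion"] == "Frustrated" and state["object_coordinates"]:
        target = state["object_coordinates"][0]["item"]
        return {"current_intent": f"highlight the {target} for the user"}
    return {"current_intent": "ask a clarifying question"}

state = {"last_screenshot": "<base64>", "audio_transcript": "",
         "audio_emotion": "", "object_coordinates": [], "current_intent": ""}
for node in (audio_node, vision_node, synthesis_node):
    state.update(node(state))

print(state["current_intent"])  # highlight the Submit button for the user
```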
3. Orchestration: Order of Operations
The order in which you process modalities matters for Cost and Speed.
The "Text-First" Pattern
- Process Text (Cheapest).
- If text is ambiguous ("It's broken!"), trigger Vision (Expensive).
- If vision is ambiguous, trigger Audio Analysis.
Why? Most user intent is in the text. You only use the "Heavy" modalities when the high-level reasoning engine asks for them.
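The text-first escalation can be expressed as a routing function, the same shape you would hand to LangGraph's `add_conditional_edges`. The `last_user_message` key and the phrase-matching ambiguity check are deliberately naive placeholders for a real classifier:

```python
AMBIGUOUS_PHRASES = ("it's broken", "looks weird", "something is wrong")

def route_after_text(state: dict) -> str:
    """Return the name of the next node: cheap text analysis first,
    escalating to vision, then audio, only when the text is ambiguous."""
    text = state.get("last_user_message", "").lower()
    if not any(phrase in text for phrase in AMBIGUOUS_PHRASES):
        return "respond"               # text alone carried the intent
    if state.get("last_screenshot"):
        return "vision_node"           # expensive, so gated behind ambiguity
    if state.get("audio_transcript"):
        return "audio_node"
    return "request_more_context"

print(route_after_text({"last_user_message": "Reset my password"}))  # respond
print(route_after_text({"last_user_message": "It's broken!",
                        "last_screenshot": "<base64>"}))             # vision_node
```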
4. The "Multi-Modal Token" Problem
When you combine text, images, and audio, your token count explodes.
- 1 System Prompt: 1k tokens.
- 1 Screenshot: 1.5k tokens.
- 1 Transcript: 500 tokens.
- Turn 1: 3,000 tokens.
- Turn 10: 30,000+ tokens.
Solution: aggressive state cleanup. After the Vision Node extracts the button coordinates, delete the image from the state. Keep the coordinates (~100 tokens) and throw away the image (~1,500 tokens).
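That cleanup can live inside the vision node itself: return the cheap extracted facts and simultaneously blank the expensive image. `detect_objects` below is a stand-in for a real vision-model call:

```python
def detect_objects(image_b64: str) -> list:
    # Placeholder for a real object-detection call.
    return [{"item": "Submit button", "x": 840, "y": 610}]

def vision_node_with_cleanup(state: dict) -> dict:
    coords = detect_objects(state["last_screenshot"])
    # Keep the ~100-token coordinates, drop the ~1,500-token image so it
    # never re-enters the context window on later turns.
    return {"object_coordinates": coords, "last_screenshot": ""}

state = {"last_screenshot": "<1,500 tokens of base64>", "object_coordinates": []}
state.update(vision_node_with_cleanup(state))
print(state["last_screenshot"])                 # "" -> image purged
print(state["object_coordinates"][0]["item"])   # Submit button
```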
5. Tool Use Across Senses
Your tools can also be multi-modal.
- draw_circle(x, y): A tool that modifies the image the user sees.
- speak_with_tone(text, emotion): A tool that uses TTS to respond.
- search_by_image(image_url): A tool that uses vector search (e.g., CLIP embeddings) to find similar images.
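Here is a sketch of those three tools as plain functions. The drawing, TTS, and CLIP-search internals are stubbed, and in practice you would wrap each one as a LangGraph tool so the LLM can call it:

```python
def draw_circle(x: int, y: int) -> dict:
    """Annotate the user's screenshot to highlight a UI element (stubbed)."""
    return {"action": "draw_circle", "x": x, "y": y}

def speak_with_tone(text: str, emotion: str) -> dict:
    """Respond via TTS, matching the tone to the detected emotion (stubbed)."""
    tone = "calm" if emotion == "Frustrated" else "neutral"
    return {"action": "speak", "text": text, "tone": tone}

def search_by_image(image_url: str) -> list:
    """Find visually similar images; a real version would embed the image
    with CLIP and query a vector store (stubbed)."""
    return [{"url": image_url, "score": 1.0}]

print(speak_with_tone("I see the error code.", "Frustrated")["tone"])  # calm
print(draw_circle(840, 610)["action"])                                 # draw_circle
```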
6. Real-World Case Study: The IT Support Agent
- User: "My screen looks weird." (Text)
- Agent Node: Calls request_screenshot().
- User: Uploads image. (Vision)
- Agent Node: Vision analysis identifies a "Blue Screen of Death" error code.
- Agent Node: Agent speaks to the user via TTS: "I see the 0x01 error code. Please hold while I search for the fix." (Audio)
- Agent Node: Agent researches and sends the fix back as Text.
Summary and Mental Model
Think of a Multi-Modal Graph like a Flight Controller.
- They have a Radio (Audio).
- They have a Radar (Vision).
- They have a Flight Plan (Text).
A good controller (Agent) doesn't stare at the radar every second; they check it when the radio reports a problem.
Exercise: Multi-Modal Design
- The Architecture: You are building an agent for Deaf or Hard-of-Hearing users.
- The agent must translate Sign Language (Video) into Speech (TTS).
- What is the Step-by-Step Flow of nodes in this graph?
- Efficiency: Why is it better to store "Extracted Facts" from an image in the state rather than the "Image URL" itself?
- Safety: If an agent hears a "Scream" in the background of an audio file, should it continue its task or trigger a Safety_Interrupt_Node?
- How would you implement that logic in LangGraph?

You've mastered the sensors. Now, let's look at the "soul" of the agent: Long-Term Memory.