
The Multi-Sensory Agent: Unified Modalities
Engineer the ultimate autonomous system. Learn how to design LangGraph workflows that process text, images, and audio in a single, synchronized reasoning loop.
Combining Modalities in a Single Graph
We have looked at sight, sound, and text in isolation. But true intelligence is synesthetic: it combines inputs from all "senses" to reach a single, high-fidelity conclusion.
- A "Customer Service Agent" reads the message (Text), looks at the screenshot (Vision), and listens to the voicemail (Audio) simultaneously.
In this lesson, we will learn how to build a Unified State in LangGraph that acts as a central hub for all modalities.
1. The Unified Multi-Modal State
In a multi-modal agent, your TypedDict state must be expanded to handle binary and structured data.
```python
import operator
from typing import Annotated, List, TypedDict

class UnifiedState(TypedDict):
    # Text history (the operator.add reducer appends new messages on each update)
    messages: Annotated[list, operator.add]

    # Visual context
    last_screenshot: str            # Base64-encoded image
    object_coordinates: List[dict]  # e.g., [{"item": "button", "x": 10, "y": 20}]

    # Audio context
    audio_transcript: str
    audio_emotion: str              # e.g., "Frustrated", "Happy"

    # Reasoning
    current_intent: str
```
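To see why the `operator.add` reducer on `messages` matters, here is a minimal, library-free sketch of the merge semantics: the `merge_update` helper below is our own illustration of what LangGraph does internally, not part of its API.

```python
import operator

def merge_update(state: dict, update: dict) -> dict:
    """Illustrative merge: 'messages' is appended (operator.add reducer);
    every other key is simply overwritten by the node's update."""
    merged = dict(state)
    for key, value in update.items():
        if key == "messages":
            merged[key] = operator.add(merged.get(key, []), value)
        else:
            merged[key] = value
    return merged

state = {"messages": [{"role": "user", "content": "It's broken!"}],
         "audio_emotion": ""}
update = {"messages": [{"role": "assistant", "content": "Can you share a screenshot?"}],
          "audio_emotion": "Frustrated"}

state = merge_update(state, update)
print(len(state["messages"]))   # 2: messages accumulate
print(state["audio_emotion"])   # Frustrated: plain keys are replaced
```

The takeaway: chat history accumulates across nodes, while sensor fields like `audio_emotion` hold only the latest reading.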
2. The "Sensor Fusion" Node
Before the LLM "Brain" acts, you should have Pre-processing Nodes that act as the agent's "Sensors."
- Audio Node: Transcribes the user's voice and detects emotion.
- Vision Node: Identifies objects in the latest screenshot.
- Synthesis Node: The LLM reads the transcript + the object list + the chat history and decides: "The user is frustrated (Audio) because they can't find the 'Submit' button (Vision/Text). I should highlight it for them."
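The three nodes above can be sketched as plain functions. The transcription, emotion-detection, and object-detection calls are stubbed placeholders for real models, and the sequential loop at the bottom stands in for a compiled LangGraph graph (where each function would be registered with `add_node`):

```python
def audio_node(state: dict) -> dict:
    # Stub: a real node would call an STT model plus an emotion classifier.
    return {"audio_transcript": "I can't find the submit button!",
            "audio_emotion": "Frustrated"}

def vision_node(state: dict) -> dict:
    # Stub: a real node would run object detection on state["last_screenshot"].
    return {"object_coordinates": [{"item": "Submit button", "x": 840, "y": 610}]}

def synthesis_node(state: dict) -> dict:
    # The "brain": fuses transcript, emotion, and detected objects
    # into a single intent the downstream LLM can act on.
    if state["audio_emotion"] == "Frustrated" and state["object_coordinates"]:
        target = state["object_coordinates"][0]["item"]
        return {"current_intent": f"highlight the {target} for the user"}
    return {"current_intent": "ask a clarifying question"}

state = {"last_screenshot": "<base64>", "audio_transcript": "",
         "audio_emotion": "", "object_coordinates": [], "current_intent": ""}
for node in (audio_node, vision_node, synthesis_node):
    state.update(node(state))

print(state["current_intent"])  # highlight the Submit button for the user
```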
3. Orchestration: Order of Operations
The order in which you process modalities matters for Cost and Speed.
The "Text-First" Pattern
- Process Text (Cheapest).
- If text is ambiguous ("It's broken!"), trigger Vision (Expensive).
- If vision is ambiguous, trigger Audio Analysis.
Why? Most user intent is in the text. You only use the "Heavy" modalities when the high-level reasoning engine asks for them.
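The text-first escalation can be expressed as a routing function, the same shape you would hand to LangGraph's `add_conditional_edges`. The `last_user_message` key and the phrase-matching ambiguity check are deliberately naive placeholders for a real classifier:

```python
AMBIGUOUS_PHRASES = ("it's broken", "looks weird", "something is wrong")

def route_after_text(state: dict) -> str:
    """Return the name of the next node: cheap text analysis first,
    escalating to vision, then audio, only when the text is ambiguous."""
    text = state.get("last_user_message", "").lower()
    if not any(phrase in text for phrase in AMBIGUOUS_PHRASES):
        return "respond"               # text alone carried the intent
    if state.get("last_screenshot"):
        return "vision_node"           # expensive, so gated behind ambiguity
    if state.get("audio_transcript"):
        return "audio_node"
    return "request_more_context"

print(route_after_text({"last_user_message": "Reset my password"}))  # respond
print(route_after_text({"last_user_message": "It's broken!",
                        "last_screenshot": "<base64>"}))             # vision_node
```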
4. The "Multi-Modal Token" Problem
When you combine text, images, and audio, your token count explodes.
- 1 System Prompt: 1k tokens.
- 1 Screenshot: 1.5k tokens.
- 1 Transcript: 500 tokens.
- Turn 1: 3,000 tokens.
- Turn 10: 30,000+ tokens.
Solution: aggressive state cleanup. After the Vision Node extracts the button coordinates, delete the image from the state. Keep the coordinates (~100 tokens) and throw away the image (~1,500 tokens).
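That cleanup can live inside the vision node itself: return the cheap extracted facts and simultaneously blank the expensive image. `detect_objects` below is a stand-in for a real vision-model call:

```python
def detect_objects(image_b64: str) -> list:
    # Placeholder for a real object-detection call.
    return [{"item": "Submit button", "x": 840, "y": 610}]

def vision_node_with_cleanup(state: dict) -> dict:
    coords = detect_objects(state["last_screenshot"])
    # Keep the ~100-token coordinates, drop the ~1,500-token image so it
    # never re-enters the context window on later turns.
    return {"object_coordinates": coords, "last_screenshot": ""}

state = {"last_screenshot": "<1,500 tokens of base64>", "object_coordinates": []}
state.update(vision_node_with_cleanup(state))
print(state["last_screenshot"])                 # "" -> image purged
print(state["object_coordinates"][0]["item"])   # Submit button
```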
5. Tool Use Across Senses
Your tools can also be multi-modal.
- draw_circle(x, y): A tool that modifies the image the user sees.
- speak_with_tone(text, emotion): A tool that uses TTS to respond.
- search_by_image(image_url): A tool that uses vector search (e.g., CLIP embeddings) to find similar images.
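Here is a sketch of those three tools as plain functions. The drawing, TTS, and CLIP-search internals are stubbed, and in practice you would wrap each one as a LangGraph tool so the LLM can call it:

```python
def draw_circle(x: int, y: int) -> dict:
    """Annotate the user's screenshot to highlight a UI element (stubbed)."""
    return {"action": "draw_circle", "x": x, "y": y}

def speak_with_tone(text: str, emotion: str) -> dict:
    """Respond via TTS, matching the tone to the detected emotion (stubbed)."""
    tone = "calm" if emotion == "Frustrated" else "neutral"
    return {"action": "speak", "text": text, "tone": tone}

def search_by_image(image_url: str) -> list:
    """Find visually similar images; a real version would embed the image
    with CLIP and query a vector store (stubbed)."""
    return [{"url": image_url, "score": 1.0}]

print(speak_with_tone("I see the error code.", "Frustrated")["tone"])  # calm
print(draw_circle(840, 610)["action"])                                 # draw_circle
```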
6. Real-World Case Study: The IT Support Agent
- User: "My screen looks weird." (Text)
- Agent Node: Calls request_screenshot().
- User: Uploads image. (Vision)
- Agent Node: Vision analysis identifies a "Blue Screen of Death" error code.
- Agent Node: Agent speaks to the user via TTS: "I see the 0x01 error code. Please hold while I search for the fix." (Audio)
- Agent Node: Agent researches and sends the fix back as Text.
Summary and Mental Model
Think of a Multi-Modal Graph like a Flight Controller.
- They have a Radio (Audio).
- They have a Radar (Vision).
- They have a Flight Plan (Text).
A good controller (Agent) doesn't stare at the radar every second; they check it when the radio reports a problem.
Exercise: Multi-Modal Design
- The Architecture: You are building an agent for Deaf or Hard-of-Hearing users.
- The agent must translate Sign Language (Video) into Speech (TTS).
- What is the Step-by-Step Flow of nodes in this graph?
- Efficiency: Why is it better to store "Extracted Facts" from an image in the state rather than the "Image URL" itself?
- Safety: If an agent hears a "Scream" in the background of an audio file, should it continue its task or trigger a Safety_Interrupt_Node?
- How would you implement that logic in LangGraph?

You've mastered the sensors. Now, let's look at the "soul" of the agent: Long-Term Memory.