Multimodal AI: Teaching Machines to See and Hear

Expand your AI's senses. Learn how to build applications that process images, analyze videos, and respond to audio prompts using state-of-the-art multimodal models like GPT-4o and Gemini 1.5 Pro.

For most of this course, we have focused on text. But the world is not just text. Real intelligence requires the ability to interpret images, watch videos, and hear the nuances of human speech. This is Multimodal AI.

In this lesson, we explore how LLM Engineers integrate "Sensory Data" into their agentic workflows.


1. What is a Multimodal Model?

A multimodal model (like GPT-4o or Gemini 1.5 Pro) has been trained on a mixture of text, images, and audio simultaneously. This means it doesn't just "Describe an image"; it Reasons across different types of data.

Example: You show the model a photo of a refrigerator and ask: "What can I cook for dinner?"

  1. Vision: The model identifies "Milk, Eggs, Spinach."
  2. Reasoning: It remembers a recipe for a Frittata.
  3. Output: It writes the steps in text.

2. Image Processing for Agents

Use Cases:

  • OCR 2.0: Instead of using a dedicated OCR tool, you show the LLM a messy handwritten receipt. It understands the "Layout" and "Intent" much better than traditional software.
  • UI Testing: An agent that "Looks" at a website to find broken buttons or visual bugs.
  • Medical Analysis: Analyzing X-rays to identify possible fractures (with human oversight).

A typical receipt-processing flow looks like this:

graph LR
    A[Raw Image: Receipt] --> B[Multimodal LLM]
    B --> C[Structured JSON: Total, Items, Tax]
    B --> D[Reasoning: 'This looks like a fraudulent expense']
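
To make the diagram concrete, here is a minimal sketch of that receipt pipeline using the OpenAI Python SDK. The model name, prompt wording, and the receipt.jpg file path are illustrative assumptions, not requirements.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Encode the receipt photo as base64 (hypothetical file name)
with open("receipt.jpg", "rb") as f:
    receipt_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model works here
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract the total, line items, and tax from this receipt as JSON."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{receipt_b64}"}},
        ],
    }],
    response_format={"type": "json_object"},  # ask for structured JSON back
)

print(response.choices[0].message.content)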

3. Audio and Video: The Next Frontier

Video (Gemini 1.5 Pro)

Gemini 1.5 Pro can process up to 1 hour of video in its context window. You can ask: "At what timestamp does the person in the video pick up the red cup?"
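
As a rough sketch, here is how that video question might look with Google's google-generativeai SDK. The file name, API key handling, and polling loop are illustrative; check the current SDK docs for the exact File API flow.

import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder; load from a secret store in practice

# Upload the video via the File API and wait for server-side processing
video_file = genai.upload_file(path="kitchen_scene.mp4")
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    [video_file, "At what timestamp does the person pick up the red cup?"]
)
print(response.text)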

Audio (Speech-to-Speech)

Newer models (like GPT-4o's Voice Mode) process raw audio waveforms directly. They can detect the Emotion, Sarcasm, and Tone of the speaker, nuances that are lost in a traditional "Speech-to-Text" transcription.
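
If you want to pass raw audio through an API rather than a transcript, OpenAI exposes audio-capable chat models. The sketch below assumes the gpt-4o-audio-preview model and an input_audio content part; treat the model name and payload shape as assumptions to verify against the current API docs.

import base64
from openai import OpenAI

client = OpenAI()

# Encode a local WAV recording (hypothetical file name)
with open("customer_call.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",  # assumed audio-capable model name
    modalities=["text"],           # we only need a text answer back
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe the speaker's tone. Do they sound frustrated or sarcastic?"},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)
print(response.choices[0].message.content)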


4. The Engineering Workflow for Images

When you send a local image to an LLM, you don't attach the file itself. You send a Base64-encoded string embedded directly in the request payload (most providers also accept a publicly hosted image URL).

import base64

def encode_image(image_path):
    # Read the image bytes and return them as a base64 string
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image = encode_image("receipt.jpg")  # example path, replace with your image

# The message payload sent to the API
payload = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this image?"},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
        },
    ],
}
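
To complete the round trip, the payload above is passed as a message to a chat completion call. A minimal sketch with the OpenAI Python SDK (the model name is just an example of a vision-capable model):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

response = client.chat.completions.create(
    model="gpt-4o",       # any vision-capable model
    messages=[payload],   # the multimodal message built above
)
print(response.choices[0].message.content)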

5. Cost and Latency of Multimodality

Processing a single high-resolution image can cost on the order of 1,000 text tokens. It also adds latency, because the attention mechanism has to work through hundreds of extra image tokens per request.

LLM Engineer Strategy: Use Image Resizing. Downscale your images to roughly 512x512 before sending them to the model. This preserves enough detail for most tasks while cutting token costs and latency substantially.
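
A minimal resizing sketch using Pillow (the 512x512 target and JPEG quality are reasonable defaults, not hard requirements):

import base64
import io

from PIL import Image

def encode_image_resized(image_path, max_size=(512, 512)):
    # Downscale in place (preserving aspect ratio), then re-encode as JPEG
    img = Image.open(image_path)
    img.thumbnail(max_size)
    buffer = io.BytesIO()
    img.convert("RGB").save(buffer, format="JPEG", quality=85)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

base64_image = encode_image_resized("receipt.jpg")  # drop-in replacement for encode_image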


Summary

  • Multimodal Models allow agents to reason across Text, Image, and Audio.
  • Vision is the most mature modality, used for OCR, UI testing, and visual analysis.
  • Base64 encoding is the standard way to embed local images in your Python payloads.
  • Optimization via resizing is critical to managing multimodal costs.

In the next lesson, we will look at Long-Term Memory, moving from short-term context to databases that allow an AI to "Know" a user for years.


Exercise: The Accessibility Agent

You are building an app to help visually impaired users. They take a photo of a shelf in a grocery store.

  1. What is the First Step your agent should take?
  2. Why would a Multimodal LLM be better than a traditional "Object Detection" model (like YOLO)?

Answer Logic:

  1. Contextual Analysis: Identify the specific items on the shelf.
  2. Reasoning: A traditional model might say "Box 1, Box 2." A Multimodal LLM can say: "There are two types of cereal. One is the organic brand you usually buy, and it's on sale for $4.00." It understands the Context and the User's History.
