Module 12 Lesson 1: Multimodality (Beyond Text)

Language is only the beginning. In this lesson, we explore Multimodality—the shift from Large Language Models to Large Multimodal Models that can see, hear, and speak.

For the last 11 modules, we have talked about LLMs as text-processing machines. But the future is Multimodal.

A multimodal model (like GPT-4o, Claude 3.5, or Gemini 1.5) doesn't just read words; it "sees" images and "hears" audio directly, without needing a separate translator. In this lesson, we learn how AI is expanding its senses.


1. Everything is a Token

The secret to multimodality is that we can turn anything into a vector (Module 3).

  • Text: sliced into subword tokens.
  • Images: sliced into small 16x16 squares (patches), which are then converted into "Visual Tokens."
  • Audio: the sound wave is sampled and converted into "Audio Tokens."

Because the Transformer (Module 5) only cares about the relationships between tokens, it doesn't care if the tokens came from a dictionary or a camera. It can "Attend" to a visual token as easily as it attends to a word.
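
To make the image-to-token idea concrete, here is a minimal sketch in Python with NumPy. The array contents and sizes are illustrative, not taken from any particular model; it shows only the first step a vision model performs, slicing an image into 16x16 patches and flattening each patch into a vector.

```python
import numpy as np

# A toy "image": 224x224 pixels, 3 color channels.
# (A real pipeline would load an actual photo here.)
image = np.random.rand(224, 224, 3)

PATCH = 16                  # side length of each square patch
grid = 224 // PATCH         # 14 patches along each side

# Cut the image into a 14x14 grid of 16x16 patches -> shape (14, 14, 16, 16, 3)
patches = image.reshape(grid, PATCH, grid, PATCH, 3).transpose(0, 2, 1, 3, 4)

# Flatten each patch into a single vector: 196 patches, each 16*16*3 = 768 numbers long.
visual_tokens = patches.reshape(-1, PATCH * PATCH * 3)

print(visual_tokens.shape)  # (196, 768) -- one row per "visual token"
```

In a real model, a learned linear layer then projects each flattened patch into the same embedding size as the text tokens, so the Transformer can attend across both modalities in one sequence.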


2. Shared Vector Space

In a multimodal model, the word "Apple" (Text) and a picture of a red fruit (Image) are mapped to nearby points in the model's high-dimensional space.

  • This allows the model to "understand" that the text description and the visual appearance are two ways of saying the same thing.

```mermaid
graph TD
    Text["Text Input: 'A sunny beach'"] --> VectorSpace["Shared Vector Space"]
    Image["Image Input: [JPEG Data]"] --> VectorSpace
    Audio["Audio Input: [WAV Data]"] --> VectorSpace
    VectorSpace --> Brain["Large Multimodal Model (LMM)"]
    Brain --> Output["Combined Insight: 'This is a photo of Hawaii'"]
```
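
CLIP-style encoders are the classic demonstration of a shared text-image space. The sketch below assumes the Hugging Face transformers library (plus PyTorch and Pillow) and the public openai/clip-vit-base-patch32 checkpoint; the filename beach.jpg is a placeholder. It embeds one image and two candidate captions into the same space and scores how well each caption matches the picture.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Load a small, publicly available pair of text and image encoders trained to share one space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("beach.jpg")                     # placeholder filename: any local photo
captions = ["a sunny beach", "a snowy mountain"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# A higher score means the caption and the image landed closer together in the shared space.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0]):
    print(f"{caption}: {p.item():.2f}")
```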

3. Why it matters for Utility

Multimodality changes the way we use AI:

  • Law: Drop in a 50-page PDF and a photo of a handwritten signature, and ask whether the signature matches the one in the document.
  • Medicine: Show a photo of a skin rash and ask for a textbook description (retrieved via RAG) of similar symptoms.
  • Education: Record a video of a science experiment and ask the AI to explain, in real time, the physics at work.

4. The "Native" vs. "Stitched" Model

  • Stitched Models: Early systems chained two models together: a vision model described the image in text, and that description was handed to an LLM. (Slower, and detail is lost in translation.)
  • Native Models: Modern models are trained on text, images, and audio simultaneously. This makes them much faster and lets them pick up nuance (like the "tone" of a voice or the "vibe" of a photo) that a text description would miss.
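
To contrast the two pipelines in code, here is a deliberately toy sketch. The functions caption_image, text_llm, and native_lmm are hypothetical stand-ins for real models, not library calls, so the example runs with no dependencies.

```python
# Hypothetical stand-ins for real models, so the sketch runs end to end.
def caption_image(image: bytes) -> str:
    return "a person signing a document at a desk"      # output of a separate vision model

def text_llm(prompt: str) -> str:
    return f"[LLM answer based only on: {prompt!r}]"

def native_lmm(image: bytes, text: str) -> str:
    return "[answer produced while attending to visual tokens directly]"

# Stitched pipeline: the LLM only ever sees the caption; anything the caption omits is gone.
def stitched_answer(image: bytes, question: str) -> str:
    caption = caption_image(image)
    return text_llm(f"Image description: {caption}\n\nQuestion: {question}")

# Native pipeline: image and text enter the same model as one token sequence.
def native_answer(image: bytes, question: str) -> str:
    return native_lmm(image=image, text=question)

print(stitched_answer(b"<jpeg bytes>", "Whose signature is on the document?"))
print(native_answer(b"<jpeg bytes>", "Whose signature is on the document?"))
```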

Lesson Exercise

The Visual Logic Test:

  1. Take a photo of a complex scene (e.g., your desk or a busy street).
  2. Upload it to a multimodal AI.
  3. Ask: "What is the smallest object in this photo, and where is it relative to the largest object?"
  4. Notice how the AI uses "Spatial Reasoning" alongside its language skills.

Observation: You'll see that the AI isn't just "identifying" objects; it is understanding the layout of the world in the same way it understands the layout of a sentence.
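
If you would rather run the exercise from code than from a chat window, here is a minimal sketch using the OpenAI Python SDK. The model name gpt-4o and the filename desk.jpg are assumptions; any multimodal model that accepts images would work the same way.

```python
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Encode the photo as a base64 data URL so it can ride along with the text prompt.
with open("desk.jpg", "rb") as f:                       # placeholder filename
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",                                     # assumed model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What is the smallest object in this photo, and where is it "
                     "relative to the largest object?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)

print(response.choices[0].message.content)
```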


Summary

In this lesson, we established:

  • Multimodality converts images, audio, and video into tokens.
  • A shared vector space allows different senses to be combined into one understanding.
  • "Native" multimodal models are the new standard for fast, high-nuance AI.

Next Lesson: We look at Long-Term Memory. We'll learn how models are moving toward Personalization and whether the context window will eventually become infinite.
