Module 12 Lesson 1: Multimodality (Beyond Text)

Language is only the beginning. In this lesson, we explore Multimodality—the shift from Large Language Models to Large Multimodal Models that can see, hear, and speak.

For the last 11 modules, we have talked about LLMs as text-processing machines. But the future is Multimodal.

A multimodal model (like GPT-4o, Claude 3.5, or Gemini 1.5) doesn't just read words; it "sees" images and "hears" audio directly, without needing a separate translator. In this lesson, we learn how AI is expanding its senses.


1. Everything is a Token

The secret to multimodality is that we can turn anything into a vector (Module 3).

  • Text: sliced into subword tokens.
  • Images: sliced into small 16x16 squares (patches), which are then converted into "Visual Tokens."
  • Audio: the sound wave is sampled and converted into "Audio Tokens."

Because the Transformer (Module 5) only cares about the relationships between tokens, it doesn't care if the tokens came from a dictionary or a camera. It can "Attend" to a visual token as easily as it attends to a word.
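
To make the image-to-token idea concrete, here is a minimal sketch in Python with NumPy. The array contents and sizes are illustrative, not taken from any particular model; it shows only the first step a vision model performs, slicing an image into 16x16 patches and flattening each patch into a vector.

```python
import numpy as np

# A toy "image": 224x224 pixels, 3 color channels.
# (A real pipeline would load an actual photo here.)
image = np.random.rand(224, 224, 3)

PATCH = 16                  # side length of each square patch
grid = 224 // PATCH         # 14 patches along each side

# Cut the image into a 14x14 grid of 16x16 patches -> shape (14, 14, 16, 16, 3)
patches = image.reshape(grid, PATCH, grid, PATCH, 3).transpose(0, 2, 1, 3, 4)

# Flatten each patch into a single vector: 196 patches, each 16*16*3 = 768 numbers long.
visual_tokens = patches.reshape(-1, PATCH * PATCH * 3)

print(visual_tokens.shape)  # (196, 768) -- one row per "visual token"
```

In a real model, a learned linear layer then projects each flattened patch into the same embedding size as the text tokens, so the Transformer can attend across both modalities in one sequence.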


2. Shared Vector Space

In a multimodal model, the word "Apple" (Text) and a picture of a red fruit (Image) are mapped to nearby points in the model's high-dimensional space.

  • This allows the model to "understand" that the text description and the visual appearance are two ways of saying the same thing.

```mermaid
graph TD
    Text["Text Input: 'A sunny beach'"] --> VectorSpace["Shared Vector Space"]
    Image["Image Input: [JPEG Data]"] --> VectorSpace
    Audio["Audio Input: [WAV Data]"] --> VectorSpace
    VectorSpace --> Brain["Large Multimodal Model (LMM)"]
    Brain --> Output["Combined Insight: 'This is a photo of Hawaii'"]
```
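
CLIP-style encoders are the classic demonstration of a shared text-image space. The sketch below assumes the Hugging Face transformers library (plus PyTorch and Pillow) and the public openai/clip-vit-base-patch32 checkpoint; the filename beach.jpg is a placeholder. It embeds one image and two candidate captions into the same space and scores how well each caption matches the picture.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Load a small, publicly available pair of text and image encoders trained to share one space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("beach.jpg")                     # placeholder filename: any local photo
captions = ["a sunny beach", "a snowy mountain"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# A higher score means the caption and the image landed closer together in the shared space.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0]):
    print(f"{caption}: {p.item():.2f}")
```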

3. Why it matters for Utility

Multimodality changes the way we use AI:

  • Law: Drop in a 50-page PDF and a photo of a handwritten signature, and ask whether the signature matches the one in the document.
  • Medicine: Show a photo of a skin rash and ask for a textbook description (retrieved via RAG) of similar symptoms.
  • Education: Record a video of a science experiment and ask the AI to explain, in real time, the physics at work.

4. The "Native" vs. "Stitched" Model

  • Stitched Models: Early systems chained two models together: a vision model described the image in text, and that description was handed to an LLM. (Slower, and detail is lost in translation.)
  • Native Models: Modern models are trained on text, images, and audio simultaneously. This makes them much faster and lets them pick up nuance (like the "tone" of a voice or the "vibe" of a photo) that a text description would miss.
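
To contrast the two pipelines in code, here is a deliberately toy sketch. The functions caption_image, text_llm, and native_lmm are hypothetical stand-ins for real models, not library calls, so the example runs with no dependencies.

```python
# Hypothetical stand-ins for real models, so the sketch runs end to end.
def caption_image(image: bytes) -> str:
    return "a person signing a document at a desk"      # output of a separate vision model

def text_llm(prompt: str) -> str:
    return f"[LLM answer based only on: {prompt!r}]"

def native_lmm(image: bytes, text: str) -> str:
    return "[answer produced while attending to visual tokens directly]"

# Stitched pipeline: the LLM only ever sees the caption; anything the caption omits is gone.
def stitched_answer(image: bytes, question: str) -> str:
    caption = caption_image(image)
    return text_llm(f"Image description: {caption}\n\nQuestion: {question}")

# Native pipeline: image and text enter the same model as one token sequence.
def native_answer(image: bytes, question: str) -> str:
    return native_lmm(image=image, text=question)

print(stitched_answer(b"<jpeg bytes>", "Whose signature is on the document?"))
print(native_answer(b"<jpeg bytes>", "Whose signature is on the document?"))
```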

Lesson Exercise

The Visual Logic Test:

  1. Take a photo of a complex scene (e.g., your desk or a busy street).
  2. Upload it to a multimodal AI.
  3. Ask: "What is the smallest object in this photo, and where is it relative to the largest object?"
  4. Notice how the AI uses "Spatial Reasoning" alongside its language skills.

Observation: You'll see that the AI isn't just "identifying" objects; it is understanding the layout of the world in the same way it understands the layout of a sentence.
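
If you would rather run the exercise from code than from a chat window, here is a minimal sketch using the OpenAI Python SDK. The model name gpt-4o and the filename desk.jpg are assumptions; any multimodal model that accepts images would work the same way.

```python
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Encode the photo as a base64 data URL so it can ride along with the text prompt.
with open("desk.jpg", "rb") as f:                       # placeholder filename
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",                                     # assumed model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What is the smallest object in this photo, and where is it "
                     "relative to the largest object?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)

print(response.choices[0].message.content)
```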


Summary

In this lesson, we established:

  • Multimodality converts images, audio, and video into tokens.
  • A shared vector space allows different senses to be combined into one understanding.
  • "Native" multimodal models are the new standard for fast, high-nuance AI.

Next Lesson: We look at Long-Term Memory. We'll learn how models are moving toward Personalization and whether the context window will eventually become infinite.
