Overview of Multimodal Models

Understanding the landscape of multimodal LLMs and their capabilities across text, vision, and audio.

Multimodal models can process and reason across multiple types of input—text, images, audio, and even video. This lesson explores the current landscape.

What Makes a Model "Multimodal"?

graph TD
    A[Traditional LLM] --> B[Text Input Only]
    B --> C[Text Output Only]
    
    D[Multimodal LLM] --> E[Text + Images + Audio]
    E --> F[Text Output with Cross-Modal Understanding]
    
    style A fill:#fff3cd
    style D fill:#d4edda

Key Capabilities:

  1. Vision + Language: Understand images and answer questions about them
  2. Audio + Language: Process speech and audio context
  3. Cross-Modal Reasoning: Connect information across modalities
  4. Unified Representation: Shared understanding space

Major Multimodal Models (2024-2026)

GPT-4 Vision (GPT-4V)

Capabilities:

  • Text + image understanding
  • OCR and diagram interpretation
  • Visual question answering
  • Chart and graph analysis

Limitations:

  • No native audio processing
  • Rate limits on vision tokens
  • Higher cost for image inputs
# Conceptual usage
response = gpt4v.generate({
    "text": "What's in this image?",
    "image": product_photo.jpg
})

Claude 3.5 Sonnet (Anthropic)

Capabilities:

  • Best-in-class vision understanding
  • Code generation from screenshots
  • Complex document analysis
  • Long context window (200K tokens)
  • Strong reasoning over multimodal inputs

When to Use:

  • High-accuracy requirements
  • Complex reasoning tasks
  • Large document processing
  • Production RAG systems
# Conceptual: Claude with image
response = claude.messages.create({
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Explain this diagram"},
            {"type": "image", "source": diagram.png}
        ]
    }]
})

Gemini 1.5 Pro (Google)

Capabilities:

  • Text, image, audio, and video
  • Massive context window (1M+ tokens)
  • Native video understanding
  • Real-time data processing

Unique Strengths:

  • Process entire movies
  • Multi-hour audio transcripts
  • Large codebase analysis
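
A rough sketch of calling Gemini 1.5 with non-text input, using the google-generativeai SDK (the model name, file, and prompt are illustrative; check the SDK docs for current parameters):
# Sketch: Gemini 1.5 with a video file via the google-generativeai SDK
# (requires an API key; large files may need a short processing wait before use)
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

recording = genai.upload_file("all_hands_recording.mp4")  # illustrative file
response = model.generate_content([
    recording,
    "Summarize the decisions made in this meeting and list the action items.",
])
print(response.text)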

LLaVA and Open-Source Models

Models:

  • LLaVA (Large Language and Vision Assistant)
  • BakLLaVA
  • CogVLM

Advantages:

  • Fully open source
  • Run locally with Ollama
  • No API costs
  • Complete data privacy

Trade-offs:

  • Lower accuracy than commercial models
  • Higher compute requirements
  • Less mature tooling
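
If you want to try this locally, here is a minimal sketch against Ollama's HTTP generate endpoint (it assumes `ollama pull llava` has been run and the server is listening on its default port):
# Sketch: query a local LLaVA model through Ollama's HTTP API
import base64
import requests

with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "llava",
    "prompt": "What trend does this chart show?",
    "images": [image_b64],   # images are passed as base64 strings
    "stream": False,
})
print(resp.json()["response"])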

Model Capabilities Comparison

| Model      | Vision       | Audio | Video | Context Window | Best For        |
|------------|--------------|-------|-------|----------------|-----------------|
| GPT-4V     | ✅ Excellent | ❌    | ❌    | 128K           | General purpose |
| Claude 3.5 | ✅ Best      | ❌    | ❌    | 200K           | Production RAG  |
| Gemini 1.5 | ✅ Strong    | ✅    | ✅    | 1M+            | Long context    |
| LLaVA      | ✅ Good      | ❌    | ❌    | 32K            | Local/private   |

How Multimodal Models Work

graph LR
    A[Input Data] --> B{Input Type}
    
    B -->|Text| C[Text Encoder]
    B -->|Image| D[Vision Encoder]
    B -->|Audio| E[Audio Encoder]
    
    C & D & E --> F[Unified Representation]
    F --> G[Transformer Decoder]
    G --> H[Text Output]

The Architecture

  1. Separate Encoders: Each modality has a specialized encoder

    • Vision: Processes images into embeddings (e.g., CLIP)
    • Audio: Converts sound to features (e.g., Whisper)
    • Text: Tokenizes and embeds text
  2. Projection Layer: Maps different modalities to shared space

  3. Unified Transformer: Processes combined representations

  4. Decoder: Generates text responses
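
To make the flow concrete, here is a toy sketch of the encode, project, fuse, decode pattern in PyTorch. Every dimension and module is a placeholder, not any specific model's internals:
# Toy illustration of the encode -> project -> fuse -> decode pattern
import torch
import torch.nn as nn

class ToyMultimodalLM(nn.Module):
    def __init__(self, d_model=256, vocab_size=1000):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)    # text "encoder"
        self.vision_proj = nn.Linear(512, d_model)             # image features -> LLM space
        self.audio_proj = nn.Linear(128, d_model)              # audio features -> LLM space
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, image_feats, audio_feats):
        tokens = torch.cat([
            self.vision_proj(image_feats),  # image patches as "soft tokens"
            self.audio_proj(audio_feats),   # audio frames as "soft tokens"
            self.text_embed(text_ids),      # ordinary text tokens
        ], dim=1)
        return self.lm_head(self.backbone(tokens))  # next-token logits

# One forward pass on random data: 16 image patches, 8 audio frames, 10 text tokens
model = ToyMultimodalLM()
logits = model(torch.randint(0, 1000, (1, 10)), torch.randn(1, 16, 512), torch.randn(1, 8, 128))
print(logits.shape)  # torch.Size([1, 34, 1000])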

Training Process

graph TD
    A[Pretraining] --> B[Vision-Language Datasets]
    B --> C[Image-Caption Pairs]
    B --> D[OCR Datasets]
    B --> E[Diagram-Text Pairs]
    
    C & D & E --> F[Contrastive Learning]
    F --> G[Aligned Embeddings]
    
    H[Fine-Tuning] --> I[Instruction Following]
    I --> J[Task-Specific Data]
    J --> K[Production Model]

Pretraining:

  • Learn visual concepts
  • Align image and text embeddings
  • Billions of image-text pairs
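
The contrastive objective can be illustrated with a minimal CLIP-style loss; the embeddings below are random placeholders, so this is a sketch of the math rather than a training recipe:
# Toy CLIP-style contrastive loss: matching image/caption pairs are pulled
# together, mismatched pairs pushed apart
import torch
import torch.nn.functional as F

image_emb = F.normalize(torch.randn(8, 256), dim=-1)  # batch of 8 image embeddings
text_emb = F.normalize(torch.randn(8, 256), dim=-1)   # the 8 paired caption embeddings

logits = image_emb @ text_emb.T / 0.07                 # cosine similarity / temperature
targets = torch.arange(8)                              # i-th image matches i-th caption
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.T, targets)) / 2
print(loss)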

Fine-Tuning:

  • Task-specific training
  • Instruction following
  • Safety and alignment

Specialized Capabilities

OCR and Document Understanding

Models can extract text from:

  • Scanned documents
  • Handwritten notes
  • Complex layouts (multi-column, tables)
  • Low-quality images
Input: Photo of handwritten receipt
Output: Structured data {date, items, total}
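
As a concrete sketch, asking a vision model for structured output might look like this (shown here with the Anthropic SDK as one option; the model name, file, and prompt are illustrative, and an API key is required):
# Sketch: structured receipt extraction with a vision-capable model
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
with open("receipt.jpg", "rb") as f:
    receipt_b64 = base64.b64encode(f.read()).decode()

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64",
                                         "media_type": "image/jpeg",
                                         "data": receipt_b64}},
            {"type": "text", "text": "Extract the date, items, and total as JSON."},
        ],
    }],
)
print(message.content[0].text)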

Chart and Diagram Comprehension

graph LR
    A[Chart Image] --> B[Model]
    B --> C[Extract Data Points]
    B --> D[Understand Trends]
    B --> E[Answer Questions]
    
    F["Revenue Chart"] --> B
    B --> G["Q4 revenue was $2.3M, up 15% QoQ"]

Spatial Reasoning

Models can understand:

  • Object positions ("to the left of")
  • Relative sizes
  • Layouts and arrangements
  • 3D perspectives

Code from Screenshots

Input: Screenshot of UI mockup
Output: React/HTML code implementing the design

Embedding Models for Multimodal RAG

Different from generation models, embedding models convert data to vectors:

CLIP (OpenAI)

  • Text and image embeddings in shared space
  • Enables text-to-image and image-to-text search
  • Widely used standard
# Conceptual: CLIP embeddings
text_emb = clip.embed_text("a red car")
image_emb = clip.embed_image("car_photo.jpg")

similarity = cosine_sim(text_emb, image_emb)  # High if image shows red car
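
For a runnable counterpart to the conceptual snippet above, here is a small text-to-image retrieval sketch using a CLIP checkpoint exposed through sentence-transformers (the model name and file paths are illustrative):
# Sketch: text-to-image retrieval with a CLIP model via sentence-transformers
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

image_paths = ["slide_01.png", "slide_02.png", "slide_03.png"]  # your image corpus
image_embs = model.encode([Image.open(p) for p in image_paths])

query_emb = model.encode("the architecture diagram with three encoders")
scores = util.cos_sim(query_emb, image_embs)[0]
best = int(scores.argmax())
print(f"Best match: {image_paths[best]} (score={float(scores[best]):.2f})")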

ImageBind (Meta)

  • Binds 6 modalities: text, image, audio, depth, thermal, IMU
  • Enables cross-modal search

Multimodal Embeddings from Gemini/Bedrock

  • API-based embedding services
  • Optimized for retrieval
  • Handle images + text

Choosing the Right Model

graph TD
    A{Requirements} --> B{Privacy Critical?}
    
    B -->|Yes| C[Local Models]
    B -->|No| D{Budget?}
    
    C --> E[LLaVA/Ollama]
    
    D -->|Flexible| F{Use Case?}
    D -->|Limited| G[Open Source]
    
    F -->|Complex Docs| H[Claude 3.5]
    F -->|Long Context| I[Gemini 1.5]
    F -->|General| J[GPT-4V]

Decision Criteria:

  1. Privacy: Local vs Cloud
  2. Accuracy: Performance requirements
  3. Cost: API pricing
  4. Latency: Response time needs
  5. Context Length: Document size
  6. Modalities: What inputs are needed?

The Future: True Multimodality

Current models are still primarily text-centric with visual understanding bolted on.

Next Generation:

  • Native video understanding (not just frame extraction)
  • Real-time multimodal streaming
  • Audio-visual reasoning
  • Cross-modal generation (text → image, image → sound)

Key Takeaways

  1. Multimodal models are production-ready for RAG systems
  2. Claude 3.5 Sonnet leads in document understanding
  3. Gemini excels at long context
  4. Open-source options exist for privacy-first deployments
  5. Embedding models are separate from generation models

In the next lesson, we'll take a deep dive into Claude 3.5 Sonnet and why it's particularly well-suited for multimodal RAG.
