Overview of Multimodal Models

Understanding the landscape of multimodal LLMs and their capabilities across text, vision, and audio.

Multimodal models can process and reason across multiple types of input—text, images, audio, and even video. This lesson explores the current landscape.

What Makes a Model "Multimodal"?

graph TD
    A[Traditional LLM] --> B[Text Input Only]
    B --> C[Text Output Only]
    
    D[Multimodal LLM] --> E[Text + Images + Audio]
    E --> F[Text Output with Cross-Modal Understanding]
    
    style A fill:#fff3cd
    style D fill:#d4edda

Key Capabilities:

  1. Vision + Language: Understand images and answer questions about them
  2. Audio + Language: Process speech and audio context
  3. Cross-Modal Reasoning: Connect information across modalities
  4. Unified Representation: Shared understanding space

Major Multimodal Models (2024-2026)

GPT-4 Vision (GPT-4V)

Capabilities:

  • Text + image understanding
  • OCR and diagram interpretation
  • Visual question answering
  • Chart and graph analysis

Limitations:

  • No native audio processing
  • Rate limits on vision tokens
  • Higher cost for image inputs
# Conceptual usage
response = gpt4v.generate({
    "text": "What's in this image?",
    "image": product_photo.jpg
})

Claude 3.5 Sonnet (Anthropic)

Capabilities:

  • Best-in-class vision understanding
  • Code generation from screenshots
  • Complex document analysis
  • Long context window (200K tokens)
  • Strong reasoning over multimodal inputs

When to Use:

  • High-accuracy requirements
  • Complex reasoning tasks
  • Large document processing
  • Production RAG systems
# Conceptual: Claude with image
response = claude.messages.create({
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Explain this diagram"},
            {"type": "image", "source": diagram.png}
        ]
    }]
})

Gemini 1.5 Pro (Google)

Capabilities:

  • Text, image, audio, and video
  • Massive context window (1M+ tokens)
  • Native video understanding
  • Real-time data processing

Unique Strengths:

  • Process entire movies
  • Multi-hour audio transcripts
  • Large codebase analysis
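
A rough sketch of calling Gemini 1.5 with non-text input, using the google-generativeai SDK (the model name, file, and prompt are illustrative; check the SDK docs for current parameters):
# Sketch: Gemini 1.5 with a video file via the google-generativeai SDK
# (requires an API key; large files may need a short processing wait before use)
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

recording = genai.upload_file("all_hands_recording.mp4")  # illustrative file
response = model.generate_content([
    recording,
    "Summarize the decisions made in this meeting and list the action items.",
])
print(response.text)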

LLaVA and Open-Source Models

Models:

  • LLaVA (Large Language and Vision Assistant)
  • BakLLaVA
  • CogVLM

Advantages:

  • Fully open source
  • Run locally with Ollama
  • No API costs
  • Complete data privacy

Trade-offs:

  • Lower accuracy than commercial models
  • Higher compute requirements
  • Less mature tooling
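
If you want to try this locally, here is a minimal sketch against Ollama's HTTP generate endpoint (it assumes `ollama pull llava` has been run and the server is listening on its default port):
# Sketch: query a local LLaVA model through Ollama's HTTP API
import base64
import requests

with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "llava",
    "prompt": "What trend does this chart show?",
    "images": [image_b64],   # images are passed as base64 strings
    "stream": False,
})
print(resp.json()["response"])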

Model Capabilities Comparison

| Model      | Vision       | Audio | Video | Context Window | Best For        |
|------------|--------------|-------|-------|----------------|-----------------|
| GPT-4V     | ✅ Excellent | ❌    | ❌    | 128K           | General purpose |
| Claude 3.5 | ✅ Best      | ❌    | ❌    | 200K           | Production RAG  |
| Gemini 1.5 | ✅ Strong    | ✅    | ✅    | 1M+            | Long context    |
| LLaVA      | ✅ Good      | ❌    | ❌    | 32K            | Local/private   |

How Multimodal Models Work

graph LR
    A[Input Data] --> B{Input Type}
    
    B -->|Text| C[Text Encoder]
    B -->|Image| D[Vision Encoder]
    B -->|Audio| E[Audio Encoder]
    
    C & D & E --> F[Unified Representation]
    F --> G[Transformer Decoder]
    G --> H[Text Output]

The Architecture

  1. Separate Encoders: Each modality has a specialized encoder

    • Vision: Processes images into embeddings (e.g., CLIP)
    • Audio: Converts sound to features (e.g., Whisper)
    • Text: Tokenizes and embeds text
  2. Projection Layer: Maps different modalities to shared space

  3. Unified Transformer: Processes combined representations

  4. Decoder: Generates text responses
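
To make the flow concrete, here is a toy sketch of the encode, project, fuse, decode pattern in PyTorch. Every dimension and module is a placeholder, not any specific model's internals:
# Toy illustration of the encode -> project -> fuse -> decode pattern
import torch
import torch.nn as nn

class ToyMultimodalLM(nn.Module):
    def __init__(self, d_model=256, vocab_size=1000):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)    # text "encoder"
        self.vision_proj = nn.Linear(512, d_model)             # image features -> LLM space
        self.audio_proj = nn.Linear(128, d_model)              # audio features -> LLM space
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, image_feats, audio_feats):
        tokens = torch.cat([
            self.vision_proj(image_feats),  # image patches as "soft tokens"
            self.audio_proj(audio_feats),   # audio frames as "soft tokens"
            self.text_embed(text_ids),      # ordinary text tokens
        ], dim=1)
        return self.lm_head(self.backbone(tokens))  # next-token logits

# One forward pass on random data: 16 image patches, 8 audio frames, 10 text tokens
model = ToyMultimodalLM()
logits = model(torch.randint(0, 1000, (1, 10)), torch.randn(1, 16, 512), torch.randn(1, 8, 128))
print(logits.shape)  # torch.Size([1, 34, 1000])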

Training Process

graph TD
    A[Pretraining] --> B[Vision-Language Datasets]
    B --> C[Image-Caption Pairs]
    B --> D[OCR Datasets]
    B --> E[Diagram-Text Pairs]
    
    C & D & E --> F[Contrastive Learning]
    F --> G[Aligned Embeddings]
    
    H[Fine-Tuning] --> I[Instruction Following]
    I --> J[Task-Specific Data]
    J --> K[Production Model]

Pretraining:

  • Learn visual concepts
  • Align image and text embeddings
  • Billions of image-text pairs
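
The contrastive objective can be illustrated with a minimal CLIP-style loss; the embeddings below are random placeholders, so this is a sketch of the math rather than a training recipe:
# Toy CLIP-style contrastive loss: matching image/caption pairs are pulled
# together, mismatched pairs pushed apart
import torch
import torch.nn.functional as F

image_emb = F.normalize(torch.randn(8, 256), dim=-1)  # batch of 8 image embeddings
text_emb = F.normalize(torch.randn(8, 256), dim=-1)   # the 8 paired caption embeddings

logits = image_emb @ text_emb.T / 0.07                 # cosine similarity / temperature
targets = torch.arange(8)                              # i-th image matches i-th caption
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.T, targets)) / 2
print(loss)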

Fine-Tuning:

  • Task-specific training
  • Instruction following
  • Safety and alignment

Specialized Capabilities

OCR and Document Understanding

Models can extract text from:

  • Scanned documents
  • Handwritten notes
  • Complex layouts (multi-column, tables)
  • Low-quality images
Input: Photo of handwritten receipt
Output: Structured data {date, items, total}
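
As a concrete sketch, asking a vision model for structured output might look like this (shown here with the Anthropic SDK as one option; the model name, file, and prompt are illustrative, and an API key is required):
# Sketch: structured receipt extraction with a vision-capable model
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
with open("receipt.jpg", "rb") as f:
    receipt_b64 = base64.b64encode(f.read()).decode()

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64",
                                         "media_type": "image/jpeg",
                                         "data": receipt_b64}},
            {"type": "text", "text": "Extract the date, items, and total as JSON."},
        ],
    }],
)
print(message.content[0].text)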

Chart and Diagram Comprehension

graph LR
    A[Chart Image] --> B[Model]
    B --> C[Extract Data Points]
    B --> D[Understand Trends]
    B --> E[Answer Questions]
    
    F["Revenue Chart"] --> B
    B --> G["Q4 revenue was $2.3M, up 15% QoQ"]

Spatial Reasoning

Models can understand:

  • Object positions ("to the left of")
  • Relative sizes
  • Layouts and arrangements
  • 3D perspectives

Code from Screenshots

Input: Screenshot of UI mockup
Output: React/HTML code implementing the design

Embedding Models for Multimodal RAG

Different from generation models, embedding models convert data to vectors:

CLIP (OpenAI)

  • Text and image embeddings in shared space
  • Enables text-to-image and image-to-text search
  • Widely used standard
# Conceptual: CLIP embeddings
text_emb = clip.embed_text("a red car")
image_emb = clip.embed_image("car_photo.jpg")

similarity = cosine_sim(text_emb, image_emb)  # High if image shows red car
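
For a runnable counterpart to the conceptual snippet above, here is a small text-to-image retrieval sketch using a CLIP checkpoint exposed through sentence-transformers (the model name and file paths are illustrative):
# Sketch: text-to-image retrieval with a CLIP model via sentence-transformers
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

image_paths = ["slide_01.png", "slide_02.png", "slide_03.png"]  # your image corpus
image_embs = model.encode([Image.open(p) for p in image_paths])

query_emb = model.encode("the architecture diagram with three encoders")
scores = util.cos_sim(query_emb, image_embs)[0]
best = int(scores.argmax())
print(f"Best match: {image_paths[best]} (score={float(scores[best]):.2f})")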

ImageBind (Meta)

  • Binds 6 modalities: text, image, audio, depth, thermal, IMU
  • Enables cross-modal search

Multimodal Embeddings from Gemini/Bedrock

  • API-based embedding services
  • Optimized for retrieval
  • Handle images + text

Choosing the Right Model

graph TD
    A{Requirements} --> B{Privacy Critical?}
    
    B -->|Yes| C[Local Models]
    B -->|No| D{Budget?}
    
    C --> E[LLaVA/Ollama]
    
    D -->|Flexible| F{Use Case?}
    D -->|Limited| G[Open Source]
    
    F -->|Complex Docs| H[Claude 3.5]
    F -->|Long Context| I[Gemini 1.5]
    F -->|General| J[GPT-4V]

Decision Criteria:

  1. Privacy: Local vs Cloud
  2. Accuracy: Performance requirements
  3. Cost: API pricing
  4. Latency: Response time needs
  5. Context Length: Document size
  6. Modalities: What inputs are needed?

The Future: True Multimodality

Current models are still primarily text-centric with visual understanding bolted on.

Next Generation:

  • Native video understanding (not just frame extraction)
  • Real-time multimodal streaming
  • Audio-visual reasoning
  • Cross-modal generation (text → image, image → sound)

Key Takeaways

  1. Multimodal models are production-ready for RAG systems
  2. Claude 3.5 Sonnet leads in document understanding
  3. Gemini excels at long context
  4. Open-source options exist for privacy-first deployments
  5. Embedding models are separate from generation models

In the next lesson, we'll take a deep dive into Claude 3.5 Sonnet and why it's particularly well-suited for multimodal RAG.
