
Overview of Multimodal Models
Understanding the landscape of multimodal LLMs and their capabilities across text, vision, and audio.
Multimodal models can process and reason across multiple types of input—text, images, audio, and even video. This lesson explores the current landscape.
What Makes a Model "Multimodal"?
```mermaid
graph TD
    A[Traditional LLM] --> B[Text Input Only]
    B --> C[Text Output Only]
    D[Multimodal LLM] --> E[Text + Images + Audio]
    E --> F[Text Output with Cross-Modal Understanding]
    style A fill:#fff3cd
    style D fill:#d4edda
```
Key Capabilities:
- Vision + Language: Understand images and answer questions about them
- Audio + Language: Process speech and audio context
- Cross-Modal Reasoning: Connect information across modalities
- Unified Representation: Shared understanding space
Major Multimodal Models (2024-2026)
GPT-4 Vision (GPT-4V)
Capabilities:
- Text + image understanding
- OCR and diagram interpretation
- Visual question answering
- Chart and graph analysis
Limitations:
- No native audio processing
- Rate limits on vision tokens
- Higher cost for image inputs
```python
# Conceptual usage: pair a text question with an image
response = gpt4v.generate({
    "text": "What's in this image?",
    "image": "product_photo.jpg"
})
```
Claude 3.5 Sonnet (Anthropic)
Capabilities:
- Best-in-class vision understanding
- Code generation from screenshots
- Complex document analysis
- Long context window (200K tokens)
- Strong reasoning over multimodal inputs
When to Use:
- High-accuracy requirements
- Complex reasoning tasks
- Large document processing
- Production RAG systems
```python
# Conceptual: Claude with an image
# (the real Anthropic SDK expects the image as a base64-encoded source block)
response = claude.messages.create(
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Explain this diagram"},
            {"type": "image", "source": "diagram.png"}
        ]
    }]
)
```
Gemini 1.5 Pro (Google)
Capabilities:
- Text, image, audio, and video
- Massive context window (1M+ tokens)
- Native video understanding
- Real-time data processing
Unique Strengths:
- Process entire movies
- Multi-hour audio transcripts
- Large codebase analysis
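All of these inputs go through the same generation call. Below is a minimal sketch, assuming the google-generativeai Python SDK and a configured API key; the model name and meeting_recording.mp3 are placeholders, and exact file-handling details may vary by SDK version.

```python
# A minimal sketch, assuming the google-generativeai SDK is installed and an
# API key is available; "meeting_recording.mp3" is a placeholder file.
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the audio file, then wait until the service has finished processing it
audio_file = genai.upload_file(path="meeting_recording.mp3")
while audio_file.state.name == "PROCESSING":
    time.sleep(2)
    audio_file = genai.get_file(audio_file.name)

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    [audio_file, "Summarize the key decisions made in this meeting."]
)
print(response.text)
```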
LLaVA and Open-Source Models
Models:
- LLaVA (Large Language and Vision Assistant)
- BakLLaVA
- CogVLM
Advantages:
- Fully open source
- Run locally with Ollama (see the sketch below)
- No API costs
- Complete data privacy
Trade-offs:
- Lower accuracy than commercial models
- Higher compute requirements
- Less mature tooling
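Because Ollama exposes a local HTTP API, sending an image to LLaVA is a single POST request with the image base64-encoded. A minimal sketch, assuming Ollama is running locally and `ollama pull llava` has already been done; photo.jpg is a placeholder path:

```python
# A minimal sketch, assuming a local Ollama server with the llava model pulled;
# "photo.jpg" is a placeholder image path.
import base64
import requests

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava",
        "prompt": "Describe this image.",
        "images": [image_b64],
        "stream": False,
    },
)
print(response.json()["response"])
```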
Model Capabilities Comparison
| Model | Vision | Audio | Video | Context Window | Best For |
|---|---|---|---|---|---|
| GPT-4V | ✅ Excellent | ❌ | ❌ | 128K | General purpose |
| Claude 3.5 | ✅ Best | ❌ | ❌ | 200K | Production RAG |
| Gemini 1.5 | ✅ Strong | ✅ | ✅ | 1M+ | Long context |
| LLaVA | ✅ Good | ❌ | ❌ | 32K | Local/private |
How Multimodal Models Work
```mermaid
graph LR
    A[Input Data] --> B{Input Type}
    B -->|Text| C[Text Encoder]
    B -->|Image| D[Vision Encoder]
    B -->|Audio| E[Audio Encoder]
    C & D & E --> F[Unified Representation]
    F --> G[Transformer Decoder]
    G --> H[Text Output]
```
The Architecture
1. Separate Encoders: Each modality has a specialized encoder
   - Vision: Processes images into embeddings (e.g., CLIP)
   - Audio: Converts sound to features (e.g., Whisper)
   - Text: Tokenizes and embeds text
2. Projection Layer: Maps the different modalities into a shared embedding space (sketched below)
3. Unified Transformer: Processes the combined representations
4. Decoder: Generates text responses
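To make the projection step concrete, here is a minimal sketch assuming PyTorch; the class name, dimensions, and random tensors are illustrative and not taken from any particular model.

```python
# A minimal sketch, assuming PyTorch; dimensions and names are illustrative.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps vision-encoder patch embeddings into the LLM's token embedding space."""
    def __init__(self, vision_dim=1024, text_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, image_features):        # (batch, num_patches, vision_dim)
        return self.proj(image_features)       # (batch, num_patches, text_dim)

# Projected image "tokens" are concatenated with text token embeddings
image_features = torch.randn(1, 256, 1024)     # stand-in for a frozen vision encoder's output
text_embeddings = torch.randn(1, 32, 4096)     # stand-in for embedded prompt tokens
projector = VisionProjector()
combined = torch.cat([projector(image_features), text_embeddings], dim=1)
print(combined.shape)                           # torch.Size([1, 288, 4096])
```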
Training Process
```mermaid
graph TD
    A[Pretraining] --> B[Vision-Language Datasets]
    B --> C[Image-Caption Pairs]
    B --> D[OCR Datasets]
    B --> E[Diagram-Text Pairs]
    C & D & E --> F[Contrastive Learning]
    F --> G[Aligned Embeddings]
    H[Fine-Tuning] --> I[Instruction Following]
    I --> J[Task-Specific Data]
    J --> K[Production Model]
```
Pretraining:
- Learn visual concepts
- Align image and text embeddings
- Billions of image-text pairs
Fine-Tuning:
- Task-specific training
- Instruction following
- Safety and alignment
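The contrastive learning step in pretraining can be made concrete: for a batch of matched image-caption pairs, each image embedding should be most similar to its own caption's embedding and dissimilar to every other caption's. A minimal sketch, assuming PyTorch and random stand-in embeddings:

```python
# A minimal sketch of CLIP-style contrastive alignment, assuming PyTorch;
# the embeddings are random stand-ins for real encoder outputs.
import torch
import torch.nn.functional as F

batch_size = 8
image_emb = F.normalize(torch.randn(batch_size, 512), dim=-1)  # image encoder outputs
text_emb = F.normalize(torch.randn(batch_size, 512), dim=-1)   # text encoder outputs

# Similarity of every image against every caption in the batch (temperature-scaled)
logits = image_emb @ text_emb.T / 0.07

# The matching pair sits on the diagonal, so the target class for row i is i
targets = torch.arange(batch_size)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```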
Specialized Capabilities
OCR and Document Understanding
Models can extract text from:
- Scanned documents
- Handwritten notes
- Complex layouts (multi-column, tables)
- Low-quality images
- Input: Photo of handwritten receipt
- Output: Structured data {date, items, total}
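A concrete version of that receipt flow, as a minimal sketch assuming the Anthropic Python SDK with an API key in the environment; receipt.jpg, the model ID, and the output schema are placeholders:

```python
# A minimal sketch, assuming `pip install anthropic` and ANTHROPIC_API_KEY set;
# "receipt.jpg" and the JSON schema are placeholders.
import base64
import anthropic

client = anthropic.Anthropic()
with open("receipt.jpg", "rb") as f:
    receipt_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/jpeg", "data": receipt_b64}},
            {"type": "text",
             "text": "Extract the date, items, and total from this receipt. "
                     'Reply with JSON only: {"date": "...", "items": [], "total": 0}'},
        ],
    }],
)
print(response.content[0].text)  # parse with json.loads() in real use
```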
Chart and Diagram Comprehension
```mermaid
graph LR
    A[Chart Image] --> B[Model]
    B --> C[Extract Data Points]
    B --> D[Understand Trends]
    B --> E[Answer Questions]
    F["Revenue Chart"] --> B
    B --> G["Q4 revenue was $2.3M, up 15% QoQ"]
```
Spatial Reasoning
Models can understand:
- Object positions ("to the left of")
- Relative sizes
- Layouts and arrangements
- 3D perspectives
Code from Screenshots
- Input: Screenshot of UI mockup
- Output: React/HTML code implementing the design
Embedding Models for Multimodal RAG
Different from generation models, embedding models convert data to vectors:
CLIP (OpenAI)
- Text and image embeddings in shared space
- Enables text-to-image and image-to-text search
- Widely used standard
```python
# Conceptual: CLIP embeddings
text_emb = clip.embed_text("a red car")
image_emb = clip.embed_image("car_photo.jpg")
similarity = cosine_sim(text_emb, image_emb)  # high if the image shows a red car
```
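For a runnable version of the same idea, here is a minimal sketch assuming the Hugging Face transformers and Pillow packages with a standard public CLIP checkpoint; car_photo.jpg is a placeholder path.

```python
# A minimal sketch, assuming `pip install transformers pillow torch`;
# "car_photo.jpg" is a placeholder image path.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("car_photo.jpg")
inputs = processor(text=["a red car"], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

similarity = torch.nn.functional.cosine_similarity(text_emb, image_emb)
print(similarity.item())  # high if the image shows a red car
```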
ImageBind (Meta)
- Binds 6 modalities: text, image, audio, depth, thermal, IMU
- Enables cross-modal search
Multimodal Embeddings from Gemini/Bedrock
- API-based embedding services
- Optimized for retrieval
- Handle images + text
Choosing the Right Model
```mermaid
graph TD
    A{Requirements} --> B{Privacy Critical?}
    B -->|Yes| C[Local Models]
    B -->|No| D{Budget?}
    C --> E[LLaVA/Ollama]
    D -->|Flexible| F{Use Case?}
    D -->|Limited| G[Open Source]
    F -->|Complex Docs| H[Claude 3.5]
    F -->|Long Context| I[Gemini 1.5]
    F -->|General| J[GPT-4V]
```
Decision Criteria:
- Privacy: Local vs Cloud
- Accuracy: Performance requirements
- Cost: API pricing
- Latency: Response time needs
- Context Length: Document size
- Modalities: What inputs are needed?
The Future: True Multimodality
Current models are still primarily text-centric with visual understanding bolted on.
Next Generation:
- Native video understanding (not just frame extraction)
- Real-time multimodal streaming
- Audio-visual reasoning
- Cross-modal generation (text → image, image → sound)
Key Takeaways
- Multimodal models are production-ready for RAG systems
- Claude 3.5 Sonnet leads in document understanding
- Gemini excels at long context
- Open-source options exist for privacy-first deployments
- Embedding models are separate from generation models
In the next lesson, we'll take a deep dive into Claude 3.5 Sonnet and why it's particularly well-suited for multimodal RAG.