
Choosing the Right Model Per Modality
Learn to select optimal models for different data types: text, images, audio, video, and structured data.
Different data types have different processing requirements. This lesson teaches you to match models to modalities for optimal results.
The Modality-Model Matrix
graph TD
A[Data Modality] --> B[Text]
A --> C[Images]
A --> D[Audio]
A --> E[Video]
A --> F[Structured Data]
B --> G[Text LLMs]
C --> H[Vision Models]
D --> I[Transcription + Text LLMs]
E --> J[Frame Extraction + Vision]
F --> K[Schema-Aware LLMs]
Text Processing
Model Selection for Text
text_models = {
    "simple_qa": {
        "model": "mistral-7b (Ollama)",
        "reason": "Fast, cheap, good enough"
    },
    "complex_reasoning": {
        "model": "claude-3.5-sonnet",
        "reason": "Best reasoning, long context"
    },
    "code_generation": {
        "model": "claude-3.5-sonnet",
        "reason": "Best coding performance"
    },
    "bulk_processing": {
        "model": "claude-haiku",
        "reason": "Fast and cheap for volume"
    }
}
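A minimal sketch of how this lookup table might be used in an application; the fallback to the cheap bulk model is an assumption, not part of the table above.
def select_text_model(task_type: str) -> str:
    """Return the model for a task type, defaulting to the cheap bulk option."""
    entry = text_models.get(task_type, text_models["bulk_processing"])
    return entry["model"]

# Example: route a hard question to the stronger model
print(select_text_model("complex_reasoning"))  # claude-3.5-sonnet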
Text Embedding Models
graph LR
A[Text Embeddings] --> B[OpenAI Ada-002]
A --> C[Cohere Embed]
A --> D[BGE / E5]
A --> E[Ollama Embeddings]
B --> F[Industry Standard]
C --> G[Best for Search]
D --> H[Open Source]
E --> I[Local/Private]
Recommendations:
| Use Case | Model | Why |
|---|---|---|
| Production RAG | Cohere Embed v3 | Best retrieval quality |
| Cost-sensitive | OpenAI Ada-002 | Good balance |
| Multilingual | BGE-M3 | 100+ languages |
| Local deployment | Nomic Embed | Open source, good quality |
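For the local/open-source rows, a minimal embedding sketch with the sentence-transformers package (the BGE checkpoint name is an assumption; any model from the table with a compatible loader works the same way):
# Sketch: local text embeddings with sentence-transformers (assumed setup)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
docs = ["Quarterly revenue grew 10%", "The new model supports 100+ languages"]
doc_vectors = model.encode(docs, normalize_embeddings=True)
query_vector = model.encode("how fast is revenue growing?", normalize_embeddings=True)
scores = doc_vectors @ query_vector  # cosine similarity on normalized vectors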
Image Processing
Vision Model Selection
image_tasks = {
    "ocr": {
        "simple": "Tesseract (open source)",
        "complex": "Claude 3.5 with vision",
        "production": "AWS Textract + Claude"
    },
    "image_understanding": {
        "best": "Claude 3.5 Sonnet",
        "local": "LLaVA 13B",
        "fast": "moondream (1.8B)"
    },
    "chart_analysis": {
        "best": "Claude 3.5 or GPT-4V",
        "acceptable": "Gemini 1.5"
    },
    "diagram_interpretation": {
        "technical": "Claude 3.5",
        "simple": "LLaVA"
    }
}
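A minimal sketch of the image-understanding path with the Anthropic SDK; the model ID, file name, and prompt are placeholders, and the client assumes an ANTHROPIC_API_KEY in the environment.
# Sketch: describe an image with Claude vision (assumed model ID and file name)
import base64
import anthropic

client = anthropic.Anthropic()
with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64",
                                         "media_type": "image/png",
                                         "data": image_b64}},
            {"type": "text", "text": "Describe this chart and its key trend."},
        ],
    }],
)
print(response.content[0].text)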
Image Embedding Models
graph TD
A[Image Embeddings] --> B[CLIP]
A --> C[ImageBind]
A --> D[DINOv2]
B --> E[Best for Image-Text Matching]
C --> F[Multimodal Unified Space]
D --> G[Best Visual Features]
For RAG:
# Conceptual: CLIP enables cross-modal search (search_by_text and
# search_by_image are placeholder functions over a CLIP-indexed store)
text_query = "red sports car"
image_results = search_by_text(text_query)    # Returns matching car images
# Or the reverse direction
image_query = "car_photo.jpg"
text_results = search_by_image(image_query)   # Returns matching descriptions
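The same idea, runnable with the Hugging Face CLIP implementation (the checkpoint is the standard openai/clip-vit-base-patch32; the image path and captions are placeholders):
# Sketch: image-text matching with CLIP via transformers
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("car_photo.jpg")
texts = ["a red sports car", "a bowl of fruit", "a city skyline at night"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # which caption fits the image best
print(dict(zip(texts, probs[0].tolist())))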
Image Processing Pipeline
graph LR
A[Image Input] --> B{Image Type}
B -->|Photo| C[CLIP Embedding]
B -->|Document| D[OCR + Text Processing]
B -->|Chart| E[Claude Vision Analysis]
B -->|Diagram| F[LLaVA or Claude]
C & D & E & F --> G[Vector DB]
Audio Processing
Audio Pipeline
graph TD
A[Audio File] --> B[Whisper Transcription]
B --> C[Text Output]
C --> D{Processing Goal}
D -->|Search| E[Text Embeddings]
D -->|Analysis| F[LLM Processing]
D -->|QA| G[RAG Pipeline]
style B fill:#d4edda
Transcription Models
audio_transcription = {
    "whisper_large_v3": {
        "accuracy": "Best (99%+)",
        "speed": "Slow (~10× audio duration)",
        "languages": 100,
        "use": "Production quality"
    },
    "whisper_medium": {
        "accuracy": "Good (95-98%)",
        "speed": "Medium (~5× audio duration)",
        "use": "Balanced"
    },
    "whisper_small": {
        "accuracy": "OK (90-95%)",
        "speed": "Fast (~2× audio duration)",
        "use": "Quick processing"
    },
    "deepgram_api": {
        "accuracy": "Excellent (98%+)",
        "speed": "Real-time",
        "cost": "Pay-per-use",
        "use": "Cloud transcription"
    }
}
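For reference, local transcription with the open-source whisper package looks roughly like this (the model size and file name are placeholders):
# Sketch: local transcription with the openai-whisper package
import whisper

model = whisper.load_model("medium")             # trade accuracy vs. speed per the table
result = model.transcribe("meeting.mp3")
print(result["text"])                            # full transcript
for seg in result["segments"][:3]:               # per-segment timestamps
    print(f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['text']}")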
Audio RAG Strategy
# Conceptual: Audio RAG processing
def process_audio(audio_file):
    # 1. Transcribe
    transcript = whisper.transcribe(audio_file, language='en')
    # 2. Add timestamps
    chunks = split_by_timestamp(transcript, chunk_size=30)  # 30-second chunks
    # 3. Embed
    for chunk in chunks:
        embedding = embed(chunk.text)
        store(embedding, metadata={
            'source': audio_file,
            'timestamp': chunk.start_time,
            'speaker': chunk.speaker  # If diarization available
        })
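One concrete way to implement the timestamped chunking step, assuming whisper-style segments (dicts with start, end, and text keys, as in the example above):
# Sketch: group whisper segments into ~30-second chunks for embedding
def chunk_segments(segments, window=30.0):
    chunks, current, start = [], [], 0.0
    for seg in segments:
        if seg["end"] - start > window and current:
            chunks.append({"start": start, "text": " ".join(current)})
            current, start = [], seg["start"]
        current.append(seg["text"].strip())
    if current:
        chunks.append({"start": start, "text": " ".join(current)})
    return chunks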
Speaker Diarization
For multi-speaker audio (meetings, interviews):
# Conceptual: speaker separation with pyannote.audio (runnable sketch below)
diarization = identify_speakers(audio_file)  # placeholder diarization call
# Result:
# [0:00-0:45] Speaker A: "Welcome to the meeting..."
# [0:45-1:30] Speaker B: "Thank you, I'd like to discuss..."
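With the actual pyannote.audio API this looks roughly like the following; the pretrained pipeline name and the Hugging Face token are assumptions you would adapt to your setup.
# Sketch: speaker diarization with pyannote.audio (requires accepting the
# pretrained pipeline's terms on Hugging Face and an access token)
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
                                    use_auth_token="hf_...")  # placeholder token
diarization = pipeline("meeting.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"[{turn.start:.1f}-{turn.end:.1f}] {speaker}")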
Video Processing
Video as Multi-Frame Images
graph TD
A[Video File] --> B[Extract Key Frames]
B --> C[Frame 1, 5, 10, ...]
C --> D[Vision Model Analysis]
D --> E[Scene Descriptions]
A --> F[Extract Audio]
F --> G[Whisper Transcription]
G --> H[Spoken Content]
E & H --> I[Combined Multimodal Index]
Frame Extraction Strategy
# Conceptual: three alternative strategies (pick the one that fits your content)
def extract_key_frames(video_path):
    # Option 1: Time-based (every N seconds)
    frames = extract_every_n_seconds(video_path, n=5)
    # Option 2: Scene change detection
    frames = extract_on_scene_change(video_path)
    # Option 3: Hybrid
    key_moments = detect_important_moments(video_path)  # ML-based
    frames = extract_at_timestamps(video_path, key_moments)
    return frames
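The time-based option maps directly onto OpenCV; a sketch assuming the opencv-python package:
# Sketch: extract one frame every n seconds with OpenCV
import cv2

def extract_every_n_seconds(video_path, n=5):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30  # fall back if FPS metadata is missing
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % int(fps * n) == 0:
            frames.append({"timestamp": idx / fps, "image": frame})
        idx += 1
    cap.release()
    return frames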
Video RAG Architecture
# Conceptual: Video processing for RAG
def index_video(video_file):
    # 1. Extract audio and transcribe
    audio = extract_audio(video_file)
    transcript = whisper.transcribe(audio)
    # 2. Extract key frames
    frames = extract_key_frames(video_file, every_n_seconds=10)
    # 3. Analyze frames
    frame_descriptions = []
    for frame in frames:
        desc = claude_vision.describe(frame)
        frame_descriptions.append({
            'timestamp': frame.timestamp,
            'description': desc,
            'embedding': embed(desc)
        })
    # 4. Align transcript with frames
    aligned_content = align_text_and_images(transcript, frame_descriptions)
    # 5. Index
    for content in aligned_content:
        index(content)
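align_text_and_images is left abstract above; a minimal version, assuming a whisper-style transcript with a segments list, pairs each described frame with the transcript segments near its timestamp:
# Sketch: pair each described frame with nearby transcript segments
def align_text_and_images(transcript, frame_descriptions, window=10.0):
    aligned = []
    for frame in frame_descriptions:
        t = frame["timestamp"]
        nearby_text = " ".join(
            seg["text"] for seg in transcript["segments"]
            if abs(seg["start"] - t) <= window
        )
        aligned.append({"timestamp": t,
                        "description": frame["description"],
                        "transcript": nearby_text})
    return aligned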
Video-Specific Models
video_models = {
    "frame_analysis": "Claude 3.5 or GPT-4V",
    "scene_understanding": "Gemini 1.5 (native video input)",
    "action_recognition": "Specialized CV models (e.g., trained on ActivityNet)",
    "object_tracking": "YOLO + tracking algorithms"
}
Structured Data (Tables, JSON, CSVs)
Schema-Aware Processing
graph LR
A[Structured Data] --> B{Format}
B -->|CSV| C[Parse to DataFrame]
B -->|JSON| D[Parse to Dict]
B -->|SQL| E[Query Results]
C & D & E --> F[LLM with Schema Context]
F --> G[Natural Language Understanding]
Table Processing
# Approach 1: Serialize table to text
table_as_text = """
| Quarter | Revenue | Profit |
|---------|---------|--------|
| Q1 2025 | $2.0M   | $0.4M  |
| Q2 2025 | $2.2M   | $0.5M  |
| Q3 2025 | $2.1M   | $0.45M |
| Q4 2025 | $2.3M   | $0.6M  |
"""

# Approach 2: Convert to structured format
table_as_json = {
    "schema": ["Quarter", "Revenue", "Profit"],
    "rows": [
        ["Q1 2025", 2000000, 400000],
        ["Q2 2025", 2200000, 500000],
        ...
    ]
}
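With pandas, both serializations fall out of the same DataFrame (a sketch; to_markdown requires the tabulate package to be installed):
# Sketch: produce both serializations from one DataFrame with pandas
import pandas as pd

df = pd.DataFrame({
    "Quarter": ["Q1 2025", "Q2 2025", "Q3 2025", "Q4 2025"],
    "Revenue": [2_000_000, 2_200_000, 2_100_000, 2_300_000],
    "Profit":  [400_000, 500_000, 450_000, 600_000],
})
table_as_text = df.to_markdown(index=False)   # Approach 1: markdown text for the prompt
table_as_json = {"schema": list(df.columns),  # Approach 2: compact structured form
                 "rows": df.values.tolist()}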
Best Models for Structured Data
structured_data_models = {
    "tables": {
        "best": "Claude 3.5 (excellent table understanding)",
        "alternative": "GPT-4 with function calling"
    },
    "json": {
        "best": "Any modern LLM",
        "prefer": "Models with JSON mode"
    },
    "spreadsheets": {
        "approach": "Parse to table + LLM",
        "tools": "Pandas + Claude"
    },
    "databases": {
        "approach": "Text-to-SQL + query + LLM summary",
        "tools": "LangChain SQL agents"
    }
}
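Stripped of any framework, the database row above is a simple loop: show the schema, have the model write SQL, run it, and summarize the rows. A sketch with sqlite3; llm is a placeholder for whichever chat model you call.
# Sketch of the text-to-SQL loop: schema -> generated SQL -> results -> summary
# `llm` is a placeholder callable for your chat model (Claude, GPT-4, a local model)
import sqlite3

def answer_from_db(question, db_path, llm):
    conn = sqlite3.connect(db_path)
    schema = "\n".join(row[0] for row in
                       conn.execute("SELECT sql FROM sqlite_master WHERE sql IS NOT NULL"))
    sql = llm(f"Schema:\n{schema}\n\nWrite one SQLite query answering: {question}")
    rows = conn.execute(sql).fetchall()
    return llm(f"Question: {question}\nSQL: {sql}\nRows: {rows}\nAnswer concisely.")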
Multi-Modal Fusion
Cross-Modal Queries
# Example: Product search with image + text
query = {
    "text": "comfortable running shoes under $100",
    "reference_image": "preferred_style.jpg"
}
# Retrieval strategy:
# 1. Text embedding for semantic search
# 2. Image embedding for visual similarity
# 3. Combine scores (weighted)
# 4. Filter by price (<$100)
results = multimodal_search(
    query,
    text_weight=0.6,
    image_weight=0.4,
    filters={'price': {'$lt': 100}}
)
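The score combination in step 3 can be as simple as a weighted sum; a sketch that assumes text_scores and image_scores are cosine similarities keyed by document ID:
# Sketch: weighted late fusion of text and image similarity scores
def fuse_scores(text_scores, image_scores, text_weight=0.6, image_weight=0.4):
    doc_ids = set(text_scores) | set(image_scores)
    return sorted(
        ((doc_id,
          text_weight * text_scores.get(doc_id, 0.0) +
          image_weight * image_scores.get(doc_id, 0.0))
         for doc_id in doc_ids),
        key=lambda pair: pair[1], reverse=True,
    )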
Model Routing Logic
def route_to_model(data, query):
    """Smart routing based on data type and complexity"""
    if is_image(data):
        if needs_ocr(data):
            return "claude-3.5-sonnet vision"
        else:
            return "llava-13b"  # Cheaper for simple tasks
    elif is_audio(data):
        transcript = whisper.transcribe(data)
        return process_text(transcript, query)
    elif is_video(data):
        frames = extract_frames(data)
        audio = extract_audio(data)
        return process_multimodal(frames, audio, query)
    elif is_structured(data):
        if is_complex_query(query):
            return "claude-3.5-sonnet"
        else:
            return "mixtral-8x7b"  # Cheaper for simple queries
    else:  # Text
        if len(data) > 100000:
            return "claude-3.5-sonnet"  # Long context
        else:
            return "gpt-4"  # Standard
Decision Framework
graph TD
A{Data Type?} --> B[Text]
A --> C[Image]
A --> D[Audio]
A --> E[Video]
A --> F[Structured]
B --> B1{Complexity?}
B1 -->|Simple| B2[Mixtral/Mistral]
B1 -->|Complex| B3[Claude 3.5]
C --> C1{Task?}
C1 -->|OCR| C2[Claude + Textract]
C1 -->|Understanding| C3[Claude/GPT-4V]
D --> D1[Whisper + Text Model]
E --> E1[Frame Extract + Audio + Fusion]
F --> F1[Claude 3.5]
Key Takeaways
- Text: Claude 3.5 for complex, Mixtral for simple
- Images: Claude 3.5 for production, LLaVA for local
- Audio: Whisper transcription + text models
- Video: Frame extraction + audio transcription
- Structured: Claude 3.5 excels at tables and JSON
- Always consider: Privacy, cost, latency trade-offs per modality
In Module 3, we'll design end-to-end RAG architectures that combine all these modalities.