
Mixing Modalities in Context
Master the art of presenting text, image descriptions, and audio transcripts to an LLM for holistic reasoning.
In Multimodal RAG, your context isn't just text. It's a blend of text chunks, image descriptions, and audio transcripts. How you format this mixture determines whether the LLM can "connect the dots" across modalities.
The "Interleaved" Approach
Don't group all text at the top and all images at the bottom. Interleave them semantically: place each visual or audio chunk next to the text it supports.
Example: A Repair Manual RAG
- Text Chunk: "To remove the filter, first unscrew the blue cap."
- Image Metadata: "[Visual: Photo of a blue cap with a 'Left-Turn' arrow]"
- Transcript Chunk: "[Audio: User manual narrator says 'Be careful not to strip the threads.']"
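One way to produce this interleaving at prompt-construction time is sketched below. The chunk fields (modality, content, source) are illustrative assumptions about your retrieval pipeline, not a fixed schema:

```python
# Hypothetical retrieved chunks; in practice these come from your vector store,
# already ordered by relevance or by their position in the source manual.
chunks = [
    {"modality": "text",  "content": "To remove the filter, first unscrew the blue cap."},
    {"modality": "image", "content": "Photo of a blue cap with a 'Left-Turn' arrow",
     "source": "filter_cap.jpg"},
    {"modality": "audio", "content": "Narrator: 'Be careful not to strip the threads.'",
     "source": "manual_narration.mp3"},
]

def render_chunk(chunk: dict) -> str:
    """Label each chunk with its modality so the LLM knows what it is reading."""
    if chunk["modality"] == "image":
        return f"[Visual: {chunk['content']}] (source: {chunk['source']})"
    if chunk["modality"] == "audio":
        return f"[Audio: {chunk['content']}] (source: {chunk['source']})"
    return chunk["content"]

# Interleave in semantic order rather than grouping by modality.
context = "\n\n".join(render_chunk(c) for c in chunks)
print(context)
```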
Formatting Image Context
Since most LLMs cannot "see" images in a standard text prompt, you must provide text representations:
- Descriptions: Generated during ingestion.
- OCR: Extracted from the image.
- Labels/Tags: Categorical data.
A formatted image chunk in the context might look like this:
### Document 4 (Visual)
Source: dashboard_screenshot.png
Description: A line chart showing a spike in server CPU usage at 2:00 PM.
OCR Text: "Alert: CPU 98%"
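Here is one way to assemble that block during prompt construction. This is a sketch: the metadata keys (description, ocr_text, tags) are assumptions about what your ingestion pipeline stores, not a standard schema.

```python
def format_visual_doc(doc_id: int, meta: dict) -> str:
    """Turn ingestion-time image metadata into a labeled context block."""
    lines = [f"### Document {doc_id} (Visual)", f"Source: {meta['source']}"]
    if meta.get("description"):
        lines.append(f"Description: {meta['description']}")
    if meta.get("ocr_text"):
        lines.append(f'OCR Text: "{meta["ocr_text"]}"')
    if meta.get("tags"):
        lines.append(f"Tags: {', '.join(meta['tags'])}")
    return "\n".join(lines)

block = format_visual_doc(4, {
    "source": "dashboard_screenshot.png",
    "description": "A line chart showing a spike in server CPU usage at 2:00 PM.",
    "ocr_text": "Alert: CPU 98%",
    "tags": ["monitoring", "cpu"],
})
print(block)
```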
Using Multimodal LLMs (Vision)
With models like Claude 3.5 Sonnet, you can send the actual image bytes alongside the text context.
- Pro: Highest possible accuracy; the model "sees" the subtle details.
- Con: Much more expensive; each image consumes a large number of visual tokens (e.g., ~1,600 tokens per image).
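A minimal sketch of this using the Anthropic Python SDK's Messages API is shown below. The model name, file path, and token limit are placeholders; adapt them to your provider and setup.

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Encode the retrieved screenshot as base64 so it can travel in the request body.
with open("dashboard_screenshot.png", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model name
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            {"type": "text",
             "text": "Context: retrieved monitoring docs...\n\nWhat caused the CPU spike?"},
        ],
    }],
)
print(response.content[0].text)
```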
Balancing Context
When mixing modalities:
- Use Descriptive Headers to tell the model what it's looking at (e.g., "This is a transcript from a meeting").
- Use Time-Offsets to help the model recognize that an audio segment and a video frame happened simultaneously (see the sketch below).
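The sketch below combines both ideas: every chunk gets a descriptive header and a time offset, so aligned audio and visual segments are easy to correlate. The labels and timestamps are illustrative.

```python
from datetime import timedelta

def fmt_offset(seconds: int) -> str:
    """Render a time offset as H:MM:SS so aligned chunks are easy to compare."""
    return str(timedelta(seconds=seconds))

# Hypothetical chunks from the same recording, aligned by offset.
segments = [
    ("Meeting transcript (audio)", 754,
     "Speaker 2: 'The deployment failed around two o'clock.'"),
    ("Screen recording frame (visual)", 754,
     "Terminal showing a failed CI pipeline with red X marks."),
    ("Meeting transcript (audio)", 790,
     "Speaker 1: 'Roll it back and check the CPU alert.'"),
]

context = "\n\n".join(
    f"[{label} @ {fmt_offset(offset)}]\n{content}"
    for label, offset, content in segments
)
print(context)
```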
Exercises
- How would you format a "Spreadsheet Row" so that it makes sense alongside a "Narrative Paragraph"?
- If you have 10 photos of the same event, should you include all of them or just a text summary?
- Design a prompt for a "Movie RAG" that includes the script and a description of the actor's facial expression.