
Seeing and Hearing: Building Multi-Modal GenAI Applications
AI beyond the text box. Learn how to architect applications that can process images, video, and audio using multi-modal foundation models in Amazon Bedrock.
Beyond the Text Box
For the first year of the GenAI boom, we were obsessed with "Text-in, Text-out." But humans don't just communicate with text—we use diagrams, voices, and videos. A Multi-Modal Foundation Model is one that can process multiple types of data simultaneously.
In the AWS Certified Generative AI Developer – Professional exam, you must demonstrate how to build these complex, sensory-rich applications.
1. Defining Modalities
A "Modality" is a single type of input or output.
- Unimodal: Text $\rightarrow$ Text.
- Bi-modal: Text + Image $\rightarrow$ Text.
- Multi-modal: Text + Image + Video + Audio $\rightarrow$ Unified Reasoning.
2. Bedrock's Multi-Modal Engines
Not all models are multi-modal. As a pro-developer, you need to know which tool to pick for each "Sense."
The Vision Experts (Claude 3 / 3.5)
- Strengths: Best-in-class at "Reasoning" about images.
- Example: Send a photo of a broken machine and ask: "Identify the bolt that is missing based on the schematic I sent earlier."
The Creative Multi-Tasker (Titan Multimodal Embeddings)
- Strengths: Creating a unified vector space for both text and images.
- Example: Search for a product using a picture instead of words (Visual Search).
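As a rough sketch of how this looks in code, the snippet below calls Titan Multimodal Embeddings through the Bedrock runtime; the model ID, request fields, and response handling are assumptions to verify against the current Bedrock documentation for your Region.

import base64
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

def embed_image(image_path, caption=None):
    # Titan Multimodal Embeddings accepts an image, text, or both, and
    # returns a single vector in a shared text/image embedding space.
    with open(image_path, "rb") as f:
        payload = {"inputImage": base64.b64encode(f.read()).decode("utf-8")}
    if caption:
        payload["inputText"] = caption
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-image-v1",  # assumed model ID; confirm availability in your Region
        body=json.dumps(payload)
    )
    return json.loads(response["body"].read())["embedding"]

Index every catalog photo once with a function like this, embed the shopper's photo the same way at query time, and rank results by cosine similarity.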
The Video/Audio Processors
While direct video processing is emerging, most current professional workflows involve:
- Audio: Transcribing to text using Amazon Transcribe.
- Video: Extracting keyframes and processing them as images with Claude 3.
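A minimal preprocessing sketch of those two patterns, assuming OpenCV is installed for keyframe extraction and that the audio file already sits in an S3 bucket you control (the job name, URI, and sampling interval below are placeholders):

import boto3
import cv2  # OpenCV, used here only to pull frames out of the video

def extract_keyframes(video_path, every_n_seconds=5):
    # Grab one frame every N seconds; each frame can then be sent to Claude 3 as an image.
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 30
    step = int(fps * every_n_seconds)
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    capture.release()
    return frames

def transcribe_audio(job_name, s3_uri):
    # Start an asynchronous Amazon Transcribe job; poll get_transcription_job for the transcript.
    transcribe = boto3.client("transcribe")
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": s3_uri},
        MediaFormat="mp3",
        LanguageCode="en-US",
    )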
3. Architecting Multi-Modal RAG
In a classic RAG system, your vector store only holds text. In Multi-modal RAG, your engine can store and retrieve images.
graph TD
A[PDF with Diagrams] --> B[Unstructured Parsing]
B --> C[Text Chunks]
B --> D[Image/Chart Extraction]
C --> E[Titan Text Embeddings]
D --> F[Titan Multimodal Embeddings]
E --> G[Vector Store]
F --> G
U[User Query: 'Show me the revenue chart'] --> G
G --> H[Retrieve Image + Context]
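To make the retrieval half of this diagram concrete, here is a minimal, self-contained sketch of the similarity search; the in-memory index stands in for your real vector store, and the query vector is assumed to come from the same Titan Multimodal Embeddings call used at ingestion time.

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vector, index, top_k=3):
    # index is a list of (vector, metadata) pairs built at ingestion time;
    # text chunks and extracted charts share the same embedding space.
    scored = [(cosine_similarity(query_vector, vec), meta) for vec, meta in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

Because text and images share one vector space, a query such as "Show me the revenue chart" can surface the chart image itself, which you then pass back to a vision-capable model along with the surrounding text.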
4. Implementation: Handling Binary Data
When you send an image to a model in Bedrock, you generally can't point it at a URL (the model has no internet access). Instead, you include the Base64-encoded bytes of the image directly in the request body.
Code Example: Sending an image to Claude 3
import base64
import json

import boto3

def analyze_image(image_path, prompt):
    bedrock = boto3.client("bedrock-runtime")

    # Read the image and Base64-encode it; the Anthropic Messages API on
    # Bedrock expects the bytes inline in the request body, not a URL.
    with open(image_path, "rb") as image_file:
        base_64_encoded_data = base64.b64encode(image_file.read()).decode("utf-8")

    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",
                            "data": base_64_encoded_data
                        }
                    },
                    {
                        "type": "text",
                        "text": prompt
                    }
                ]
            }
        ]
    }

    # Call the model and return the generated text
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # substitute the Claude model enabled in your account
        body=json.dumps(body)
    )
    result = json.loads(response["body"].read())
    return result["content"][0]["text"]
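A quick way to exercise the helper (the file name and question are placeholders tied to the broken-machine example above):

print(analyze_image("machine_photo.jpg", "Which bolt is missing compared to the schematic?"))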
5. Use Case: Automated Document Review
Imagine a mortgage company. They get 1,000 documents a day (scans of IDs, pay stubs, bank statements).
- The Old Way: Run OCR to turn each scan into messy plain text, then hand that text to an LLM.
- The Multi-modal Way: Send the image of the ID directly to Claude 3.5. Because Claude understands layout and visual context, it "sees" the difference between a driver's license and a utility bill much more reliably than a text-only model.
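Reusing the analyze_image helper from the implementation example above, a document-classification call might look like the hypothetical snippet below (the file name and label set are illustrative):

labels = ["driver's license", "utility bill", "pay stub", "bank statement"]
prompt = (
    "Classify this scanned document as exactly one of: "
    + ", ".join(labels)
    + ". Reply with the label only."
)
print(analyze_image("scan_0017.jpg", prompt))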
6. Pro-Tip: Context Squeezing
Images are token-heavy. One high-resolution image might consume as many tokens as 1,000 words.
- Optimization: Before sending, resize the image to the smallest readable resolution (usually 512px to 1024px) to save on latency and cost.
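A minimal sketch of that optimization, assuming Pillow is available and that 1024 px on the longest side keeps your documents readable:

import base64
import io

from PIL import Image

def downscale_and_encode(image_path, max_side=1024, quality=85):
    # Shrink the longest side to max_side (aspect ratio preserved, never upscaled),
    # then Base64-encode the re-compressed JPEG bytes for the Bedrock request.
    image = Image.open(image_path).convert("RGB")
    image.thumbnail((max_side, max_side))
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG", quality=quality)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")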
Knowledge Check: Test Your Multi-modal Knowledge
A developer wants to build a visual search feature for an e-commerce site where users can take a photo of a piece of furniture and find similar items in the catalog. Which AWS tool is best suited for generating the embeddings for both the product photos and the user's photos?
Summary
Multi-modality is the "Vision" of AI. By allowing models to see and process visual data directly, you unlock entirely new categories of automation. In the next lesson, we will dive deeper into Image, Video, and Audio Processing with Foundation Models.
Next Lesson: Processing the Senses: Image, Video, and Audio with FM