
Seeing and Hearing: Building Multi-Modal GenAI Applications
AI beyond the text box. Learn how to architect applications that can process images, video, and audio using multi-modal foundation models in Amazon Bedrock.
Beyond the Text Box
For the first year of the GenAI boom, we were obsessed with "Text-in, Text-out." But humans don't just communicate with text—we use diagrams, voices, and videos. A Multi-Modal Foundation Model is one that can process multiple types of data simultaneously.
In the AWS Certified Generative AI Developer – Professional exam, you must demonstrate how to build these complex, sensory-rich applications.
1. Defining Modalities
A "Modality" is a single type of input or output.
- Unimodal: Text $\rightarrow$ Text.
- Bi-modal: Text + Image $\rightarrow$ Text.
- Multi-modal: Text + Image + Video + Audio $\rightarrow$ Unified Reasoning.
2. Bedrock's Multi-Modal Engines
Not all models are multi-modal. As a pro-developer, you need to know which tool to pick for each "Sense."
The Vision Experts (Claude 3 / 3.5)
- Strengths: Best-in-class at "Reasoning" about images.
- Example: Send a photo of a broken machine and ask: "Identify the bolt that is missing based on the schematic I sent earlier."
The Creative Multi-Tasker (Titan Multimodal Embeddings)
- Strengths: Creating a unified vector space for both text and images.
- Example: Search for a product using a picture instead of words (Visual Search).
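As a rough sketch of how this looks in code, the snippet below calls Titan Multimodal Embeddings through the Bedrock runtime; the model ID, request fields, and response handling are assumptions to verify against the current Bedrock documentation for your Region.

import base64
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

def embed_image(image_path, caption=None):
    # Titan Multimodal Embeddings accepts an image, text, or both, and
    # returns a single vector in a shared text/image embedding space.
    with open(image_path, "rb") as f:
        payload = {"inputImage": base64.b64encode(f.read()).decode("utf-8")}
    if caption:
        payload["inputText"] = caption
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-image-v1",  # assumed model ID; confirm availability in your Region
        body=json.dumps(payload)
    )
    return json.loads(response["body"].read())["embedding"]

Index every catalog photo once with a function like this, embed the shopper's photo the same way at query time, and rank results by cosine similarity.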
The Video/Audio Processors
While direct video processing is emerging, most current professional workflows involve:
- Audio: Transcribing to text using Amazon Transcribe.
- Video: Extracting keyframes and processing them as images with Claude 3.
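A minimal preprocessing sketch of those two patterns, assuming OpenCV is installed for keyframe extraction and that the audio file already sits in an S3 bucket you control (the job name, URI, and sampling interval below are placeholders):

import boto3
import cv2  # OpenCV, used here only to pull frames out of the video

def extract_keyframes(video_path, every_n_seconds=5):
    # Grab one frame every N seconds; each frame can then be sent to Claude 3 as an image.
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 30
    step = int(fps * every_n_seconds)
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    capture.release()
    return frames

def transcribe_audio(job_name, s3_uri):
    # Start an asynchronous Amazon Transcribe job; poll get_transcription_job for the transcript.
    transcribe = boto3.client("transcribe")
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": s3_uri},
        MediaFormat="mp3",
        LanguageCode="en-US",
    )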
3. Architecting Multi-Modal RAG
In a classic RAG system, your vector store only holds text. In Multi-modal RAG, your engine can store and retrieve images.
graph TD
A[PDF with Diagrams] --> B[Unstructured Parsing]
B --> C[Text Chunks]
B --> D[Image/Chart Extraction]
C --> E[Titan Text Embeddings]
D --> F[Titan Multimodal Embeddings]
E --> G[Vector Store]
F --> G
U[User Query: 'Show me the revenue chart'] --> G
G --> H[Retrieve Image + Context]
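To make the retrieval half of this diagram concrete, here is a minimal, self-contained sketch of the similarity search; the in-memory index stands in for your real vector store, and the query vector is assumed to come from the same Titan Multimodal Embeddings call used at ingestion time.

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vector, index, top_k=3):
    # index is a list of (vector, metadata) pairs built at ingestion time;
    # text chunks and extracted charts share the same embedding space.
    scored = [(cosine_similarity(query_vector, vec), meta) for vec, meta in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

Because text and images share one vector space, a query such as "Show me the revenue chart" can surface the chart image itself, which you then pass back to a vision-capable model along with the surrounding text.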
4. Implementation: Handling Binary Data
When you send an image to a model in Bedrock, you generally can't point it at a URL (the model has no internet access). Instead, you include the Base64-encoded bytes of the image directly in the request body.
Code Example: Sending an image to Claude 3
import base64
import json

import boto3

def analyze_image(image_path, prompt):
    bedrock = boto3.client("bedrock-runtime")

    # Read the image and Base64-encode it; the Anthropic Messages API on
    # Bedrock expects the bytes inline in the request body, not a URL.
    with open(image_path, "rb") as image_file:
        base_64_encoded_data = base64.b64encode(image_file.read()).decode("utf-8")

    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",
                            "data": base_64_encoded_data
                        }
                    },
                    {
                        "type": "text",
                        "text": prompt
                    }
                ]
            }
        ]
    }

    # Call the model and return the generated text
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # substitute the Claude model enabled in your account
        body=json.dumps(body)
    )
    result = json.loads(response["body"].read())
    return result["content"][0]["text"]
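A quick way to exercise the helper (the file name and question are placeholders tied to the broken-machine example above):

print(analyze_image("machine_photo.jpg", "Which bolt is missing compared to the schematic?"))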
5. Use Case: Automated Document Review
Imagine a mortgage company. They get 1,000 documents a day (scans of IDs, pay stubs, bank statements).
- The Old Way: Run OCR to turn each scan into messy plain text, then hand that text to an LLM.
- The Multi-modal Way: Send the image of the ID directly to Claude 3.5. Because Claude understands layout and visual context, it "sees" the difference between a driver's license and a utility bill much more reliably than a text-only model.
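Reusing the analyze_image helper from the implementation example above, a document-classification call might look like the hypothetical snippet below (the file name and label set are illustrative):

labels = ["driver's license", "utility bill", "pay stub", "bank statement"]
prompt = (
    "Classify this scanned document as exactly one of: "
    + ", ".join(labels)
    + ". Reply with the label only."
)
print(analyze_image("scan_0017.jpg", prompt))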
6. Pro-Tip: Context Squeezing
Images are token-heavy. One high-resolution image might consume as many tokens as 1,000 words.
- Optimization: Before sending, resize the image to the smallest readable resolution (usually 512px to 1024px) to save on latency and cost.
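A minimal sketch of that optimization, assuming Pillow is available and that 1024 px on the longest side keeps your documents readable:

import base64
import io

from PIL import Image

def downscale_and_encode(image_path, max_side=1024, quality=85):
    # Shrink the longest side to max_side (aspect ratio preserved, never upscaled),
    # then Base64-encode the re-compressed JPEG bytes for the Bedrock request.
    image = Image.open(image_path).convert("RGB")
    image.thumbnail((max_side, max_side))
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG", quality=quality)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")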
Knowledge Check: Test Your Multi-modal Knowledge
A developer wants to build a visual search feature for an e-commerce site where users can take a photo of a piece of furniture and find similar items in the catalog. Which AWS tool is best suited for generating the embeddings for both the product photos and the user's photos?
Summary
Multi-modality is the "Vision" of AI. By allowing models to see and process visual data directly, you unlock entirely new categories of automation. In the next lesson, we will dive deeper into Image, Video, and Audio Processing with Foundation Models.
Next Lesson: Processing the Senses: Image, Video, and Audio with FM