
Multimodal Embeddings
Master the concept of shared vector spaces where text and images coexist and interact.
The "holy grail" of Multimodal RAG is the shared embedding space. This is a vector space where a text description (e.g., "A photo of a sunrise") and the actual image of a sunrise are projected to nearby points, so the distance between two vectors reflects how closely their meanings match, regardless of modality.
The Concept of Co-Embedding
Traditional systems tagged images with text keywords and searched only those keywords. Multimodal embeddings enable search directly over the content itself:
- Text-to-Image: Search for photos using a natural language query.
- Image-to-Image: Find similar photos using a reference image.
- Image-to-Text: Retrieve the captions or documents that best match an image.
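Because every item lives in the same space, all three search modes reduce to one nearest-neighbor lookup. The following is a minimal sketch under stated assumptions: the three-dimensional vectors are made up for illustration (a real model would produce them), and `cosine` and `search` are hypothetical helper names, not part of any library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, index, k=2):
    """Return the ids of the k indexed items most similar to query_vec."""
    ranked = sorted(index, key=lambda item_id: cosine(query_vec, index[item_id]), reverse=True)
    return ranked[:k]

# Toy vectors standing in for real image embeddings (assumed values).
image_index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "mountain.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}

text_query_vec = [0.8, 0.2, 0.1]   # pretend embedding of the text "sunny beach"
image_query_vec = [0.2, 0.8, 0.0]  # pretend embedding of a reference snow photo

# The same function serves both modes; only the source of the query vector differs.
text_to_image = search(text_query_vec, image_index, k=1)
image_to_image = search(image_query_vec, image_index, k=1)
print(text_to_image, image_to_image)
```

Image-to-Text works the same way with the roles swapped: the index holds text vectors and the query is an image vector.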
Visualizing the Shared Space
Imagine the vector space as a map:
- The word "Dog" and an image of a Husky are close together.
- The word "Car" and an image of a Porsche are close together.
- But "Dog" and the Porsche image are far apart.
graph LR
    subgraph "Vector Space"
        T1["Text: 'Beach'"] -- Near --> I1["Image: Beach.jpg"]
        T2["Text: 'Snow'"] -- Near --> I2["Image: Mountain.jpg"]
        I1 -- Far --- I2
    end
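To make the map analogy concrete, the sketch below places the Dog and Car examples at made-up 2-D coordinates and measures the distances between them. The coordinates and the `distance` helper are assumptions for illustration only. Real embeddings have hundreds of dimensions and are usually compared with cosine similarity rather than straight-line distance.

```python
import math

# Made-up 2-D positions on the "map" (assumed values, not model output).
points = {
    "text:Dog": (1.0, 1.0),
    "image:husky.jpg": (1.2, 0.9),
    "text:Car": (8.0, 7.5),
    "image:porsche.jpg": (7.8, 7.9),
}

def distance(a, b):
    """Straight-line distance between two named points on the map."""
    return math.dist(points[a], points[b])

dog_to_husky = distance("text:Dog", "image:husky.jpg")
car_to_porsche = distance("text:Car", "image:porsche.jpg")
dog_to_porsche = distance("text:Dog", "image:porsche.jpg")

print(f"Dog -> Husky:   {dog_to_husky:.2f}")
print(f"Car -> Porsche: {car_to_porsche:.2f}")
print(f"Dog -> Porsche: {dog_to_porsche:.2f}")
```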
CLIP: The Bridging Model
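CLIP (Contrastive Language-Image Pre-training, from OpenAI) was trained on 400 million image-text pairs collected from the internet. Training pulls the embedding of each image toward the embedding of its matching caption and away from non-matching captions, so a photo of a stop sign lands near the text "Stop Sign".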
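The matching behavior comes from a contrastive objective. The sketch below is a simplified, pure-Python version of a symmetric contrastive loss over a batch of paired vectors; it is an illustration, not CLIP's actual training code. The function names are hypothetical, and the temperature is fixed at 0.07 here as an assumption, whereas CLIP learns it during training.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target index (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric loss: image i should match text i, and text i should match image i."""
    imgs = [normalize(v) for v in image_vecs]
    txts = [normalize(v) for v in text_vecs]
    n = len(imgs)
    # logits[i][j] = scaled similarity between image i and text j
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
              for img in imgs]
    loss_image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    loss_text_to_image = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_image_to_text + loss_text_to_image) / 2

# Correctly paired batch vs. the same batch with captions swapped (toy vectors).
aligned = contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned, shuffled)
```

The loss is low when each image sits closest to its own caption and high when the pairs are mismatched, which is the pressure that shapes the shared space.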
Key Benefit for RAG
You can index images without ever writing a single caption. The model "knows" what is in the image semantically.
Multimodal Embeddings in Production
Titan Multimodal (AWS Bedrock)
The Titan model can embed text, an image, or both together into a single vector. The combined mode is useful for queries like "this dress, but in blue": the image supplies the dress and the text supplies the change.
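# Conceptual AWS Bedrock call. The model ID, file name, and "embedding" response field are assumptions; check the current Bedrock docs.
import base64
import json
import boto3

with open("dress.jpg", "rb") as f:  # hypothetical local reference image
    base64_image_data = base64.b64encode(f.read()).decode("utf-8")
body = json.dumps({"inputText": "blue summer dress", "inputImage": base64_image_data})
response = boto3.client("bedrock-runtime").invoke_model(modelId="amazon.titan-embed-image-v1", body=body)
embedding = json.loads(response["body"].read())["embedding"]  # a single multimodal vector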
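Since the same request shape covers text-only, image-only, and combined inputs, a small helper can build the body for all three cases. This is a sketch, not an official client: `build_embedding_request` is a hypothetical name, and only the `inputText` and `inputImage` fields shown in the call above are assumed.

```python
import base64
import json

def build_embedding_request(text=None, image_bytes=None):
    """Build a JSON request body containing inputText, inputImage, or both."""
    if text is None and image_bytes is None:
        raise ValueError("Provide text, an image, or both")
    payload = {}
    if text is not None:
        payload["inputText"] = text
    if image_bytes is not None:
        # Images are sent as base64-encoded strings inside the JSON body.
        payload["inputImage"] = base64.b64encode(image_bytes).decode("utf-8")
    return json.dumps(payload)

fake_image = b"\x89PNG placeholder bytes"  # stand-in for real image data

text_only = json.loads(build_embedding_request(text="blue summer dress"))
both = json.loads(build_embedding_request(text="blue summer dress", image_bytes=fake_image))
print(sorted(text_only), sorted(both))
```

Rejecting an empty request up front keeps a malformed call from reaching the service.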
Exercises
- Why is a shared embedding space better than just tagging images with alt-text?
- If you search for "Peaceful afternoon" using a multimodal embedding, what kind of images would you expect to find?
- What happens if the text query describes something the model has never seen (e.g., a specific new gadget)?