Multimodal Embeddings

Master the concept of shared vector spaces where text and images coexist and interact.

The "holy grail" of Multimodal RAG is the shared embedding space. This is a vector space where a text description (e.g., "A photo of a sunrise") and the actual image of a sunrise are projected to the exact same location.

The Concept of Co-Embedding

Traditional systems tagged images with keywords (text) and searched only those keywords. Multimodal embeddings enable native cross-modal search (a minimal retrieval sketch follows this list):

  • Text-to-Image: Search for photos using a natural language query.
  • Image-to-Image: Find similar photos using a reference image.
  • Image-to-Text: Autogenerate descriptions or find documents about an image.
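
As a rough sketch, all three directions reduce to nearest-neighbor search in one space. The snippet below assumes a generic embed_text helper (any shared-space encoder, e.g. CLIP) and pre-computed image vectors; text-to-image search then looks like this:

import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search_images(query_text, image_vectors, embed_text, top_k=5):
    """Rank pre-computed image vectors against the embedded text query."""
    q = embed_text(query_text)  # placeholder encoder: any model sharing the image space
    scores = [(name, cosine_sim(q, vec)) for name, vec in image_vectors.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]

Swapping the text query for an image vector gives image-to-image search, and ranking text passages against an image vector gives image-to-text.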

Visualizing the Shared Space

Imagine the vector space as a map:

  • The word "Dog" and an image of a Husky are close together.
  • The word "Car" and an image of a Porsche are close together.
  • But "Dog" and the Porsche image are far apart.

graph LR
    subgraph "Vector Space"
        T1["Text: 'Beach'"] -- Near --> I1["Image: Beach.jpg"]
        T2["Text: 'Snow'"] -- Near --> I2["Image: Mountain.jpg"]
        I1 -- Far --- I2
    end
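
One way to actually draw this map is to project a handful of text and image vectors down to two dimensions. The sketch below is illustrative only: it assumes vectors is a dict mapping labels such as "Dog" or "husky.jpg" to embeddings from any shared-space encoder.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_shared_space(vectors):
    """Reduce labeled embeddings to 2D with PCA and scatter-plot them."""
    labels = list(vectors)
    points = PCA(n_components=2).fit_transform(np.stack([vectors[l] for l in labels]))
    plt.scatter(points[:, 0], points[:, 1])
    for (x, y), label in zip(points, labels):
        plt.annotate(label, (x, y))  # matching text/image pairs should cluster together
    plt.show()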

CLIP: The Bridging Model

CLIP (Contrastive Language-Image Pre-training, from OpenAI) was trained on roughly 400 million image-text pairs collected from the internet. Its contrastive objective pulls matching pairs together in the shared space, so the visual concept of a stop sign lands next to the text "Stop Sign".
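
A hedged sketch of this in practice, using the openly released CLIP checkpoint on Hugging Face (the image file name is a stand-in):

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("stop_sign.jpg")  # hypothetical local image
inputs = processor(text=["a stop sign", "a husky", "a sports car"],
                   images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# logits_per_image holds image-text similarity scores; softmax turns them into probabilities
print(outputs.logits_per_image.softmax(dim=1))  # "a stop sign" should score highest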

Key Benefit for RAG

You can index images without ever writing a single caption. The model "knows" what is in the image semantically.
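
As a sketch of caption-free indexing (reusing the model and processor from the CLIP snippet above; the file names are placeholders):

import torch

def embed_images(paths):
    """L2-normalized CLIP vectors for a list of image files; no captions needed."""
    inputs = processor(images=[Image.open(p) for p in paths], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_query(text):
    """L2-normalized CLIP vector for a text query."""
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

index = embed_images(["sunrise.jpg", "husky.jpg", "porsche.jpg"])  # hypothetical corpus
scores = embed_query("a photo of a sunrise") @ index.T  # highest score should be sunrise.jpg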

Multimodal Embeddings in Production

Titan Multimodal (AWS Bedrock)

The Titan Multimodal Embeddings model can embed text, an image, or both together in a single call. Embedding both at once is useful for combined queries such as "this dress, but in blue", where a reference image and modifying text produce one query vector.

# Conceptual AWS Bedrock call to Titan Multimodal Embeddings (check the current model ID in the Bedrock docs)
import base64, json
import boto3

bedrock = boto3.client("bedrock-runtime")
with open("dress.jpg", "rb") as f:  # hypothetical reference image
    base64_image_data = base64.b64encode(f.read()).decode("utf-8")

body = json.dumps({"inputText": "blue summer dress", "inputImage": base64_image_data})
response = bedrock.invoke_model(modelId="amazon.titan-embed-image-v1", body=body)
embedding = json.loads(response["body"].read())["embedding"]  # a single multimodal vector

Exercises

  1. Why is a shared embedding space better than just tagging images with alt-text?
  2. If you search for "Peaceful afternoon" using a multimodal embedding, what kind of images would you expect to find?
  3. What happens if the text query describes something the model has never seen (e.g., a specific new gadget)?
