Multi-Modal Embeddings: Beyond the Written Word

Explore the next frontier of AI search. Learn how models like CLIP and ImageBind map images, text, and audio into a single, unified vector space.

Multi-Modal Embeddings: The Unified Space

So far in this course, we have focused on Text. We have seen how words like "Dog" and "Puppy" end up near each other in vector space. But the human world isn't just text; it's images, sounds, and video.

What if you could search for a "Picture of a golden retriever" using a text query? Or find a "Song with a similar mood" to a picture of a sunset? This is possible through Multi-Modal Embeddings.

In this lesson, we explore the architecture of models like CLIP and ImageBind, and how they create a "Common Language" where a text vector and an image vector can be compared directly using Cosine Similarity.


1. The Multi-Modal Goal: Alignment

In previous modules, we used a Text Encoder to create vectors. In Multi-Modal AI, we use multiple encoders (one for text, one for images) that are trained to output vectors into the same coordinate system.

If the system is well-trained:

  • The vector for the word "Mountain" will be very close to...
  • The vector for a photograph of a mountain.

graph TD
    T[Text: 'Golden Retriever'] --> TE[Text Encoder]
    I[Image: Dog.jpg] --> IE[Image Encoder]
    TE --> VS((Common Vector Space))
    IE --> VS
    VS -.-> |Distance: 0.05| Close[Aligned Points]
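
In code, "close" just means a high cosine similarity, even though the two vectors came from different encoders. Here is a minimal sketch with made-up three-dimensional vectors standing in for real encoder outputs:

import numpy as np

# Made-up 3-dimensional vectors standing in for real encoder outputs.
text_vec = np.array([0.9, 0.1, 0.3])      # pretend: text encoder("Golden Retriever")
image_vec = np.array([0.85, 0.15, 0.35])  # pretend: image encoder(dog.jpg)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(text_vec, image_vec))  # close to 1.0 -> the points are aligned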

2. CLIP: The Pioneer

CLIP (Contrastive Language-Image Pre-training), released by OpenAI in 2021, changed everything. It was trained on roughly 400 million (Image, Caption) pairs scraped from the internet.

How CLIP works:

  1. It looks at an image and a caption.
  2. It tries to make their vectors as similar as possible (High Dot Product).
  3. Simultaneously, it tries to make the vector of that image as different as possible from any other caption (Low Dot Product).

Result: You can now search your vector database for images using text queries, without ever having to manually "tag" your photos with keywords.
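
Under the hood, the training objective is contrastive: each image in a batch must match its own caption better than every other caption. The sketch below illustrates that loss with random tensors standing in for real encoder outputs; it is an illustration of the idea, not OpenAI's actual training code.

import torch
import torch.nn.functional as F

# Toy batch: 4 image vectors and their 4 matching caption vectors. Random
# tensors stand in for real encoder outputs; both are L2-normalized so a dot
# product equals cosine similarity.
image_vecs = F.normalize(torch.randn(4, 512), dim=-1)
text_vecs = F.normalize(torch.randn(4, 512), dim=-1)

# 4x4 similarity matrix: entry [i, j] compares image i with caption j.
logits = image_vecs @ text_vecs.T / 0.07  # 0.07 is a commonly used temperature

# The "correct" caption for image i is caption i, i.e. the diagonal.
targets = torch.arange(4)

# Symmetric cross-entropy pulls matching pairs together (high diagonal scores)
# and pushes mismatched pairs apart (low off-diagonal scores).
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
print(loss.item())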


3. ImageBind: Six Modalities, One Vector

In 2023, Meta released ImageBind. While CLIP only handles Text + Images, ImageBind aligns six modalities:

  1. Text
  2. Image/Video
  3. Audio
  4. Depth (3D)
  5. Thermal (Infrared)
  6. IMU (Motion sensors)

This means you could, in theory, search for a video of a busy street using the audio of a car horn. The "meaning" of the car horn is mapped to the "meaning" of the visual car.
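
Once every modality lands in the same space, cross-modal retrieval is just a nearest-neighbor lookup. The sketch below illustrates audio-to-video search using random unit vectors as stand-ins for real ImageBind embeddings:

import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

# Random unit vectors stand in for real 1024-dim ImageBind embeddings of video clips.
video_index = {f"clip_{i:03d}.mp4": unit(rng.normal(size=1024)) for i in range(100)}

# Pretend this is the audio encoder's embedding of a car-horn recording.
query_audio = unit(rng.normal(size=1024))

# Because every modality shares one space, audio-to-video search is just a
# nearest-neighbor lookup by cosine similarity (dot product of unit vectors).
best_clip, score = max(
    ((name, float(query_audio @ vec)) for name, vec in video_index.items()),
    key=lambda item: item[1],
)
print(best_clip, score)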


4. Why This Matters for Vector Databases

Multi-modal models turn your vector database into a Universal Search Engine.

You no longer need:

  • Google Vision API to label photos.
  • Transcription services to index audio.
  • Manual metadata for videos.

You simply embed the raw file using a multi-modal encoder and store it in Pinecone or Chroma. The search "just works" regardless of whether the query is a word, a sound, or another image.
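
Here is a minimal sketch of that workflow with Chroma. It assumes you have already produced image embeddings with a multi-modal encoder; tiny made-up vectors stand in for real 512-dimensional CLIP outputs.

import chromadb

# Tiny made-up 4-dim vectors stand in for real 512-dim CLIP embeddings
# (see the OpenCLIP code in the next section for producing real ones).
image_embeddings = [
    [0.10, 0.90, 0.20, 0.30],  # dog.jpg
    [0.80, 0.10, 0.40, 0.20],  # sunset.jpg
]
text_query_embedding = [0.12, 0.88, 0.22, 0.31]  # "a photo of a dog"

client = chromadb.Client()  # in-memory instance
collection = client.create_collection(name="photos", metadata={"hnsw:space": "cosine"})
collection.add(ids=["dog.jpg", "sunset.jpg"], embeddings=image_embeddings)

results = collection.query(query_embeddings=[text_query_embedding], n_results=1)
print(results["ids"])  # [['dog.jpg']]

The collection is configured with cosine distance so that normalized embeddings from different modalities are compared the same way the encoder was trained.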


5. Python Concept: The CLIP Workflow

Let's look at how we compare a text query to an image in Python using the OpenCLIP library.

import torch
import open_clip
from PIL import Image

# 1. Load the model and the pre-processor
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# 2. Encode an Image
image = preprocess(Image.open("dog.jpg")).unsqueeze(0)
with torch.no_grad():
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True) # Normalize

# 3. Encode some Text
text = tokenizer(["a photo of a dog", "a photo of a cat"])
with torch.no_grad():
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True) # Normalize

# 4. Compare: the dot product of normalized vectors is the cosine similarity.
#    Scaling by 100 and applying softmax turns the scores into probabilities
#    over the candidate captions.
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(f"Label probabilities: {similarity}")

6. Challenges of Multi-modal Search

  1. Dimensionality: Multi-modal vectors are often large (512 to 1,024 dimensions), which drives up storage costs (see the back-of-envelope calculation after this list).
  2. Context Window: While text-only models handle thousands of words, multi-modal models often have small limits (e.g., CLIP only "sees" the first 77 tokens of a description).
  3. Hardware: Running these encoder models locally (for example, alongside a self-hosted Chroma instance) typically needs a GPU to index large collections at reasonable speed.
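
To put the storage point in perspective, here is a back-of-envelope calculation assuming raw float32 vectors and ignoring index overhead:

# Back-of-envelope storage for raw float32 vectors (index overhead not included).
num_vectors = 1_000_000
dims = 512            # e.g. CLIP ViT-B/32 output size
bytes_per_float = 4   # float32

total_gb = num_vectors * dims * bytes_per_float / 1e9
print(f"{total_gb:.2f} GB")  # ~2.05 GB for one million 512-dim vectors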

Summary and Key Takeaways

Multi-modality is the bridge between AI and the physical world.

  1. Alignment is the process of putting different data types into one vector space.
  2. CLIP provides the foundation for Image-to-Text and Text-to-Image search.
  3. ImageBind extends this to audio, motion, and 3D data.
  4. Cross-Modal Search: In a multi-modal database, a "Query" can be an image, and a "Result" can be a video.

In the next lesson, we will look at Storing Image and Video Vectors, focusing on the practical pipeline of chunking a video into frames for indexing.


Exercise: Multi-modal Intent

You are building a "Smart Home Security" app.

  • You have camera footage.
  • You have audio from a microphone.
  • You have motion sensor data.
  1. How would ImageBind allow a user to ask "Show me when someone was running outside" using a text query?
  2. Why is a multi-modal vector search better than a "Keyword" search for a video of a person stealing a package?
  3. If you want to find "Barking dogs" in your data using a recording of a bark, is this an Image-to-Audio or an Audio-to-Audio search?

Congratulations on starting Module 9! The world is now your coordinate system.
