
Image and Multimodal Embeddings: Seeing with Math
Learn how Vision Transformers and CLIP models bridge the gap between pixels and language. Explore how multimodal embeddings enable cross-modal search in a unified vector space.
In previous lessons, we learned how language becomes math. But the real "magic" of modern AI is its ability to connect different types of data—like finding a picture of a "sunny beach" using only text.
To do this, we need Multimodal Embeddings. This involves mapping images, text, and sometimes audio into a single, shared vector space.
In this lesson, we will explore the architecture of Vision Transformers (ViT) and the CLIP model, which serves as the "Rosetta Stone" of the modern multimodal AI stack.
1. How Image Embeddings Work (Vision Transformers)
Before Transformers, we used Convolutional Neural Networks (CNNs) to "look" at images. While effective, CNNs see an image locally, building up understanding from small neighborhoods of pixels.
Modern Vision Transformers (ViT) act much like text models:
- Patches: The image is sliced into a grid of small squares (patches).
- Flattening: Each patch is flattened into a vector.
- Linear Projection: Each patch vector is converted into an embedding.
- Self-Attention: Just like words in a sentence, the patches "look" at each other to understand the whole image.
graph TD
A[Raw Image] --> B[Slice into 16x16 Patches]
B --> C[Flatten & Project]
C --> D[Sequence of Patch Embeddings]
D --> E[Transformer Layers]
E --> F[Global Image Vector]
The output is a single vector (e.g., 512 dimensions) that captures the "essence" of the image—its style, subjects, colors, and layout.
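To make those four steps concrete, here is a minimal PyTorch sketch of the patch pipeline. The 16x16 patch size matches the diagram above; the 512-dimensional embedding and the random dummy image are illustrative choices, not the configuration of any particular ViT checkpoint.
import torch
import torch.nn as nn
# 1. A dummy batch: one RGB image, 224x224 pixels
image = torch.randn(1, 3, 224, 224)
# 2. "Slice + Flatten + Project" in one step: a strided convolution over 16x16 patches
patch_size, embed_dim = 16, 512
patch_embed = nn.Conv2d(in_channels=3, out_channels=embed_dim, kernel_size=patch_size, stride=patch_size)
patches = patch_embed(image)                  # (1, 512, 14, 14): one embedding per patch
patches = patches.flatten(2).transpose(1, 2)  # (1, 196, 512): a "sentence" of 196 patch tokens
print(patches.shape)  # torch.Size([1, 196, 512])
From here, the 196 patch tokens flow through the Transformer layers and are pooled into the single global image vector shown in the diagram.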
2. The Breakthrough: Contrastive Language-Image Pre-training (CLIP)
The problem with early image models was that they didn't understand language. You could find "similar images," but you couldn't search for them with words.
CLIP (developed by OpenAI) solved this by training two models simultaneously:
- An Image Encoder.
- A Text Encoder.
The Learning Goal
The model was fed 400 million pairs of (Image, Description). The goal was simple:
Make the vector of the Image and the vector of its Description as close as possible.
Anything else (wrong descriptions) should be pushed as far away as possible in the space.
graph LR
subgraph Shared_Space
T["Text: 'Dog in the park'"]
I[Image of Dog]
J[Image of Cat]
T ---|Close| I
T -.-|Far| J
end
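Here is a simplified sketch of that contrastive objective in PyTorch. The random tensors stand in for a batch of real encoder outputs, and the 0.07 temperature is an illustrative constant rather than CLIP's learned value.
import torch
import torch.nn.functional as F
# Stand-ins for a batch of 8 (image, text) pairs coming out of the two encoders
batch_size, dim = 8, 512
image_embeds = F.normalize(torch.randn(batch_size, dim), dim=-1)
text_embeds = F.normalize(torch.randn(batch_size, dim), dim=-1)
# Cosine similarity between every image and every text in the batch
temperature = 0.07
logits = image_embeds @ text_embeds.T / temperature  # (8, 8) similarity matrix
# The true pairs sit on the diagonal: image i should be closest to text i,
# and everything off the diagonal gets pushed away
targets = torch.arange(batch_size)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
print(loss.item())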
Because they were trained in the same space, we can now compare a text vector and an image vector directly using Cosine Similarity.
3. The Unified Vector Space
In a unified space, the coordinates for the word "Sunset" and the coordinates for a JPEG file of a sunset are effectively the same.
This is the foundation of:
- Text-to-Image Search: Searching a photo library for "my birthday cake."
- Image-to-Image Search: Finding "shoes that look like these."
- Zero-shot Classification: Labeling an image ("Is this a car or a boat?") by simply checking which text vector is closer.
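To see the shared space directly, here is a minimal sketch that embeds a text query and an image with the same CLIP checkpoint used later in this lesson (openai/clip-vit-base-patch32) and compares them with plain cosine similarity. The COCO image URL and the text prompt are just stand-ins for your own data.
import torch
import torch.nn.functional as F
import requests
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# One image and one text query, encoded by two different encoders
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
text_inputs = processor(text=["a photo of two cats sleeping"], return_tensors="pt", padding=True)
image_inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    text_vec = model.get_text_features(**text_inputs)     # shape (1, 512)
    image_vec = model.get_image_features(**image_inputs)  # shape (1, 512)
# Both vectors live in the same 512-dimensional space, so cosine similarity is meaningful
print(F.cosine_similarity(text_vec, image_vec).item())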
4. Multimodal Search in Production
When using a vector database for multimodal data, your indexing pipeline changes (see the sketch after the two lists below):
For Images:
- Ingest image.
- Pass through CLIP Image Encoder.
- Store Vector in DB.
For Text Queries:
- User types "Beach at night."
- Pass through CLIP Text Encoder.
- Search Image Vector Store.
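Here is a sketch of both flows, using FAISS (faiss-cpu) as a simple stand-in for a vector database and the openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers. The one-image "catalog" is a placeholder; in production you would batch-embed your full library.
import faiss
import requests
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
def embed_images(images):
    # Ingest: CLIP Image Encoder -> L2-normalized vectors ready for the index
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        vecs = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(vecs, dim=-1).numpy().astype("float32")
def embed_text(query):
    # Query: CLIP Text Encoder -> a vector in the same space
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        vec = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(vec, dim=-1).numpy().astype("float32")
# Indexing pipeline: ingest image -> image encoder -> store vector
urls = ["http://images.cocodataset.org/val2017/000000039769.jpg"]  # placeholder catalog
images = [Image.open(requests.get(u, stream=True).raw) for u in urls]
index = faiss.IndexFlatIP(512)  # inner product == cosine similarity on normalized vectors
index.add(embed_images(images))
# Query pipeline: user text -> text encoder -> search the image vectors
scores, ids = index.search(embed_text("Beach at night"), 1)
print(ids[0], scores[0])
Any dedicated vector database works the same way; the only requirement is that the stored image vectors and the query vectors come from the same multimodal model.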
5. Python Example: Visualizing Image-Text Similarity with CLIP
Let's use the transformers and torch libraries to see how a text prompt matches different images.
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel
# 1. Load CLIP Model and Processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# 2. Get some sample images
url_cat = "http://images.cocodataset.org/val2017/000000039769.jpg" # Image of 2 cats
image = Image.open(requests.get(url_cat, stream=True).raw)
# 3. Define candidate text prompts
labels = ["a photo of two cats", "a photo of a dog", "a remote control"]
# 4. Prepare inputs
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
# 5. Forward Pass
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # turn into probabilities
# 6. View Results
for label, prob in zip(labels, probs[0]):
    print(f"Probability that this is '{label}': {prob.item():.4f}")
Why this is powerful:
The model correctly identifies the image as "two cats" because the mathematical distance between that text vector and the image vector is the smallest. Notice that this works even though the model was never fine-tuned to classify cats; it simply learned what cats look like during its massive pre-training.
6. Challenges in Multimodal Embeddings
While CLIP is revolutionary, it has limitations that AI engineers must manage:
- Resolution Limits: Most CLIP models resize images to very small sizes (e.g., 224x224 pixels). Small details (like text on a sign or a tiny object in the background) might be lost.
- Counting and Logic: CLIP is notoriously bad at counting. It may not reliably tell "three dogs" from "two dogs" (you can probe this yourself with the sketch after this list).
- Storage Overhead: Multimodal vectors often require more storage because they are "highly descriptive."
- Data Privacy: Using hosted multimodal APIs often involves sending images over the wire, which might be a compliance risk.
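If you want to probe the counting limitation yourself, one quick (and unscientific) test is to reuse the zero-shot pattern from section 5 with prompts that differ only in the count and see how confident the scores really are:
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
# Prompts that differ only in the count
labels = ["a photo of one cat", "a photo of two cats", "a photo of three cats"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)
for label, prob in zip(labels, probs[0]):
    print(f"'{label}': {prob.item():.4f}")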
7. The Future: Audio and Video
The same multimodal logic is being applied to:
- Audio Embeddings: Using models like CLAP to search for "the sound of a siren."
- Video Embeddings: Treating a video as a sequence of image patches over time, allowing you to search for "a man running through a field."
In all these cases, the Vector Database remains the same. It doesn't care if the vector came from a pixel or a phoneme; it just performs the similarity search.
Summary and Key Takeaways
Multimodal embeddings create a "Universal Language" for AI.
- Vision Transformers (ViT) process images like text, using attention across patches.
- CLIP aligns Image and Text encoders into a single shared vector space.
- Cross-modal Retrieval allows text to find images and vice-versa.
- Shared Space means your vector database can store vectors from any source, as long as they were generated by the same multimodal model.
In the next lesson, we will look at a critical technical factor: Embedding Dimensionality. We will learn why 1536 is the "magic number" for OpenAI and how choosing the wrong dimensionality can destroy your search performance.
Exercise: Multimodal Querying
Imagine you are building a search engine for a stock photo website.
- A user searches for "The feeling of success."
- You have 1 million images.
- How would you use CLIP to find "success" without any manual tags?
- What kind of images do you think the vector database would return first? (e.g., people in suits, hikers on a mountain, graduates?).
Think about how "Abstract Concepts" are represented in pixels versus words.