Image Embeddings: Searching with Sight

Learn how to turn images into searchable vectors using CLIP and other vision-based models. Master the art of 'Searching by Example'.

Vector databases aren't just for text. In fact, some of the most powerful applications of vector search are visual. A user uploads a photo of a sneaker, and your database finds the exact match in your inventory, even if the product description never mentions the color or brand.

In this lesson, we explore Image Embeddings and the models that make them possible.


1. How Images Become Vectors

Just as text models (like BERT) encode the meaning of a sentence into a high-dimensional vector space, vision models (like ResNet or CLIP) encode the pixels of an image into a space of the same kind.

  • The Vision Transformer (ViT): Breaks the image into patches (like tokens in text) and analyzes their relationships (a minimal patch sketch follows this list).
  • The Result: A vector that represents visual concepts (e.g., "round shape," "blue texture," "car in a forest").
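
To make the "patches as tokens" idea concrete, here is a minimal sketch of how a ViT-style encoder carves an image into fixed-size patches before embedding them. The patch size of 32 matches the ViT-B/32 variant used later in this lesson, and the random array is just a stand-in for a real photo:

import numpy as np

# Stand-in for a real 224x224 RGB image
image = np.random.rand(224, 224, 3)
patch = 32  # ViT-B/32 uses 32x32-pixel patches

# Split the pixel grid into patches, then flatten each patch into one vector
patches = image.reshape(224 // patch, patch, 224 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)

print(patches.shape)  # (49, 3072): 49 "visual tokens", each a flat vector

The transformer then attends over these 49 tokens the same way a text model attends over words.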

2. CLIP: The Bridge between Text and Sight

The CLIP (Contrastive Language-Image Pre-training) model by OpenAI is the industry standard for multimodal search.

  • CLIP was trained on pairs of (Image, Caption).
  • It learned to project both the Image and the Text into the Same Vector Space.
  • The Result: You can query an image vector using a plain text string, as the diagram below shows.

graph LR
    A[Image: Dog on Grass] --> B[CLIP Image Encoder]
    C[Text: 'Pup in field'] --> D[CLIP Text Encoder]
    B --> E[Vector Space]
    D --> E
    E --> F{High Similarity!}

3. Implementation: Generating Image Vectors (Python)

Using the sentence-transformers library, which provides a simple wrapper for CLIP:

from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Load the CLIP model
model = SentenceTransformer('clip-ViT-B-32')

# 1. Encode the Image
img_emb = model.encode(Image.open('sneaker.jpg'))

# 2. Encode a Search String
text_emb = model.encode('A blue running shoe')

# 3. Calculate Similarity (a cosine score, not a probability)
cos_sim = util.cos_sim(img_emb, text_emb)
print(f"Cosine Similarity: {cos_sim.item():.4f}")

4. Why Use a Vector DB for Images?

If you have 10,000 images, you can compute a full similarity matrix in seconds. But with 10 million images, you cannot compare the user's photo to every image one by one.

A vector database (like Pinecone or Chroma) lets you perform this search in milliseconds using the HNSW or IVF indexes covered in Module 3.
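
As a rough sketch of what this looks like in code, here is an in-memory Chroma collection holding CLIP image vectors and answering a text query. The collection name, ids, and file names are illustrative, and a production setup would use a persistent client and batch the encoding:

import chromadb
from sentence_transformers import SentenceTransformer
from PIL import Image

model = SentenceTransformer('clip-ViT-B-32')

# Placeholder catalogue: in practice, this is your full image inventory
image_paths = ['shoe1.jpg', 'shoe2.jpg', 'boot.jpg']
img_embs = model.encode([Image.open(p) for p in image_paths])

client = chromadb.Client()
collection = client.create_collection(
    name="product_images",
    metadata={"hnsw:space": "cosine"},  # cosine distance, matching CLIP usage
)

collection.add(
    ids=[str(i) for i in range(len(image_paths))],
    embeddings=[emb.tolist() for emb in img_embs],
    metadatas=[{"path": p} for p in image_paths],
)

# Query with a text embedding: the HNSW index returns the nearest images
query_emb = model.encode('A blue running shoe')
results = collection.query(query_embeddings=[query_emb.tolist()], n_results=3)
print(results["metadatas"][0])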


5. Summary and Key Takeaways

  1. Pixels to Meaning: Vision models turn raw pixel data into conceptual vectors.
  2. Shared Space: Models like CLIP allow you to compare images to text directly.
  3. Similarity Matters: We search for images that "Look" similar or "Mean" the same thing.
  4. Scale needs DBs: Beyond a few thousand images, an ANN index such as HNSW is essential for fast visual search.

In the next lesson, we’ll move into the world of sound with Audio Embeddings.


Congratulations on completing Module 13 Lesson 1! You are now building visual AI systems.
