Image Embeddings

Just as we convert text to vectors, we can convert images into vectors. This allows us to search through thousands of images for something that "looks like" our query.

Visual Feature Extraction

An image embedding model (like CLIP or SigLIP) passes an image through the layers of a neural network and outputs a single vector. That vector does not store labeled attributes; instead it implicitly encodes high-level features such as:

Objects: Is there a dog or a cat?
Style: Is it a photo, a sketch, or a diagram?
Colors: What is the dominant palette?
Composition: Where is the subject located?

Measuring Visual Similarity

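When two image vectors are close together in vector space (closeness is typically measured with cosine similarity), the model considers the images similar. For models like CLIP, "similar" reflects content and meaning as well as raw appearance, so two photos of the same kind of object in different settings can still score high.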
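As a minimal sketch of how that closeness can be computed, the function below implements cosine similarity in plain Python. The three-dimensional vectors and their names (photo_a, photo_b, diagram) are made-up stand-ins for real embeddings, which usually have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (range -1.0 to 1.0)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real image embeddings (illustrative values only).
photo_a = [0.9, 0.1, 0.0]
photo_b = [0.8, 0.2, 0.0]
diagram = [0.0, 0.1, 0.9]

print(cosine_similarity(photo_a, photo_b))  # high: the vectors point the same way
print(cosine_similarity(photo_a, diagram))  # low: the vectors are nearly orthogonal
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for embeddings. For large collections, a numerical library or a vector database would normally replace this loop-based version.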

Use Case: Architectural Drawings

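Imagine you have 10,000 architectural blueprints. You embed every blueprint once and store the vectors. To search, you embed a sketch of a "Spiral Staircase" and rank the blueprints by how close their vectors are to the sketch's vector; the top results are the blueprints most likely to contain similar structures.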
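The search step in this use case can be sketched as a brute-force ranking over stored vectors. The file names, the toy three-dimensional embeddings, and the helper names (normalize, top_k_similar) are hypothetical; in a real pipeline the vectors would come from an image embedding model.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def top_k_similar(query, index, k=3):
    """Return up to k (name, score) pairs from index, most similar first.

    index maps an image name to its embedding (a list of floats).
    """
    q = normalize(query)
    scored = []
    for name, embedding in index.items():
        e = normalize(embedding)
        scored.append((name, sum(a * b for a, b in zip(q, e))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical blueprint index with made-up embeddings.
blueprint_index = {
    "tower_stairs.png": [0.9, 0.1, 0.1],
    "floor_plan.png": [0.1, 0.9, 0.2],
    "lobby_stairs.png": [0.8, 0.3, 0.1],
}
sketch_embedding = [1.0, 0.2, 0.0]  # stand-in for the embedded staircase sketch

for name, score in top_k_similar(sketch_embedding, blueprint_index, k=2):
    print(f"{name}: {score:.3f}")
```

Scanning every vector is fine for a few thousand images. At larger scale, an approximate nearest-neighbor index is the usual replacement for the loop.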

Implementation with CLIP (OpenAI)

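CLIP (Contrastive Language-Image Pre-training) is a widely used model for bridging text and images. It maps both into the same vector space, so a text query can be compared directly against image embeddings.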

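import torch
import clip  # OpenAI's reference package: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

# Use the GPU when one is available; CLIP also runs on CPU, just more slowly.
device = "cuda" if torch.cuda.is_available() else "cpu"
# clip.load returns the model and its matching preprocessing transform.
model, preprocess = clip.load("ViT-B/32", device=device)

def get_image_embedding(image_path):
    """Return the embedding tensor (shape (1, 512) for ViT-B/32) for one image."""
    # Preprocess the image, then add a batch dimension with unsqueeze(0).
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    # Inference only, so gradient tracking is not needed.
    with torch.no_grad():
        image_features = model.encode_image(image)
    return image_features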

Hosted Vision Embeddings

If you don't want to run your own models, cloud providers offer hosted multimodal embedding models that accept images:

Amazon Bedrock (AWS): Titan Multimodal Embeddings.
Google Cloud: Vertex AI Multimodal Embeddings.

Deduplication using Embeddings

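One powerful use of image embeddings is finding near-duplicate images in your RAG pipeline. If the cosine similarity between two image vectors exceeds a high threshold (for example, 0.99), you can discard one of them to save space and reduce noise. The right threshold depends on the model and the data, so check it against a sample of known duplicates before deleting anything.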
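One possible way to apply this is a greedy pass that keeps an image only if it is not too similar to anything already kept. This is a sketch rather than a prescribed method: the deduplicate function, its 0.99 default threshold, and the toy vectors are all illustrative assumptions.

```python
import math

def _unit(vec):
    """Return vec scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def deduplicate(embeddings, threshold=0.99):
    """Greedy near-duplicate filter.

    Walks the embeddings in order and keeps an item only when its cosine
    similarity to every previously kept item is below threshold.
    Returns the indices of the kept items.
    """
    kept_indices = []
    kept_vectors = []
    for i, embedding in enumerate(embeddings):
        u = _unit(embedding)
        is_duplicate = any(
            sum(a * b for a, b in zip(u, kept)) >= threshold
            for kept in kept_vectors
        )
        if not is_duplicate:
            kept_indices.append(i)
            kept_vectors.append(u)
    return kept_indices

# Toy embeddings (illustrative values only).
embeddings = [
    [1.0, 0.0, 0.0],
    [0.999, 0.01, 0.0],  # points almost exactly the same way as the first
    [0.0, 1.0, 0.0],
]
print(deduplicate(embeddings))
```

The result depends on input order, since the first image of each near-duplicate group is the one that survives. The comparison cost grows quadratically with the number of images, so large collections usually need an approximate nearest-neighbor index instead.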

Exercises

Find two different pictures of a "Golden Retriever".
Use a CLIP model to generate embeddings for both.
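Calculate the cosine similarity. Is it closer to 1.0 (exact match) or 0.5 (unrelated)? To see what "unrelated" actually scores with your model, also compare one of the pictures against an image of something completely different.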