
Image Embeddings
How to convert visual data into vectors for similarity search and visual RAG applications.
Just as we convert text to vectors, we can convert images into vectors. This allows us to search through thousands of images for something that "looks like" our query.
Visual Feature Extraction
An image embedding model (like CLIP or SigLIP) processes an image through several layers of a neural network to extract high-level features like:
- Objects: Is there a dog or a cat?
- Style: Is it a photo, a sketch, or a diagram?
- Colors: What is the dominant palette?
- Composition: Where is the subject located?
Measuring Visual Similarity
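When two image vectors are close together in vector space, the model considers the images similar. Closeness is usually measured with cosine similarity, and with models like CLIP it reflects shared content and meaning as well as surface appearance.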
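As a minimal sketch of how "closeness" can be computed, the snippet below implements cosine similarity in plain Python. The three-dimensional vectors are made-up stand-ins for real embeddings (a CLIP ViT-B/32 image vector has 512 dimensions), and the names `photo_a`, `photo_b`, and `sketch` are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors invented for illustration, not real model output.
photo_a = [0.9, 0.1, 0.3]
photo_b = [0.8, 0.2, 0.4]   # points in nearly the same direction as photo_a
sketch = [-0.2, 0.9, 0.1]   # points in a very different direction

print(cosine_similarity(photo_a, photo_b))  # high score: similar
print(cosine_similarity(photo_a, sketch))   # low score: dissimilar
```

With real embeddings you would more likely use a tensor library such as PyTorch or NumPy for this, but the arithmetic is the same.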
Use Case: Architectural Drawings
Imagine you have 10,000 architectural blueprints. You can take a sketch of a "Spiral Staircase" and use an image embedding to find all blueprints that contain similar structures.
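A search like this boils down to scoring the query vector against every stored vector and keeping the best matches. The sketch below shows that idea with a brute-force scan. The file names and the tiny three-dimensional vectors are hypothetical placeholders, and `top_k` is an illustrative helper, not part of any library.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query, index, k=2):
    """Return the k (name, score) pairs most similar to the query vector."""
    q = normalize(query)
    scored = []
    for name, vec in index.items():
        v = normalize(vec)
        score = sum(a * b for a, b in zip(q, v))
        scored.append((name, score))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical blueprint embeddings (real ones would come from an image model).
index = {
    "blueprint_001.png": [0.9, 0.1, 0.2],
    "blueprint_002.png": [0.1, 0.8, 0.3],
    "blueprint_003.png": [0.8, 0.2, 0.1],
}
# Hypothetical embedding of the staircase sketch used as the query.
sketch_query = [0.85, 0.15, 0.15]

for name, score in top_k(sketch_query, index, k=2):
    print(f"{name}: {score:.3f}")
```

A linear scan is usually workable for a collection of around 10,000 vectors. For much larger collections, a vector database or approximate nearest-neighbor index is the more common choice.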
Implementation with CLIP (OpenAI)
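CLIP (Contrastive Language-Image Pre-training) is a widely used model for bridging text and images. It embeds both into the same vector space, so a text query can be compared directly against image vectors.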
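import torch
import clip  # OpenAI's CLIP package
from PIL import Image

# Use the GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def get_image_embedding(image_path):
    # Resize, crop, and normalize the image, then add a batch dimension.
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    # Inference only, so skip gradient tracking.
    with torch.no_grad():
        image_features = model.encode_image(image)
    return image_features  # shape (1, 512) for ViT-B/32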
Hosted Vision Embeddings
If you don't want to run your own models, cloud providers offer hosted multimodal embedding models:
- AWS Bedrock: Titan Multimodal Embeddings.
- Google Cloud: Vertex AI Multimodal Embeddings.
Deduplication using Embeddings
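One powerful use of image embeddings is finding near-duplicate images in your RAG pipeline. If the cosine similarity between two image vectors exceeds a high threshold (for example, 0.99), you can discard one of them to save space and reduce noise. The right threshold depends on your data and model, so check a sample of flagged pairs before deleting anything.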
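One simple way to apply this is a greedy filter: walk through the images in order and keep each one only if it is not too similar to anything already kept. The sketch below assumes embeddings are already computed. The file names, toy vectors, and the `deduplicate` helper are all illustrative, and the 0.99 default is only an example value.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def deduplicate(embeddings, threshold=0.99):
    """Greedy filter: keep an item only if no already-kept item scores >= threshold."""
    kept = []
    for name, vec in embeddings:
        is_duplicate = any(
            cosine_similarity(vec, kept_vec) >= threshold for _, kept_vec in kept
        )
        if not is_duplicate:
            kept.append((name, vec))
    return [name for name, _ in kept]

# Toy (name, vector) pairs invented for illustration.
embeddings = [
    ("chart_v1.png", [0.70, 0.70, 0.10]),
    ("chart_v1_copy.png", [0.71, 0.69, 0.10]),  # almost identical to chart_v1.png
    ("photo.png", [0.10, 0.20, 0.95]),
]

print(deduplicate(embeddings))  # → ['chart_v1.png', 'photo.png']
```

This compares every new item against every kept item, so the cost grows quadratically with collection size. That is fine for small sets, while larger ones would typically lean on a nearest-neighbor index to find candidate pairs.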
Exercises
- Find two different pictures of a "Golden Retriever".
- Use a CLIP model to generate embeddings for both.
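- Calculate the cosine similarity. Is it closer to 1.0 (the score of an identical image) or to the score of an unrelated pair? Compute the second as a baseline: unrelated images often score well above 0 with CLIP, so a value around 0.5 does not by itself indicate a match.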