
Project: Building a Multi-Modal Visual Search Engine
Put your multi-modal skills to the test. Build a local AI application that can search through images using either text descriptions or reference photos.
Project: Your Personal Visual Search Engine
Welcome to the Module 9 Capstone Project! You have learned the theory of CLIP, the architecture of multi-modal vectors, and the strategies for storing visual data. Now, we are going to build a Local Visual Search Engine.
This application will allow you to point to a folder of images on your computer and instantly search through them using:
- Natural Language (e.g., "A photo of someone eating ice cream").
- Visual Similarity (e.g., "Find more photos that look like this one").
1. Project Objectives
- Ingestion: Efficiently load and embed a local directory of images.
- Persistence: Save the visual index to disk using Chroma's
PersistentClient. - Search UI: Build a Python CLI that displays the most relevant images.
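Before we write the full app, here is what "persistence" buys us, in one minimal sketch (assumptions: you have already run an ingest, and the ./db path and collection name match the app below). Reopening the same path restores the index from disk, so nothing needs to be re-embedded:

import chromadb

# Reopening the same path finds the same collection, vectors included.
client = chromadb.PersistentClient(path="./db")
collection = client.get_or_create_collection(name="my_visual_brain_v1")
print(f"Vectors on disk: {collection.count()}")

Note that count() works without an embedding function; for querying you pass the embedding function again, as app.py does below.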
2. Setting Up the Environment
We will use chromadb with its built-in OpenCLIP support.
pip install chromadb open_clip_torch pillow
Folder Structure:
/visual_search
/images <-- Your personal .jpg / .png photos
/db <-- Chroma data folder
app.py <-- Our main application
3. The Core Application Logic (app.py)
This script handles both the Ingestion and the Querying.
import os
import sys

import chromadb
import numpy as np
from chromadb.utils.data_loaders import ImageLoader
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from PIL import Image

# 1. Setup
DB_PATH = "./db"
IMAGE_DIR = "./images"
COLLECTION_NAME = "my_visual_brain_v1"

# 2. Initialize Chroma with multi-modal support.
# The embedding function downloads and loads the CLIP model automatically;
# the data loader reads image files from their paths (URIs) at embedding time.
clip_ef = OpenCLIPEmbeddingFunction()
image_loader = ImageLoader()

client = chromadb.PersistentClient(path=DB_PATH)
collection = client.get_or_create_collection(
    name=COLLECTION_NAME,
    embedding_function=clip_ef,
    data_loader=image_loader,
)

def ingest_images():
    image_files = [
        f for f in os.listdir(IMAGE_DIR)
        if f.lower().endswith((".png", ".jpg", ".jpeg"))
    ]
    image_paths = [os.path.join(IMAGE_DIR, f) for f in image_files]
    # Note: ids are positional, so re-running 'ingest' after changing the
    # folder can pair old ids with different files. Fine for this project.
    ids = [f"id_{i}" for i in range(len(image_files))]
    metadatas = [{"filename": f} for f in image_files]

    print(f"Ingesting {len(image_files)} images... this may take a minute.")
    # Because the collection has a data_loader, we pass the file paths as
    # 'uris' and Chroma loads + embeds every image in one batched call.
    collection.add(
        ids=ids,
        uris=image_paths,
        metadatas=metadatas,
    )
    print("Ingestion complete.")

def print_results(results):
    for i in range(len(results["ids"][0])):
        filename = results["metadatas"][0][i]["filename"]
        distance = results["distances"][0][i]
        print(f"Match: {filename} (Score: {distance:.4f})")

def search_by_text(query_text):
    results = collection.query(
        query_texts=[query_text],
        n_results=3,
        include=["metadatas", "distances"],
    )
    print(f"\nSearching for: '{query_text}'")
    print_results(results)

def search_by_image(image_path):
    # 'query_images' expects raw pixel data, not paths, so we load the
    # query image into a numpy array first.
    query_image = np.array(Image.open(image_path).convert("RGB"))
    results = collection.query(
        query_images=[query_image],
        n_results=3,
        include=["metadatas", "distances"],
    )
    print(f"\nSearching for similar images to: '{image_path}'")
    print_results(results)

if __name__ == "__main__":
    # Simple CLI dispatch: 'ingest', an existing file path, or a text query.
    if len(sys.argv) < 2:
        print("Usage:")
        print("  python app.py ingest")
        print("  python app.py 'text query'")
        print("  python app.py ./some_image.jpg")
    else:
        command = sys.argv[1]
        if command == "ingest":
            ingest_images()
        elif os.path.exists(command):
            search_by_image(command)
        else:
            search_by_text(command)
4. Step-by-Step Walkthrough
Step 1: Populate your Image Folder
Gather 10-20 diverse photos. Include some landscapes, some people, and some specific objects (like a coffee mug or a car).
Step 2: Run Ingestion
python app.py ingest
The first time you run this, open_clip will download the model weights.
Step 3: Run Text Search
python app.py "a beautiful landscape"
Check if the results make sense. Try complex queries like "something blue and metallic."
Step 4: Run Visual Search
Take a photo that you haven't indexed yet (but which is similar to others you have).
python app.py ./new_photo.jpg
5. Handling Performance (CPU vs. GPU)
If ingestion feels slow, that is almost certainly because CLIP is running on your CPU.
- Mac users: on Apple Silicon, Chroma can use the MPS backend to accelerate embedding (see the device sketch below).
- Windows/Linux users: if you have a CUDA GPU, make sure PyTorch is installed with CUDA support; this typically speeds up embedding by roughly 10x.
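If you want to pin the device explicitly instead of relying on defaults, here is a short sketch. One assumption to flag: it relies on your installed chromadb version's OpenCLIPEmbeddingFunction accepting a device argument (recent releases do; if yours raises a TypeError, fall back to the default constructor).

import torch
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction

# Prefer CUDA, then Apple Silicon's MPS, then plain CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

clip_ef = OpenCLIPEmbeddingFunction(device=device)
print(f"Embedding on: {device}")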
6. Project Expansion Ideas
- Add Web Previews: Use Flask or Streamlit to show the actual images in a browser instead of just the filenames in the terminal (see the sketch after this list).
- Object Detection: Run a YOLO model before ingestion and store the detected objects in the metadata. Then, use metadata filters (Module 5, Lesson 4) to narrow down your visual search.
- Large-Scale: Try indexing 1,000 images. How much space does the ./db folder take on your disk?
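To make the Web Previews idea concrete, here is a minimal Streamlit sketch. The file name preview_app.py is just a suggestion, and it assumes the ./db index and ./images folder from this project:

import chromadb
import streamlit as st
from chromadb.utils.data_loaders import ImageLoader
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction

# Reopen the persisted index built by 'python app.py ingest'.
client = chromadb.PersistentClient(path="./db")
collection = client.get_or_create_collection(
    name="my_visual_brain_v1",
    embedding_function=OpenCLIPEmbeddingFunction(),
    data_loader=ImageLoader(),
)

query = st.text_input("Describe the image you are looking for")
if query:
    results = collection.query(
        query_texts=[query],
        n_results=3,
        include=["metadatas", "distances"],
    )
    # Render each match with its filename and distance as the caption.
    for meta, dist in zip(results["metadatas"][0], results["distances"][0]):
        st.image(f"./images/{meta['filename']}",
                 caption=f"{meta['filename']} (Score: {dist:.4f})")

Run it with: streamlit run preview_app.py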
Summary and Module 9 Wrap-up
You have built a bridge between words and images!
- You mastered Multi-modal Ingestion.
- You saw the power of CLIP-based Retrieval.
- You built a Cross-query interface (Text and Image queries in one app).
What's next?
In Module 10: RAG with Vector Databases, we connect our vector stores to Large Language Models. We will move from "Finding Information" to "Answering Questions" using the data we retrieve.
Exercise: The Calibration Test
- Search for "A photo of a dog."
- Search for "An illustration of a dog."
- Search for "A statue of a dog."
Does the vector database correctly separate these three distinct "flavors" of dog? How does the style of the image (photo vs. illustration) affect the vector distance?
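If you want to run the calibration test programmatically, here is a short sketch (assuming it lives next to app.py; importing app is safe because its CLI code sits behind the __main__ guard):

from app import collection

prompts = [
    "A photo of a dog",
    "An illustration of a dog",
    "A statue of a dog",
]
for prompt in prompts:
    # Fetch only the single best match for each prompt.
    results = collection.query(
        query_texts=[prompt],
        n_results=1,
        include=["metadatas", "distances"],
    )
    best = results["metadatas"][0][0]["filename"]
    distance = results["distances"][0][0]
    print(f"{prompt} -> {best} (distance {distance:.4f})")

Comparing the three distances side by side shows how sharply CLIP separates the three styles.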