Project: Building a Multi-Modal Visual Search Engine

Put your multi-modal skills to the test. Build a local AI application that can search through images using either text descriptions or reference photos.

Project: Your Personal Visual Search Engine

Welcome to the Module 9 Capstone Project! You have learned the theory of CLIP, the architecture of multi-modal vectors, and the strategies for storing visual data. Now, we are going to build a Local Visual Search Engine.

This application will allow you to point to a folder of images on your computer and instantly search through them using:

  1. Natural Language (e.g., "A photo of someone eating ice cream").
  2. Visual Similarity (e.g., "Find more photos that look like this one").

1. Project Objectives

  • Ingestion: Efficiently load and embed a local directory of images.
  • Persistence: Save the visual index to disk using Chroma's PersistentClient.
  • Search UI: Build a Python CLI that displays the most relevant images.

2. Setting Up the Environment

We will use chromadb with its built-in OpenCLIP support.

pip install chromadb open_clip_torch pillow
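
To confirm the install worked before writing any code, a quick version check is enough:

python -c "import chromadb; print(chromadb.__version__)"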

Folder Structure:

/visual_search
  /images         <-- Your personal .jpg / .png photos
  /db             <-- Chroma data folder
  app.py          <-- Our main application

3. The Core Application Logic (app.py)

This script handles both the Ingestion and the Querying.

import os
import sys

import chromadb
import numpy as np
from chromadb.utils.data_loaders import ImageLoader
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from PIL import Image

# 1. Setup
DB_PATH = "./db"
IMAGE_DIR = "./images"
COLLECTION_NAME = "my_visual_brain_v1"

# 2. Initialize Chroma with multi-modal support.
# The embedding function downloads and loads the CLIP model automatically;
# the ImageLoader lets Chroma read image files from the URIs we pass in.
clip_ef = OpenCLIPEmbeddingFunction()
image_loader = ImageLoader()

client = chromadb.PersistentClient(path=DB_PATH)
collection = client.get_or_create_collection(
    name=COLLECTION_NAME, 
    embedding_function=clip_ef,
    data_loader=image_loader
)

def ingest_images():
    image_files = [f for f in os.listdir(IMAGE_DIR) if f.lower().endswith(('.png', '.jpg', '.jpeg'))]
    
    # Batch adding for speed
    image_paths = [os.path.join(IMAGE_DIR, f) for f in image_files]
    ids = [f"id_{i}" for i in range(len(image_files))]
    metadatas = [{"filename": f} for f in image_files]
    
    print(f"Ingesting {len(image_files)} images... this may take a minute.")
    
    # With a data_loader attached, Chroma accepts file paths via 'uris':
    # the ImageLoader opens each file and the embedding function embeds it.
    collection.add(
        ids=ids,
        uris=image_paths,
        metadatas=metadatas
    )
    print("Ingestion complete.")

def search_by_text(query_text):
    results = collection.query(
        query_texts=[query_text],
        n_results=3,
        include=['metadatas', 'distances']
    )
    print(f"\nSearching for: '{query_text}'")
    for i in range(len(results['ids'][0])):
        # Chroma returns distances: lower means a closer match
        print(f"Match: {results['metadatas'][0][i]['filename']} (Distance: {results['distances'][0][i]:.4f})")

def search_by_image(image_path):
    # 'query_images' expects raw pixel arrays, so load the file with PIL first
    image = np.array(Image.open(image_path).convert("RGB"))
    results = collection.query(
        query_images=[image],
        n_results=3,
        include=['metadatas', 'distances']
    )
    print(f"\nSearching for similar images to: '{image_path}'")
    for i in range(len(results['ids'][0])):
        # Lower distance means a closer match
        print(f"Match: {results['metadatas'][0][i]['filename']} (Distance: {results['distances'][0][i]:.4f})")

if __name__ == "__main__":
    # Simple CLI logic
    if len(sys.argv) < 2:
        print("Usage:")
        print("  python app.py ingest")
        print("  python app.py 'text query'")
        print("  python app.py ./some_image.jpg")
    else:
        command = sys.argv[1]
        if command == "ingest":
            ingest_images()
        elif os.path.exists(command):
            search_by_image(command)
        else:
            search_by_text(command)

4. Step-by-Step Walkthrough

Step 1: Populate your Image Folder

Gather 10-20 diverse photos. Include some landscapes, some people, and some specific objects (like a coffee mug or a car).

Step 2: Run Ingestion

python app.py ingest

The first time you run this, open_clip will download the model weights before embedding begins.
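
By default the embedding function loads a ViT-B-32 OpenCLIP model. If you want to pin the weights explicitly, the constructor takes model_name and checkpoint arguments; the values below are Chroma's documented defaults, but verify them against your installed version:

from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction

# Pin the OpenCLIP architecture and pretrained weights explicitly.
# (These are the documented defaults; check your chromadb version.)
clip_ef = OpenCLIPEmbeddingFunction(
    model_name="ViT-B-32",
    checkpoint="laion2b_s34b_b79k",
)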

Step 3: Run Text Search

python app.py "a beautiful landscape"

Check if the results make sense. Try complex queries like "something blue and metallic."

Step 4: Run Visual Search

Take a photo that you haven't indexed yet (but which is similar to others you have).

python app.py ./new_photo.jpg

5. Handling Performance (CPU vs. GPU)

If your ingestion is slow, CLIP is almost certainly running on your CPU. The embedding function defaults to the CPU, but you can point it at a GPU through its device argument (see the sketch below).

  • Mac users: pass device="mps" to run on Apple Silicon.
  • Windows/Linux users: if you have a CUDA GPU, install PyTorch with CUDA support and pass device="cuda"; embedding is typically an order of magnitude faster than on CPU.
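
A minimal sketch of picking a device, assuming your installed Chroma version exposes the device argument on OpenCLIPEmbeddingFunction (recent releases do):

import torch
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction

# Pick the fastest device PyTorch can see: CUDA GPU, Apple Silicon, or CPU fallback
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

clip_ef = OpenCLIPEmbeddingFunction(device=device)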

6. Project Expansion Ideas

  1. Add Web Previews: Use Flask or Streamlit to show the actual images in a browser instead of just the filenames in the terminal.
  2. Object Detection: Run a YOLO model before ingestion and store the detected objects in the metadata. Then, use metadata filters (Module 5, Lesson 4) to narrow down your visual search; see the sketch after this list.
  3. Large-Scale: Try indexing 1,000 images. How much space does the ./db folder take on your disk?
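
For idea 2, a hedged sketch of what the filtered query could look like, assuming a hypothetical top_object metadata key that you would populate from YOLO detections at ingestion time (the key name is illustrative, not part of the app above):

# expansion_sketch.py: reuses the collection object from app.py
from app import collection

# Only consider images whose metadata says a dog was detected at ingestion time
results = collection.query(
    query_texts=["a dog playing in the park"],
    n_results=3,
    where={"top_object": "dog"},  # standard Chroma metadata filter
    include=['metadatas', 'distances']
)
for meta, dist in zip(results['metadatas'][0], results['distances'][0]):
    print(meta['filename'], round(dist, 4))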

Summary and Module 9 Wrap-up

You have built a bridge between words and images!

  • You mastered Multi-modal Ingestion.
  • You saw the power of CLIP-based Retrieval.
  • You built a Cross-query interface (Text and Image queries in one app).

What's next?

In Module 10: RAG with Vector Databases, we connect our vector stores to Large Language Models. We will move from "Finding Information" to "Answering Questions" using the data we retrieve.


Exercise: The Calibration Test

  1. Search for "A photo of a dog."
  2. Search for "An illustration of a dog."
  3. Search for "A statue of a dog."

Does the vector database correctly separate these three distinct "flavors" of dog? How does the style of the image (photo vs. illustration vs. statue) affect the vector distance?
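
If you want to run all three prompts back to back, a tiny driver script works (a sketch that assumes app.py sits in the same folder, so its collection is reused on import):

# calibration_test.py: a minimal sketch reusing search_by_text from app.py
from app import search_by_text

for prompt in [
    "A photo of a dog.",
    "An illustration of a dog.",
    "A statue of a dog.",
]:
    search_by_text(prompt)  # prints the top matches with their distances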


Congratulations on completing Module 9! You are now a pioneer of the multi-modal AI era.
