
Project: Building a Multi-Modal Visual Search Engine
Put your multi-modal skills to the test. Build a local AI application that can search through images using either text descriptions or reference photos.
Project: Your Personal Visual Search Engine
Welcome to the Module 9 Capstone Project! You have learned the theory of CLIP, the architecture of multi-modal vectors, and the strategies for storing visual data. Now, we are going to build a Local Visual Search Engine.
This application will allow you to point to a folder of images on your computer and instantly search through them using:
- Natural Language (e.g., "A photo of someone eating ice cream").
- Visual Similarity (e.g., "Find more photos that look like this one").
1. Project Objectives
- Ingestion: Efficiently load and embed a local directory of images.
- Persistence: Save the visual index to disk using Chroma's
PersistentClient. - Search UI: Build a Python CLI that displays the most relevant images.
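Before we write the full app, here is what "persistence" buys us, in one minimal sketch (assumptions: you have already run an ingest, and the ./db path and collection name match the app below). Reopening the same path restores the index from disk, so nothing needs to be re-embedded:

import chromadb

# Reopening the same path finds the same collection, vectors included.
client = chromadb.PersistentClient(path="./db")
collection = client.get_or_create_collection(name="my_visual_brain_v1")
print(f"Vectors on disk: {collection.count()}")

Note that count() works without an embedding function; for querying you pass the embedding function again, as app.py does below.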
2. Setting Up the Environment
We will use chromadb with its built-in OpenCLIP support.
pip install chromadb open_clip_torch pillow
Folder Structure:
/visual_search
/images <-- Your personal .jpg / .png photos
/db <-- Chroma data folder
app.py <-- Our main application
3. The Core Application Logic (app.py)
This script handles both the Ingestion and the Querying.
import os
import sys

import chromadb
import numpy as np
from chromadb.utils.data_loaders import ImageLoader
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from PIL import Image

# 1. Setup
DB_PATH = "./db"
IMAGE_DIR = "./images"
COLLECTION_NAME = "my_visual_brain_v1"

# 2. Initialize Chroma with multi-modal support.
# The embedding function downloads and loads the CLIP model automatically;
# the data loader reads image files from their paths (URIs) at embedding time.
clip_ef = OpenCLIPEmbeddingFunction()
image_loader = ImageLoader()

client = chromadb.PersistentClient(path=DB_PATH)
collection = client.get_or_create_collection(
    name=COLLECTION_NAME,
    embedding_function=clip_ef,
    data_loader=image_loader,
)

def ingest_images():
    image_files = [
        f for f in os.listdir(IMAGE_DIR)
        if f.lower().endswith((".png", ".jpg", ".jpeg"))
    ]
    image_paths = [os.path.join(IMAGE_DIR, f) for f in image_files]
    # Note: ids are positional, so re-running 'ingest' after changing the
    # folder can pair old ids with different files. Fine for this project.
    ids = [f"id_{i}" for i in range(len(image_files))]
    metadatas = [{"filename": f} for f in image_files]

    print(f"Ingesting {len(image_files)} images... this may take a minute.")
    # Because the collection has a data_loader, we pass the file paths as
    # 'uris' and Chroma loads + embeds every image in one batched call.
    collection.add(
        ids=ids,
        uris=image_paths,
        metadatas=metadatas,
    )
    print("Ingestion complete.")

def print_results(results):
    for i in range(len(results["ids"][0])):
        filename = results["metadatas"][0][i]["filename"]
        distance = results["distances"][0][i]
        print(f"Match: {filename} (Score: {distance:.4f})")

def search_by_text(query_text):
    results = collection.query(
        query_texts=[query_text],
        n_results=3,
        include=["metadatas", "distances"],
    )
    print(f"\nSearching for: '{query_text}'")
    print_results(results)

def search_by_image(image_path):
    # 'query_images' expects raw pixel data, not paths, so we load the
    # query image into a numpy array first.
    query_image = np.array(Image.open(image_path).convert("RGB"))
    results = collection.query(
        query_images=[query_image],
        n_results=3,
        include=["metadatas", "distances"],
    )
    print(f"\nSearching for similar images to: '{image_path}'")
    print_results(results)

if __name__ == "__main__":
    # Simple CLI dispatch: 'ingest', an existing file path, or a text query.
    if len(sys.argv) < 2:
        print("Usage:")
        print("  python app.py ingest")
        print("  python app.py 'text query'")
        print("  python app.py ./some_image.jpg")
    else:
        command = sys.argv[1]
        if command == "ingest":
            ingest_images()
        elif os.path.exists(command):
            search_by_image(command)
        else:
            search_by_text(command)
4. Step-by-Step Walkthrough
Step 1: Populate your Image Folder
Gather 10-20 diverse photos. Include some landscapes, some people, and some specific objects (like a coffee mug or a car).
Step 2: Run Ingestion
python app.py ingest
The first time you run this, open_clip will download the model weights.
Step 3: Run Text Search
python app.py "a beautiful landscape"
Check if the results make sense. Try complex queries like "something blue and metallic."
Step 4: Run Visual Search
Take a photo that you haven't indexed yet (but which is similar to others you have).
python app.py ./new_photo.jpg
5. Handling Performance (CPU vs. GPU)
If ingestion feels slow, that is almost certainly because CLIP is running on your CPU.
- Mac users: on Apple Silicon, Chroma can use the MPS backend to accelerate embedding (see the device sketch below).
- Windows/Linux users: if you have a CUDA GPU, make sure PyTorch is installed with CUDA support; this typically speeds up embedding by roughly 10x.
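If you want to pin the device explicitly instead of relying on defaults, here is a short sketch. One assumption to flag: it relies on your installed chromadb version's OpenCLIPEmbeddingFunction accepting a device argument (recent releases do; if yours raises a TypeError, fall back to the default constructor).

import torch
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction

# Prefer CUDA, then Apple Silicon's MPS, then plain CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

clip_ef = OpenCLIPEmbeddingFunction(device=device)
print(f"Embedding on: {device}")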
6. Project Expansion Ideas
- Add Web Previews: Use Flask or Streamlit to show the actual images in a browser instead of just the filenames in the terminal (see the sketch after this list).
- Object Detection: Run a YOLO model before ingestion and store the detected objects in the metadata. Then, use metadata filters (Module 5, Lesson 4) to narrow down your visual search.
- Large-Scale: Try indexing 1,000 images. How much space does the ./db folder take on your disk?
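To make the Web Previews idea concrete, here is a minimal Streamlit sketch. The file name preview_app.py is just a suggestion, and it assumes the ./db index and ./images folder from this project:

import chromadb
import streamlit as st
from chromadb.utils.data_loaders import ImageLoader
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction

# Reopen the persisted index built by 'python app.py ingest'.
client = chromadb.PersistentClient(path="./db")
collection = client.get_or_create_collection(
    name="my_visual_brain_v1",
    embedding_function=OpenCLIPEmbeddingFunction(),
    data_loader=ImageLoader(),
)

query = st.text_input("Describe the image you are looking for")
if query:
    results = collection.query(
        query_texts=[query],
        n_results=3,
        include=["metadatas", "distances"],
    )
    # Render each match with its filename and distance as the caption.
    for meta, dist in zip(results["metadatas"][0], results["distances"][0]):
        st.image(f"./images/{meta['filename']}",
                 caption=f"{meta['filename']} (Score: {dist:.4f})")

Run it with: streamlit run preview_app.py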
Summary and Module 9 Wrap-up
You have built a bridge between words and images!
- You mastered Multi-modal Ingestion.
- You saw the power of CLIP-based Retrieval.
- You built a Cross-query interface (Text and Image queries in one app).
What's next?
In Module 10: RAG with Vector Databases, we connect our vector stores to Large Language Models. We will move from "Finding Information" to "Answering Questions" using the data we retrieve.
Exercise: The Calibration Test
- Search for "A photo of a dog."
- Search for "An illustration of a dog."
- Search for "A statue of a dog."
Does the vector database correctly separate these three distinct "flavors" of dog? How does the style of the image (photo vs. illustration) affect the vector distance?
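If you want to run the calibration test programmatically, here is a short sketch (assuming it lives next to app.py; importing app is safe because its CLI code sits behind the __main__ guard):

from app import collection

prompts = [
    "A photo of a dog",
    "An illustration of a dog",
    "A statue of a dog",
]
for prompt in prompts:
    # Fetch only the single best match for each prompt.
    results = collection.query(
        query_texts=[prompt],
        n_results=1,
        include=["metadatas", "distances"],
    )
    best = results["metadatas"][0][0]["filename"]
    distance = results["distances"][0][0]
    print(f"{prompt} -> {best} (distance {distance:.4f})")

Comparing the three distances side by side shows how sharply CLIP separates the three styles.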