
Text-to-Image and Image-to-Image: Querying the Visual Brain
Discover the two most powerful ways to search visual data. Learn how to transform a natural language query or a reference photo into a high-dimensional vector search.
Text-to-Image and Image-to-Image Search
In the previous lessons, we learned how to store visual vectors. Now, we learn how to find them. In a multi-modal system, your query is no longer limited to the keyboard.
We are going to explore two transformative search patterns:
- Text-to-Image (T2I): A user types a description and finds the most relevant photo.
- Image-to-Image (I2I): A user uploads a reference photo and finds photos that have a "similar vibe" or content.
These patterns are the foundation of modern apps like Pinterest, Google Photos, and high-end e-commerce platforms.
1. Text-to-Image: The Bridge of Meaning
This is the "Magic" of CLIP. Because the model has been trained to align text and pixels, the vector for the sentence "A cat wearing a party hat" sits geometrically close, in the shared embedding space, to the vector of an image of a cat in a party hat.
The Search Process:
- Query: The user types "Sunsets over the ocean."
- Embed: You run that text through the CLIP Text Encoder.
- Search: You take the resulting vector and search your database of Image Vectors.
- Result: The database returns the IDs of the images with the highest cosine similarity.
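To make those four steps concrete, here is a minimal sketch using the sentence-transformers CLIP wrapper instead of a full vector database; the model name and image paths are illustrative.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# A CLIP model that exposes both the text and the image encoder
model = SentenceTransformer("clip-ViT-B-32")

# Pretend these image vectors were computed at indexing time (illustrative paths)
image_paths = ["sunset_beach.jpg", "city_night.jpg", "forest.jpg"]
image_vectors = model.encode([Image.open(p) for p in image_paths])

# Query + Embed: run the text through the CLIP Text Encoder
query_vector = model.encode(["Sunsets over the ocean"])

# Search + Result: rank the image vectors by cosine similarity
scores = util.cos_sim(query_vector, image_vectors)[0]
best = int(scores.argmax())
print(f"Best match: {image_paths[best]} (score={float(scores[best]):.3f})")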
2. Image-to-Image: "Find more like this"
This is often called Visual Recommendation. Instead of describing what they want in words, the user "shows" the AI a reference.
The Search Process:
- Query: The user uploads a photo of a specific floral pattern.
- Embed: You run that image through the CLIP Image Encoder.
- Search: You search your database of Image Vectors.
- Result: The database returns images that are visually similar in color, composition, and content.
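The image-to-image flow is almost identical; the only change is that the query now goes through the image encoder. A minimal sketch with illustrative file names:
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Catalog image vectors computed at indexing time (illustrative paths)
catalog = ["floral_dress.jpg", "striped_shirt.jpg", "floral_curtain.jpg"]
catalog_vectors = model.encode([Image.open(p) for p in catalog])

# The reference photo goes through the CLIP Image Encoder
query_vector = model.encode([Image.open("reference_pattern.jpg")])

# Rank by visual similarity and keep the top 2
scores = util.cos_sim(query_vector, catalog_vectors)[0]
for idx in scores.argsort(descending=True)[:2]:
    print(catalog[int(idx)], round(float(scores[idx]), 3))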
graph TD
subgraph Multi_Modal_Models
TE[Text Encoder]
IE[Image Encoder]
end
Q1[Text Query] --> TE
Q2[Image Query] --> IE
TE --> V[Vector Search]
IE --> V
V --> DB[(Pinecone/Chroma)]
DB --> R[Visual Results]
3. The "Similarity Score" Problem in Images
In text search, a similarity of 0.8 usually means "A great match." In image search, however, CLIP similarity scores tend to cluster in a much narrower and lower range.
- 0.35: Might be a high-quality match.
- 0.25: Might be completely irrelevant.
Production Tip: You cannot use a hard cutoff (like score > 0.8) for multi-modal search. You must use Relative Ranking (Top-K) or perform your own "Calibration" based on the specific CLIP model you are using.
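One way to implement that in practice is a sketch like the following: keep the top-k hits and filter relative to the best score, rather than against a fixed absolute threshold (the k and cutoff values are arbitrary examples).
def select_results(scored_hits, k=5, relative_cutoff=0.8):
    """Keep the top-k hits, then drop anything scoring far below the best hit.

    scored_hits: list of (image_id, similarity) pairs from the vector search.
    relative_cutoff: keep hits scoring at least this fraction of the best score.
    """
    ranked = sorted(scored_hits, key=lambda hit: hit[1], reverse=True)[:k]
    if not ranked:
        return []
    best_score = ranked[0][1]
    return [hit for hit in ranked if hit[1] >= best_score * relative_cutoff]

# Example: raw CLIP scores sit low and close together, yet 0.35 is still the clear winner
hits = [("img_17", 0.35), ("img_04", 0.31), ("img_92", 0.22), ("img_51", 0.21)]
print(select_results(hits, k=3))  # keeps img_17 and img_04, drops the rest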
4. Multi-Search: Combining Text and Image
The most advanced applications allow Weighted Hybrid Search. Imagine a user selects a photo of a blue shirt but then types "Make it red" in a search box.
The Workflow:
- Encode the blue shirt (Image Vector Vi).
- Encode "Red" (Text Vector Vt).
- Combine them into a weighted centroid: FinalQuery = (0.7 * Vi) + (0.3 * Vt).
- Perform the search with FinalQuery.
The vector search will find a shirt that matches the style of the image but shifts the semantic category toward the color red.
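Here is a minimal sketch of that weighted combination, again assuming a CLIP model that encodes both modalities; the 0.7/0.3 weights and file names are illustrative.
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")

def normalize(v):
    return v / np.linalg.norm(v)

# Encode both signals and bring them onto the unit sphere before mixing
v_image = normalize(model.encode([Image.open("blue_shirt.jpg")])[0])
v_text = normalize(model.encode(["red"])[0])

# Weighted blend: mostly the shirt's style, nudged toward the color red
final_query = normalize(0.7 * v_image + 0.3 * v_text)

# final_query can now be sent to your vector database as the query vector,
# e.g. collection.query(query_embeddings=[final_query.tolist()], n_results=5)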
5. Python Example: Implementing Visual Search with Chroma
import chromadb
import numpy as np
from PIL import Image
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from chromadb.utils.data_loaders import ImageLoader

# 1. Setup Chroma with the multi-modal helpers:
#    OpenCLIP encodes text and images, ImageLoader reads image files from disk
clip_ef = OpenCLIPEmbeddingFunction()
image_loader = ImageLoader()
client = chromadb.Client()
collection = client.get_or_create_collection(
    "my_gallery",
    embedding_function=clip_ef,
    data_loader=image_loader,
)

# 2. Add images by file path (the ImageLoader turns each URI into pixels)
# collection.add(ids=["img1"], uris=["./dog.jpg"])

# 3. TEXT-TO-IMAGE SEARCH
results_text = collection.query(
    query_texts=["A fluffy white animal"],
    n_results=1,
)
print(f"Found via text: {results_text['ids']}")

# 4. IMAGE-TO-IMAGE SEARCH
# Pass a new image (as a numpy array) and find the most similar ones in the DB
query_image = np.array(Image.open("./reference_photo.png"))
results_img = collection.query(
    query_images=[query_image],
    n_results=2,
)
print(f"Found via image: {results_img['ids']}")
6. Real-World Applications
- Stock Photography: Searching "Peaceful morning" instead of using 50 keywords.
- Retail: A shopper takes a photo of a shoe in the street and finds it in your store.
- Media Archives: A news editor searches for "Politicians shaking hands" across 50 years of footage.
- Crime Investigation: Finding a specific car color and type across thousands of CCTV cameras.
Summary and Key Takeaways
Text-to-Image and Image-to-Image are the two pillars of visual intelligence.
- CLIP places text and images in the same coordinate system, so they can be compared directly.
- Text-to-Image captures conceptual "meaning" better than keyword tags.
- Image-to-Image captures similarity in color, style, and pattern.
- Vector Arithmetic (Adding text and image vectors) allows for "Interactive Search."
In the next lesson, we will look at Audio and Speech Embeddings, exploring how sounds and music can be turned into coordinates.
Exercise: Interactive Search UI
You are building a "Furniture Search" app.
- A user uploads a picture of a Modern Sofa.
- The user then selects a filter for "Minimalist" (Text).
- How would you combine these two signals into a single vector query?
- If the user doesn't like the results, would you increase the weight of the Image or the Text?
- When designing this system, why is a vector search more powerful than just showing "Sofas" that are tagged as "Minimalist"?