Project: Building a Local Semantic Search Index with Chroma

Put your knowledge into practice. Build a complete, persistent semantic search engine for local text files using Python and ChromaDB.

Project: Building Your Own Semantic Search Engine

Congratulations! You have finished the theoretical core of our Chroma module. Now, it's time to build.

In this session, we will move beyond code snippets and build a production-structured local search engine. We will ingest a directory of text files, split them into overlapping chunks, store them in a persistent Chroma database, and build a search CLI (Command Line Interface).


1. The Project Goals

By the end of this exercise, you will have a Python tool that can:

  1. Discover: Find all .txt files in a given folder.
  2. Chunk: Split those files into meaningful pieces.
  3. Persist: Index those pieces into Chroma and save them to disk.
  4. Retrieve: Ask a natural language question and get the most relevant paragraphs.

2. Setting Up the Environment

Ensure you have your tools ready. You will need chromadb and sentence-transformers.

pip install chromadb sentence-transformers

Let's create a folder structure:

/my_search_project
  /data           <-- Put some .txt files here
  /db             <-- Chroma will save data here
  ingest.py       <-- Our ingestion logic
  search.py       <-- Our search logic
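
Before writing the project code, it can be worth a one-off sanity check that both packages import and that the embedding model downloads correctly. Here is a minimal sketch (the file name check_setup.py is just a suggestion and is not part of the project structure):

# check_setup.py -- optional, one-off sanity check
import chromadb
from chromadb.utils import embedding_functions

# The first call downloads the all-MiniLM-L6-v2 model; later calls reuse the local cache
ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

# Embed one sentence and report the vector size (384 dimensions for this model)
vector = ef(["hello world"])[0]
print(f"chromadb {chromadb.__version__} OK, embedding dimension: {len(vector)}")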

3. Step 1: The Ingestor (ingest.py)

This script handles the heavy lifting of turning files into vectors.

import os
import chromadb
from chromadb.utils import embedding_functions

# 1. Configuration
DATA_DIR = "./data"
DB_PATH = "./db"
COLLECTION_NAME = "local_docs_v1"

# 2. Initialize Chroma (cosine space, so 1 - distance in search.py is a real similarity score)
client = chromadb.PersistentClient(path=DB_PATH)
ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
collection = client.get_or_create_collection(
    name=COLLECTION_NAME,
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"}
)

def chunk_text(text, size=500, overlap=50):
    # Fixed-size character chunking (simplified for the exercise):
    # each chunk starts size - overlap characters after the previous one
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def run_ingestion():
    files = [f for f in os.listdir(DATA_DIR) if f.endswith('.txt')]
    
    for filename in files:
        print(f"Processing {filename}...")
        path = os.path.join(DATA_DIR, filename)
        
        with open(path, 'r', encoding='utf-8', errors='ignore') as f:
            content = f.read()

        chunks = chunk_text(content)
        if not chunks:
            # Skip empty files so we never call upsert with empty lists
            continue

        # Prepare metadata and IDs for each chunk
        doc_ids = [f"{filename}_{i}" for i in range(len(chunks))]
        metadatas = [{"source": filename, "chunk_index": i} for i in range(len(chunks))]

        # Upsert (instead of add) so re-running ingestion on the same files
        # updates existing chunks rather than tripping over duplicate IDs
        collection.upsert(
            documents=chunks,
            ids=doc_ids,
            metadatas=metadatas
        )

if __name__ == "__main__":
    if not os.path.exists(DATA_DIR):
        os.makedirs(DATA_DIR)
        print(f"Please put some .txt files in '{DATA_DIR}' and run again.")
    else:
        run_ingestion()
        print("Ingestion complete!")

4. Step 2: The Search Engine (search.py)

This script allows you to query the index we just built.

import chromadb
from chromadb.utils import embedding_functions
import sys

# 1. Load the existing DB
client = chromadb.PersistentClient(path="./db")
ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
collection = client.get_collection(name="local_docs_v1", embedding_function=ef)

def search_query(query_text):
    results = collection.query(
        query_texts=[query_text],
        n_results=3,
        include=['documents', 'metadatas', 'distances']
    )
    
    print(f"\nResults for: '{query_text}'")
    print("="*40)
    
    for i in range(len(results['ids'][0])):
        doc = results['documents'][0][i]
        meta = results['metadatas'][0][i]
        dist = results['distances'][0][i]
        
        print(f"SOURCE: {meta['source']}")
        print(f"SIMILARITY: {1-dist:.4f}")
        print(f"CONTENT: {doc[:150]}...")
        print("-" * 20)

if __name__ == "__main__":
    if len(sys.argv) > 1:
        query = " ".join(sys.argv[1:])
        search_query(query)
    else:
        print("Usage: python search.py 'your natural language question'")

5. Testing Your Project

  1. Add a text file to /data containing the plot of a movie or a Wikipedia article.
  2. Run python ingest.py. Wait for it to finish.
  3. Run python search.py "Why did the main character do that?".
  4. Observe the results. Notice how the search finds the relevant paragraph even if your words don't exactly match the text.

6. Going Further: The Edge Case Challenge

Now that the basics work, try to break it:

  1. Typo Handling: Search with a typo (e.g., "Pinaeapple"). Does it still find the result?
  2. Concept Search: If your text is about "Cooking," search for "Culinary arts."
  3. Large Files: Add a 50MB log file. Does it slow down the ingestion significantly? (Monitor your RAM!) One way to keep memory in check is to add chunks in batches, as sketched below.
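
A large file produces thousands of chunks, and passing them all to a single upsert call means embedding them all in one go. Here is a minimal batching sketch (the batch size of 200 is an arbitrary starting point; this would replace the single upsert call inside run_ingestion):

BATCH_SIZE = 200  # arbitrary; tune to your machine

def upsert_in_batches(chunks, doc_ids, metadatas):
    # Send at most BATCH_SIZE chunks per call so embeddings are computed in smaller groups
    for start in range(0, len(chunks), BATCH_SIZE):
        end = start + BATCH_SIZE
        collection.upsert(
            documents=chunks[start:end],
            ids=doc_ids[start:end],
            metadatas=metadatas[start:end]
        )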

Summary and Module 5 Wrap-up

You have successfully built a local AI infrastructure!

  • You mastered Persistent Storage.
  • You implemented Document Chunking with overlap.
  • You used Local Embedding Models (sentence-transformers).
  • You built a CLI interface for vector search.

What's Next?

In Module 6: Getting Started with Pinecone, we take our skills to the cloud. We will learn how to move from a single machine to a managed service that can handle billions of vectors across a global infrastructure.


Final Exercise: Multi-Collection Support

Modify your ingest.py so that it uses a different collection name based on the folder name.

  • Files in /data/work go to the work_docs collection.
  • Files in /data/personal go to the personal_docs collection.

Then, update search.py so the user can choose which "category" to search.
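
If you want a hint for the ingestion side, here is one possible shape. It is a sketch only, assuming the subfolder name maps directly to the collection name (work -> work_docs) and reusing client, ef, chunk_text, and DATA_DIR from ingest.py:

def run_ingestion_multi():
    # One collection per subfolder of ./data, e.g. data/work -> "work_docs"
    for category in os.listdir(DATA_DIR):
        category_dir = os.path.join(DATA_DIR, category)
        if not os.path.isdir(category_dir):
            continue
        collection = client.get_or_create_collection(
            name=f"{category}_docs",
            embedding_function=ef,
            metadata={"hnsw:space": "cosine"}
        )
        for filename in os.listdir(category_dir):
            if not filename.endswith(".txt"):
                continue
            with open(os.path.join(category_dir, filename), 'r', encoding='utf-8', errors='ignore') as f:
                chunks = chunk_text(f.read())
            if not chunks:
                continue
            collection.upsert(
                documents=chunks,
                ids=[f"{category}/{filename}_{i}" for i in range(len(chunks))],
                metadatas=[{"source": filename, "category": category, "chunk_index": i}
                           for i in range(len(chunks))]
            )

On the search side, one option is to take the category as the first CLI argument and open the matching collection with client.get_collection(name=f"{category}_docs", embedding_function=ef).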


Congratulations on finishing Module 5! You are officially a Vector Database practitioner.
