
Project: Building a Local Semantic Search Index with Chroma
Put your knowledge into practice. Build a complete, persistent semantic search engine for local text files using Python and ChromaDB.
Project: Building Your Own Semantic Search Engine
Congratulations! You have finished the theoretical core of our Chroma module. Now, it's time to build.
In this session, we will move beyond code snippets and build a production-structured local search engine. We will ingest a directory of text files, chunk them semantically, store them in a persistent Chroma database, and build a search CLI (Command Line Interface).
1. The Project Goals
By the end of this exercise, you will have a Python tool that can:
- Discover: Find all `.txt` files in a given folder.
- Chunk: Split those files into meaningful pieces.
- Persist: Index those pieces into Chroma and save them to disk.
- Retrieve: Ask a natural language question and get the most relevant paragraphs.
2. Setting Up the Environment
Ensure you have your tools ready. You will need `chromadb` and `sentence-transformers`.
```bash
pip install chromadb sentence-transformers
```
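A quick way to confirm the install worked is to import both packages. This is just a sanity check; the versions printed will depend on your environment.
```python
# Sanity check: both packages import and report their versions.
import chromadb
import sentence_transformers

print("chromadb:", chromadb.__version__)
print("sentence-transformers:", sentence_transformers.__version__)
```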
Let's create a folder structure:
```
/my_search_project
    /data          <-- Put some .txt files here
    /db            <-- Chroma will save data here
    ingest.py      <-- Our ingestion logic
    search.py      <-- Our search logic
```
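If you prefer to script the setup, a minimal sketch is below. The folder names simply mirror the layout above; adjust them if you use a different structure.
```python
import os

# Create the data and db folders next to the scripts (safe to re-run).
for folder in ("data", "db"):
    os.makedirs(folder, exist_ok=True)
```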
3. Step 1: The Ingestor (ingest.py)
This script handles the heavy lifting of turning files into vectors.
```python
import os
import chromadb
from chromadb.utils import embedding_functions

# 1. Configuration
DATA_DIR = "./data"
DB_PATH = "./db"
COLLECTION_NAME = "local_docs_v1"

# 2. Initialize Chroma
client = chromadb.PersistentClient(path=DB_PATH)
ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
collection = client.get_or_create_collection(name=COLLECTION_NAME, embedding_function=ef)

def chunk_text(text, size=500, overlap=50):
    # Simple fixed-size chunking (simplified for the exercise)
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def run_ingestion():
    files = [f for f in os.listdir(DATA_DIR) if f.endswith('.txt')]
    for filename in files:
        print(f"Processing {filename}...")
        path = os.path.join(DATA_DIR, filename)
        with open(path, 'r', encoding='utf-8') as f:
            content = f.read()

        chunks = chunk_text(content)
        if not chunks:
            # Skip empty files; there is nothing to index
            continue

        # Prepare metadata and IDs for each chunk
        doc_ids = [f"{filename}_{i}" for i in range(len(chunks))]
        metadatas = [{"source": filename, "chunk_index": i} for i in range(len(chunks))]

        # Add to Chroma
        collection.add(
            documents=chunks,
            ids=doc_ids,
            metadatas=metadatas
        )

if __name__ == "__main__":
    if not os.path.exists(DATA_DIR):
        os.makedirs(DATA_DIR)
        print(f"Please put some .txt files in '{DATA_DIR}' and run again.")
    else:
        run_ingestion()
        print("Ingestion complete!")
```
4. Step 2: The Search Engine (search.py)
This script allows you to query the index we just built.
```python
import sys

import chromadb
from chromadb.utils import embedding_functions

# 1. Load the existing DB
client = chromadb.PersistentClient(path="./db")
ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
collection = client.get_collection(name="local_docs_v1", embedding_function=ef)

def search_query(query_text):
    results = collection.query(
        query_texts=[query_text],
        n_results=3,
        include=['documents', 'metadatas', 'distances']
    )

    print(f"\nResults for: '{query_text}'")
    print("=" * 40)

    for i in range(len(results['ids'][0])):
        doc = results['documents'][0][i]
        meta = results['metadatas'][0][i]
        dist = results['distances'][0][i]

        print(f"SOURCE: {meta['source']}")
        print(f"DISTANCE: {dist:.4f} (lower is more similar)")
        print(f"CONTENT: {doc[:150]}...")
        print("-" * 20)

if __name__ == "__main__":
    if len(sys.argv) > 1:
        query = " ".join(sys.argv[1:])
        search_query(query)
    else:
        print("Usage: python search.py 'your natural language question'")
```
5. Testing Your Project
- Add a text file to `/data` containing the plot of a movie or a Wikipedia article.
- Run `python ingest.py`. Wait for it to finish.
- Run `python search.py "Why did the main character do that?"`.
- Observe the results. Notice how the search finds the relevant paragraph even if your words don't exactly match the text.
6. Going Further: The Edge Case Challenge
Now that the basics work, try to break it:
- Typo Handling: Search with a typo (e.g., "Pinaeapple"). Does it still find the result?
- Concept Search: If your text is about "Cooking," search for "Culinary arts."
- Large Files: Add a 50MB log file. Does it slow down the ingestion significantly? (Monitor your RAM!) One mitigation is sketched after this list.
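For the large-file case, one common mitigation is to send chunks to Chroma in batches instead of one giant `add` call. Here is a minimal sketch; the batch size of 500 is an arbitrary assumption you should tune.
```python
BATCH_SIZE = 500  # assumption: adjust based on your RAM and document sizes

def add_in_batches(collection, chunks, doc_ids, metadatas, batch_size=BATCH_SIZE):
    # Slice the parallel lists so each add() call stays small.
    for start in range(0, len(chunks), batch_size):
        end = start + batch_size
        collection.add(
            documents=chunks[start:end],
            ids=doc_ids[start:end],
            metadatas=metadatas[start:end],
        )
```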
Summary and Module 5 Wrap-up
You have successfully built your own local AI search infrastructure!
- You mastered Persistent Storage.
- You implemented a document chunking strategy.
- You used Local Embedding Models (sentence-transformers).
- You built a CLI for vector search.
What's Next?
In Module 6: Getting Started with Pinecone, we take our skills to the cloud. We will learn how to move from a single machine to a managed service that can handle billions of vectors on globally distributed infrastructure.
Final Exercise: Multi-Collection Support
Modify your ingest.py so that it uses a different collection name based on the folder name.
- Files in `/data/work` go to the `work_docs` collection.
- Files in `/data/personal` go to the `personal_docs` collection.
Then, update search.py so the user can choose which "category" to search.
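As a starting hint (not a full solution), the collection name can be derived from the subfolder name. The `{category}_docs` naming scheme below is an assumption you are free to change.
```python
import os
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./db")
ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

# One collection per subfolder of ./data, e.g. "work" -> "work_docs".
for category in sorted(os.listdir("./data")):
    if not os.path.isdir(os.path.join("./data", category)):
        continue
    collection = client.get_or_create_collection(
        name=f"{category}_docs",
        embedding_function=ef,
    )
    print(f"Files under ./data/{category} would be ingested into '{collection.name}'")
```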