
Project 1: Building a Semantic Search Engine
Build a complete, end-to-end semantic search engine for technical documentation, covering ingestion, indexing, and a retrieval UI.
In this first hands-on project, you will apply everything you've learned to build a professional-grade semantic search engine. We will use a dataset of technical documentation and build a system that can find answers even when the user uses different terminology than the writer.
1. Project Requirements
- Data: A collection of 500+ Markdown or HTML files.
- Database: Local ChromaDB (for speed and zero cost).
- Embedding: `sentence-transformers/all-MiniLM-L6-v2`.
- UI: A simple Python wrapper or a FastAPI endpoint.
2. Ingestion Pipeline
- Crawler: Read all `.md` files in a directory.
- Chunker: Split files by header (H1, H2).
- Embedder: Convert text chunks into 384-dimensional vectors.
- Indexer: Upsert into ChromaDB with `file_path` and `heading` as metadata.
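The header-based chunker from the pipeline above can be sketched in plain Python. The helper name `chunk_by_headers` and the regex are illustrative choices, not part of the project skeleton:

```python
import re

def chunk_by_headers(text):
    """Split Markdown text into chunks at H1/H2 headers.

    Returns a list of (heading, body) tuples; any text before the
    first header gets an empty heading. Illustrative sketch only.
    """
    chunks = []
    heading = ""
    body_lines = []
    for line in text.splitlines():
        # Match lines starting with "#" or "##" followed by a title
        match = re.match(r"^(#{1,2})\s+(.*)", line)
        if match:
            if body_lines or heading:
                chunks.append((heading, "\n".join(body_lines).strip()))
            heading = match.group(2).strip()
            body_lines = []
        else:
            body_lines.append(line)
    if body_lines or heading:
        chunks.append((heading, "\n".join(body_lines).strip()))
    return chunks
```

Keeping the heading with each chunk is what lets you store it as metadata later.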
3. The Core Code (Python)
```python
import os
import chromadb
from sentence_transformers import SentenceTransformer

# 1. Init
client = chromadb.PersistentClient(path="./docs_db")
collection = client.get_or_create_collection("tech_docs")
model = SentenceTransformer('all-MiniLM-L6-v2')

# 2. Indexing Function
def index_docs(doc_dir):
    for filename in os.listdir(doc_dir):
        with open(os.path.join(doc_dir, filename)) as f:
            text = f.read()
        # Simple chunking by paragraph
        chunks = text.split("\n\n")
        for i, chunk in enumerate(chunks):
            vec = model.encode(chunk).tolist()
            collection.add(
                documents=[chunk],
                embeddings=[vec],
                metadatas=[{"source": filename}],
                ids=[f"{filename}_{i}"]  # unique per file and chunk
            )

# 3. Search Function
def search(query):
    query_vec = model.encode(query).tolist()
    return collection.query(query_embeddings=[query_vec], n_results=5)
```
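Under the hood, `collection.query` ranks stored chunks by vector distance (ChromaDB defaults to squared L2, served from an HNSW index). The brute-force version of that ranking, shown here purely for intuition, looks like this:

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_query(query_vec, doc_vecs, n_results=5):
    """Rank stored vectors by distance to the query (smaller = closer).

    doc_vecs maps id -> vector; returns the n_results closest ids.
    Illustrates what the index computes; a real ANN index avoids
    scanning every vector.
    """
    ranked = sorted(doc_vecs,
                    key=lambda doc_id: squared_l2(query_vec, doc_vecs[doc_id]))
    return ranked[:n_results]
```

This is why semantically similar phrasings match: "set it up" and "installation" land near each other in the 384-dimensional embedding space, so their distance is small even with zero word overlap.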
4. Evaluation Criteria
- Recall: Does the engine find the "Installation Guide" when searching for "How do I set it up?"
- Latency: Is the query response under 100ms?
- Metadata: Do the results include the correct file source for clicking?
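To check the latency criterion, you can time the query path with `time.perf_counter`. The sketch below times an arbitrary callable; `search` from the core code would be the real argument:

```python
import time

def measure_latency_ms(fn, *args, repeats=10):
    """Run fn(*args) repeats times and return the mean wall-clock
    latency in milliseconds. A rough sketch: for serious numbers,
    discard warm-up runs and report percentiles, not just the mean.
    """
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    elapsed = time.perf_counter() - start
    return (elapsed / repeats) * 1000.0

# Typical use against the engine (assumes search() is defined):
# print(measure_latency_ms(search, "How do I set it up?"))
```

Note that the first query is usually slower because the model and index are cold; measure after a warm-up call before comparing against the 100ms target.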
Deliverables
- An active `./docs_db` folder containing your vectors.
- A `search.py` script that takes a console input and prints the top 3 results.
- A short explanation of your chunking logic.
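For the `search.py` deliverable, the printed output can be built from the dict that `collection.query` returns. The formatter below assumes the standard ChromaDB result shape (`documents`, `metadatas`, and `distances` each holding one list per query); the exact printing style is up to you:

```python
def format_results(results, top_k=3):
    """Render the first top_k hits of a ChromaDB query result as text.

    Assumes the usual result shape: results["documents"][0] is the
    hit list for the first (only) query, parallel to metadatas and
    distances. Sketch only; adapt to your own output style.
    """
    docs = results["documents"][0][:top_k]
    metas = results["metadatas"][0][:top_k]
    dists = results["distances"][0][:top_k]
    lines = []
    for rank, (doc, meta, dist) in enumerate(zip(docs, metas, dists), start=1):
        source = meta.get("source", "?")
        snippet = doc[:80].replace("\n", " ")
        lines.append(f"{rank}. [{source}] (dist={dist:.3f}) {snippet}")
    return "\n".join(lines)

# Typical use in search.py (assumes search() from the core code):
# query = input("Search: ")
# print(format_results(search(query)))
```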