
Project 1: Building a Semantic Search Engine
Build a complete, end-to-end semantic search engine for technical documentation, covering ingestion, indexing, and a retrieval UI.
In this first hands-on project, you will apply everything you've learned to build a professional-grade semantic search engine. We will use a dataset of technical documentation and build a system that can find answers even when the user uses different terminology than the writer.
1. Project Requirements
- Data: A collection of 500+ Markdown or HTML files.
- Database: Local ChromaDB (for speed and zero cost).
- Embedding: `sentence-transformers/all-MiniLM-L6-v2`.
- UI: A simple Python wrapper or a FastAPI endpoint.
2. Ingestion Pipeline
- Crawler: Read all `.md` files in a directory.
- Chunker: Split files by header (H1, H2).
- Embedder: Convert text chunks into 384-dimensional vectors.
- Indexer: Upsert into ChromaDB with `file_path` and `heading` as metadata.
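The header-based chunker from the pipeline above can be sketched in plain Python. The helper name `chunk_by_headers` and the regex are illustrative choices, not part of the project skeleton:

```python
import re

def chunk_by_headers(text):
    """Split Markdown text into chunks at H1/H2 headers.

    Returns a list of (heading, body) tuples; any text before the
    first header gets an empty heading. Illustrative sketch only.
    """
    chunks = []
    heading = ""
    body_lines = []
    for line in text.splitlines():
        # Match lines starting with "#" or "##" followed by a title
        match = re.match(r"^(#{1,2})\s+(.*)", line)
        if match:
            if body_lines or heading:
                chunks.append((heading, "\n".join(body_lines).strip()))
            heading = match.group(2).strip()
            body_lines = []
        else:
            body_lines.append(line)
    if body_lines or heading:
        chunks.append((heading, "\n".join(body_lines).strip()))
    return chunks
```

Keeping the heading with each chunk is what lets you store it as metadata later.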
3. The Core Code (Python)
```python
import os
import chromadb
from sentence_transformers import SentenceTransformer

# 1. Init
client = chromadb.PersistentClient(path="./docs_db")
collection = client.get_or_create_collection("tech_docs")
model = SentenceTransformer('all-MiniLM-L6-v2')

# 2. Indexing Function
def index_docs(doc_dir):
    for filename in os.listdir(doc_dir):
        with open(os.path.join(doc_dir, filename)) as f:
            text = f.read()
        # Simple chunking by paragraph
        chunks = text.split("\n\n")
        for i, chunk in enumerate(chunks):
            vec = model.encode(chunk).tolist()
            collection.add(
                documents=[chunk],
                embeddings=[vec],
                metadatas=[{"source": filename}],
                ids=[f"{filename}_{i}"]  # unique per file and chunk
            )

# 3. Search Function
def search(query):
    query_vec = model.encode(query).tolist()
    return collection.query(query_embeddings=[query_vec], n_results=5)
```
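Under the hood, `collection.query` ranks stored chunks by vector distance (ChromaDB defaults to squared L2, served from an HNSW index). The brute-force version of that ranking, shown here purely for intuition, looks like this:

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_query(query_vec, doc_vecs, n_results=5):
    """Rank stored vectors by distance to the query (smaller = closer).

    doc_vecs maps id -> vector; returns the n_results closest ids.
    Illustrates what the index computes; a real ANN index avoids
    scanning every vector.
    """
    ranked = sorted(doc_vecs,
                    key=lambda doc_id: squared_l2(query_vec, doc_vecs[doc_id]))
    return ranked[:n_results]
```

This is why semantically similar phrasings match: "set it up" and "installation" land near each other in the 384-dimensional embedding space, so their distance is small even with zero word overlap.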
4. Evaluation Criteria
- Recall: Does the engine find the "Installation Guide" when searching for "How do I set it up?"
- Latency: Is the query response under 100ms?
- Metadata: Do the results include the correct file source for clicking?
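To check the latency criterion, you can time the query path with `time.perf_counter`. The sketch below times an arbitrary callable; `search` from the core code would be the real argument:

```python
import time

def measure_latency_ms(fn, *args, repeats=10):
    """Run fn(*args) repeats times and return the mean wall-clock
    latency in milliseconds. A rough sketch: for serious numbers,
    discard warm-up runs and report percentiles, not just the mean.
    """
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    elapsed = time.perf_counter() - start
    return (elapsed / repeats) * 1000.0

# Typical use against the engine (assumes search() is defined):
# print(measure_latency_ms(search, "How do I set it up?"))
```

Note that the first query is usually slower because the model and index are cold; measure after a warm-up call before comparing against the 100ms target.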
Deliverables
- An active `./docs_db` folder containing your vectors.
- A `search.py` script that takes a console input and prints the top 3 results.
- A short explanation of your chunking logic.
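For the `search.py` deliverable, the printed output can be built from the dict that `collection.query` returns. The formatter below assumes the standard ChromaDB result shape (`documents`, `metadatas`, and `distances` each holding one list per query); the exact printing style is up to you:

```python
def format_results(results, top_k=3):
    """Render the first top_k hits of a ChromaDB query result as text.

    Assumes the usual result shape: results["documents"][0] is the
    hit list for the first (only) query, parallel to metadatas and
    distances. Sketch only; adapt to your own output style.
    """
    docs = results["documents"][0][:top_k]
    metas = results["metadatas"][0][:top_k]
    dists = results["distances"][0][:top_k]
    lines = []
    for rank, (doc, meta, dist) in enumerate(zip(docs, metas, dists), start=1):
        source = meta.get("source", "?")
        snippet = doc[:80].replace("\n", " ")
        lines.append(f"{rank}. [{source}] (dist={dist:.3f}) {snippet}")
    return "\n".join(lines)

# Typical use in search.py (assumes search() from the core code):
# query = input("Search: ")
# print(format_results(search(query)))
```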