Project 1: Building a Semantic Search Engine

Build a complete, end-to-end semantic search engine for technical documentation. Master ingestion, indexing, and retrieval UI.

In this first hands-on project, you will apply everything you've learned to build a professional semantic search engine. We will use a dataset of technical documentation and build a system that can find answers even when the user's terminology differs from the writer's.


1. Project Requirements

  • Data: A collection of 500+ Markdown or HTML files.
  • Database: Local ChromaDB (for speed and zero cost).
  • Embedding: sentence-transformers/all-MiniLM-L6-v2.
  • UI: A simple command-line Python script or a FastAPI endpoint.

2. Ingestion Pipeline

  1. Crawler: Read all .md files in a directory.
  2. Chunker: Split files by header (H1, H2); see the sketch after this list.
  3. Embedder: Convert text chunks into 384-dimensional vectors.
  4. Indexer: Upsert into ChromaDB with file_path and heading as metadata.
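
The reference code in section 3 falls back to simpler paragraph chunking; if you want the header-based chunking described in step 2, a minimal sketch (assuming ATX-style Markdown headings, i.e. lines starting with "#" or "##") could look like this:

import re

def chunk_by_header(text):
    # Split on H1/H2 Markdown headings, keeping each heading with its body.
    chunks = []
    heading, lines = "intro", []
    for line in text.splitlines():
        match = re.match(r"^#{1,2}\s+(.*)", line)
        if match:
            if lines:
                chunks.append((heading, "\n".join(lines).strip()))
            heading, lines = match.group(1), []
        else:
            lines.append(line)
    if lines:
        chunks.append((heading, "\n".join(lines).strip()))
    return [(h, body) for h, body in chunks if body]

Each (heading, body) pair can then be embedded and upserted with the heading stored as metadata, as step 4 requires.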

3. The Core Code (Python)

import os
import chromadb
from sentence_transformers import SentenceTransformer

# 1. Init: persistent local store plus the 384-dimensional MiniLM embedder
client = chromadb.PersistentClient(path="./docs_db")
collection = client.get_or_create_collection("tech_docs")
model = SentenceTransformer('all-MiniLM-L6-v2')

# 2. Indexing Function
def index_docs(doc_dir):
    for filename in os.listdir(doc_dir):
        if not filename.endswith(".md"):
            continue  # only index Markdown files
        with open(os.path.join(doc_dir, filename), encoding="utf-8") as f:
            text = f.read()
        # Simple chunking by paragraph; skip empty chunks
        chunks = [c.strip() for c in text.split("\n\n") if c.strip()]
        for i, chunk in enumerate(chunks):
            vec = model.encode(chunk).tolist()
            collection.add(
                documents=[chunk],
                embeddings=[vec],
                metadatas=[{"source": filename}],
                ids=[f"{filename}_{i}"]  # unique per file and chunk
            )

# 3. Search Function
def search(query, n_results=5):
    query_vec = model.encode(query).tolist()
    return collection.query(query_embeddings=[query_vec], n_results=n_results)
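
A quick smoke test of the pipeline, assuming your Markdown files live in a hypothetical ./docs directory:

# Build the index once, then run a sample query
index_docs("./docs")
results = search("How do I set it up?")
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print(meta["source"], "->", doc[:80])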

4. Evaluation Criteria

  • Recall: Does the engine find the "Installation Guide" when searching for "How do I set it up?" (a minimal check is sketched after this list)
  • Latency: Is the query response under 100 ms?
  • Metadata: Do the results include the correct source file, so the UI can link back to the document?
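
You can automate the recall and latency checks with a small harness like the one below; the gold pairs (query, expected source file) are hypothetical examples you would replace with queries for your own corpus:

import time

# Hypothetical gold pairs: each query should surface the named file in the top 5
GOLD = [
    ("How do I set it up?", "installation_guide.md"),
]

def evaluate(gold=GOLD):
    hits, total_ms = 0, 0.0
    for query, expected in gold:
        start = time.perf_counter()
        results = search(query)
        total_ms += (time.perf_counter() - start) * 1000
        sources = [m["source"] for m in results["metadatas"][0]]
        hits += expected in sources
    print(f"Recall@5: {hits / len(gold):.2f}, "
          f"avg latency: {total_ms / len(gold):.1f} ms")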

Deliverables

  1. A populated docs_db folder containing your vectors.
  2. A search.py script that reads a query from the console and prints the top 3 results (see the sketch after this list).
  3. A short explanation of your chunking logic.
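
A possible skeleton for deliverable 2, reusing the functions from section 3 (the output format is just one reasonable choice):

# search.py: minimal console loop over the index built above
if __name__ == "__main__":
    while True:
        query = input("Search (blank line to quit): ").strip()
        if not query:
            break
        results = search(query, n_results=3)
        docs = results["documents"][0]
        metas = results["metadatas"][0]
        for rank, (doc, meta) in enumerate(zip(docs, metas), start=1):
            print(f"{rank}. [{meta['source']}] {doc[:120]}")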

Ready to build? Let's turn documents into intelligence.
