Vector Databases: The Long-Term Memory of AI

Compare the leaders in the vector storage market. Learn how to choose between Chroma (Local), Pinecone (Serverless), and Weaviate (Self-hosted) for your production RAG system.

In Module 2, we learned that words can be converted into numbers called Embeddings. In Module 5, Lesson 2, we learned how to slice documents into chunks. Now, we need a place to store those embeddings and chunks so we can search them at light speed.

Enter the Vector Database.

Traditional databases (like MySQL) are good at finding exact matches: "Find user where id = 5". Vector databases are good at finding Near Matches: "Find the 5 text chunks that mean something similar to 'sick leave policy'."


1. How a Vector Database Works

Unlike a spreadsheet, a Vector DB stores each record as a point in a multi-dimensional space, so "similar meaning" becomes "nearby points."

graph TD
    A[Add Data: text + vector] --> B[Index: HNSW / IVF]
    B --> C[Storage: Cloud/Local Disk]
    
    D[Query: 'How to pay?'] --> E[Convert Query to Vector]
    E --> F[Vector Search: Find closest neighbors]
    F --> G[Return Top K Results]

The Keyword: "Top K"

In a vector database, we don't ask for "The Result." We ask for the Top K (usually Top 3 or Top 5) results. These are the most mathematically similar documents to the user's query.
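
What counts as "closest"? Most vector databases rank results by cosine similarity (or a distance metric) and return the K highest-scoring chunks. Here is a minimal sketch in plain NumPy with a made-up 4-dimensional embedding space; real databases use approximate indexes like HNSW instead of this brute-force comparison, but the Top K idea is the same:

import numpy as np

# Three stored chunks with hypothetical 4-dimensional embeddings
chunks = ["sick leave policy", "office location", "payroll schedule"]
vectors = np.array([
    [0.9, 0.1, 0.0, 0.2],   # "sick leave policy"
    [0.1, 0.8, 0.3, 0.0],   # "office location"
    [0.2, 0.1, 0.9, 0.1],   # "payroll schedule"
])

# Pretend embedding of the user's question "How much sick time do I get?"
query = np.array([0.85, 0.15, 0.05, 0.1])

# Cosine similarity between the query and every stored vector
scores = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))

# Top K = 2: print the two highest-scoring chunks
for i in np.argsort(scores)[::-1][:2]:
    print(chunks[i], round(float(scores[i]), 3))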


2. Comparing the Major Players

As an LLM Engineer, you will most likely choose one of these three.

A. Chroma (Local-First)

  • Role: Open-source and lightweight.
  • Best For: Prototypes, local research, and applications that run on a single machine or in a Docker container.
  • Pros: Super easy to set up. Free.
  • Cons: Harder to scale to millions of documents without complex infrastructure.

B. Pinecone (Serverless/Managed)

  • Role: The industry leader for "as-a-service" vector storage.
  • Best For: Production applications where you don't want to manage servers.
  • Pros: Handles billions of vectors. Managed scaling. Excellent "Metadata Filtering."
  • Cons: Can get expensive as your data grows. Vendor lock-in.

C. Weaviate (Hybrid/Modular)

  • Role: An open-source vector database that you can self-host or use as a cloud service.
  • Best For: Enterprises that want control over their infrastructure but also need scale.
  • Pros: Built-in vectorizer modules for many embedding models. Hybrid (keyword + vector) search and a GraphQL-style query API with cross-references between objects.
  • Cons: Steeper learning curve than Chroma.

3. Metadata Filtering (The Hybrid Search)

Pure vector search has a weakness: it has no notion of hard constraints like dates, authors, or departments. If you search for "Quarterly Report," it might return the 2019 report because its language is nearly identical to the 2024 one.

The Solution: Use Metadata. You store the year, department, and category alongside the vector.

Query: "Get Top 3 vectors for 'Quarterly Report' WHERE 'year' = 2024"


Code Example: Using Chroma (The Local Choice)

Chroma is great for learning because it requires no API keys.

import chromadb
from chromadb.utils import embedding_functions

# 1. Initialize the client (local storage, persisted to disk)
client = chromadb.PersistentClient(path="./my_database")

# 2. Create a Collection (like a table)
# get_or_create_collection lets you re-run the script without errors;
# we'll use Chroma's default embedding function
collection = client.get_or_create_collection(
    name="company_docs",
    embedding_function=embedding_functions.DefaultEmbeddingFunction()
)

# 3. Add Content
collection.add(
    documents=["Our office is in New York.", "We offer unlimited PTO."],
    metadatas=[{"category": "office"}, {"category": "hr"}],
    ids=["id1", "id2"]
)

# 4. Search
results = collection.query(
    query_texts=["Where do I work?"],
    n_results=1
)

# results['documents'] is a list of result lists (one per query text)
print(f"Nearest Match: {results['documents'][0][0]}")

4. Performance: Latency and Indexing

Vector search is fast, but "Indexing" can be slow. When you add a document, the database has to calculate where it fits in the high-dimensional map.

For a professional system, you should monitor:

  • Query Latency: How long does it take to find the neighbors? (A quick way to measure this is sketched after this list.)
  • Recall: Did the database actually find the most relevant document, or just a "good enough" one?
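
Query latency is easy to measure yourself: time the query call. A minimal sketch that reuses the collection from the Chroma example above; recall is harder to measure because it requires a labeled set of "correct" documents to compare against:

import time

start = time.perf_counter()
results = collection.query(query_texts=["What is the PTO policy?"], n_results=2)
latency_ms = (time.perf_counter() - start) * 1000

print(f"Query latency: {latency_ms:.1f} ms")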

Summary

  • Vector DBs enable semantic search instead of keyword matching.
  • Chroma is for local dev; Pinecone is for managed production; Weaviate is for enterprise control.
  • Metadata is the secret to building high-accuracy RAG systems that can filter by date, author, or category.

In the next lesson, we will look at Context Injection, the process of taking these results and feeding them back into the LLM.


Exercise: Database Selection

You are building an AI support bot for a small startup. They have 200 help articles. They want to ship a prototype by Friday and have $0 budget for infrastructure.

  1. Which Vector Database would you choose?
  2. Why?

Answer Logic: Chroma. It's free, runs locally or on their existing server, and 200 articles is a tiny dataset that Chroma handles with ease. No need for complex cloud configuration!
