Filtering and Metadata: Math meets Logic

In the real world, "Similar" is rarely enough. A user doesn't just want a "similar product"; they want a "similar product that is in stock, costs under $50, and is available in blue."

This is where Metadata comes in. Metadata is the "Structured Data" (JSON or key-value pairs) that you attach to your vectors during indexing.

In this lesson, we will explore how vector databases handle the marriage of Unstructured Vector Math and Structured Boolean Logic, and why the sequence of these operations (Filtering vs. Searching) defines the reliability of your system.

1. What is Metadata in a Vector DB?

A vector database entry typically consists of three parts:

ID: A unique string or integer.
Vector: The floating-point embedding (meaning).
Metadata: A JSON-like dictionary containing attributes (logic).

Example Metadata:

{
  "user_id": "user_456",
  "document_type": "legal_contract",
  "is_confidential": true,
  "last_updated": "2026-01-05",
  "language": "en"
}
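The three-part entry described above can be sketched as a small Python data structure. This is only an illustration: the `VectorRecord` class, the `doc_001` ID, and the 3-dimensional toy vector are assumptions made for the example, not the storage format of any particular database.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class VectorRecord:
    """One entry in a vector index: ID, embedding, and metadata."""
    id: str
    vector: list[float]
    metadata: dict[str, Any] = field(default_factory=dict)


record = VectorRecord(
    id="doc_001",  # hypothetical ID
    vector=[0.12, -0.48, 0.33],  # toy 3-dimensional embedding
    metadata={
        "user_id": "user_456",
        "document_type": "legal_contract",
        "is_confidential": True,
        "last_updated": "2026-01-05",
        "language": "en",
    },
)

# Metadata supports exact, structured checks that the vector alone cannot answer.
print(record.metadata["language"] == "en")
```

The vector carries the meaning used for similarity, while the metadata dictionary carries the attributes used for exact matching.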

2. Pre-filtering vs. Post-filtering

This is the most important architectural concept in metadata management.

Post-filtering (The Slow/Broken Way)

You perform a vector search for the top 100 neighbors.
You then look at the metadata of those 100 results and discard any that fail your filter (e.g., any where user_id != active_user).
The Problem: If only 2 out of those 100 match your filter, the user gets 2 results instead of the 100 they asked for. If zero match, the search returns nothing, even though there might be 1,000 other matching documents slightly further away in the vector space.

Pre-filtering (The Production Way)

You apply the filter first (e.g., the database identifies all points where user_id == active_user).
You then perform the vector search only within that subset.
The Result: You get the top-k nearest results among the items that meet your criteria. As long as at least k items match the filter, you get a full k results.

graph TD
    subgraph Post_Filtering
    A[Search Top 100] --> B[Filter for 'Blue']
    B --> C[Result: Maybe 5 items]
    end
    subgraph Pre_Filtering
    D[Identify 'Blue' Items] --> E[Search Top 100 in that list]
    E --> F[Result: Up to 100 items, all 'Blue']
    end
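To make the difference concrete, here is a minimal sketch that runs both strategies over a toy in-memory dataset using exact brute-force search. The item IDs, colors, and helper names (`nearest`, `post_filter`, `pre_filter`) are assumptions made for illustration. Real engines combine filters with approximate indexes rather than scanning every item.

```python
import math

# Toy dataset: (id, 2-d vector, metadata). Only two items are blue,
# and both sit far from the query point.
items = [
    ("a", [0.10, 0.10], {"color": "red"}),
    ("b", [0.11, 0.10], {"color": "red"}),
    ("c", [0.12, 0.11], {"color": "red"}),
    ("d", [0.80, 0.80], {"color": "blue"}),
    ("e", [0.90, 0.90], {"color": "blue"}),
]


def nearest(query, candidates, k):
    """Exact (brute-force) k-nearest neighbors by Euclidean distance."""
    return sorted(candidates, key=lambda item: math.dist(query, item[1]))[:k]


def post_filter(query, k, predicate):
    # Search first, then discard non-matching neighbors.
    return [item for item in nearest(query, items, k) if predicate(item[2])]


def pre_filter(query, k, predicate):
    # Narrow the candidate set first, then search inside it.
    matching = [item for item in items if predicate(item[2])]
    return nearest(query, matching, k)


def is_blue(meta):
    return meta["color"] == "blue"


query = [0.10, 0.10]

post_ids = [item[0] for item in post_filter(query, 2, is_blue)]
pre_ids = [item[0] for item in pre_filter(query, 2, is_blue)]
print(post_ids)  # [] : the two nearest neighbors are both red
print(pre_ids)   # ['d', 'e'] : the two nearest blue items
```

With the same query and the same k, post-filtering returns nothing while pre-filtering finds both blue items, which is exactly the failure mode described above.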

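Vector databases and search engines such as Pinecone, Chroma, and OpenSearch can apply metadata filters as part of the vector search itself rather than as a separate step afterward, which avoids the empty-result problem of naive post-filtering.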

3. Metadata Filtering for Multi-tenancy

Identity and Access Management (IAM) is the most critical use case for metadata. If you are building a SaaS app where millions of users store their data in one database, you must filter every query by org_id or user_id.

If you forget this, a user might ask "Show me my bank statements," and the vector search might retrieve another user's statements because they are semantically similar.

Production Query Pattern:

# Pinecone example
index.query(
    vector=embedding,
    top_k=5,
    filter={
        "org_id": {"$eq": "acme_inc"},
        "access_level": {"$in": ["admin", "editor"]}
    }
)
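One way to avoid forgetting the tenant filter is to build every filter through a single helper. The sketch below is an assumption about how you might structure this in application code, not part of any client library: `tenant_scoped_filter` is a hypothetical function that produces a filter dictionary in the same operator syntax as the query above.

```python
from typing import Any, Optional


def tenant_scoped_filter(
    org_id: str, user_filter: Optional[dict[str, Any]] = None
) -> dict[str, Any]:
    """Return a filter that always constrains results to one organization."""
    if not org_id:
        raise ValueError("org_id is required for every query")
    tenant_clause = {"org_id": {"$eq": org_id}}
    if not user_filter:
        return tenant_clause
    # Wrap in $and so a caller-supplied "org_id" key cannot override the tenant clause.
    return {"$and": [tenant_clause, user_filter]}


scoped = tenant_scoped_filter(
    "acme_inc", {"access_level": {"$in": ["admin", "editor"]}}
)
print(scoped)
```

The resulting dictionary would be passed as the `filter` argument of the query. Because the tenant clause is combined with `$and`, a conflicting `org_id` in the caller's filter yields an empty result set instead of another tenant's data.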

4. Range Filtering and Advanced Logic

Modern vector databases have moved beyond simple "equals" checks. You can now perform:

Range Queries: {"price": {"$lt": 50}}
Set Membership: {"category": {"$in": ["shoes", "clothing"]}}
Boolean Logic: {"$and": [{"status": "published"}, {"$or": [...]}]}

These filters allow you to treat your vector database as a hybrid of a search engine and a relational database.
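To show what these operators mean, here is a small pure-Python evaluator for this style of filter. It is a sketch of the semantics only, under the assumption that a missing field never matches. Actual databases differ in their handling of missing fields and in the set of operators they support, and they evaluate filters against indexes rather than one dictionary at a time.

```python
import operator
from typing import Any

# Comparison operators for a single field, keyed by Mongo-style name.
COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$in": lambda value, options: value in options,
}


def matches(metadata: dict[str, Any], flt: dict[str, Any]) -> bool:
    """Return True if `metadata` satisfies every clause in `flt`."""
    for key, condition in flt.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            if key not in metadata:
                return False  # a missing field never matches in this sketch
            # A bare value such as {"status": "published"} is shorthand for $eq.
            if not isinstance(condition, dict):
                condition = {"$eq": condition}
            for op, operand in condition.items():
                if not COMPARATORS[op](metadata[key], operand):
                    return False
    return True


item = {"price": 42, "category": "shoes", "status": "published"}
flt = {
    "$and": [
        {"status": "published"},
        {"$or": [{"price": {"$lt": 50}}, {"category": {"$in": ["clothing"]}}]},
    ]
}
print(matches(item, flt))  # True: published, and price is below 50
```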

5. Python Example: Implementing Filters in Chroma

Let's see how easy it is to add logic to your vector search using ChromaDB.

import chromadb

# 1. Initialize local client
client = chromadb.Client()
collection = client.create_collection("my_docs")

# 2. Add data with metadata
collection.add(
    embeddings=[[0.1, 0.1], [0.1, 0.2], [0.9, 0.9]],
    documents=["Policy A", "Policy B", "Privacy Policy"],
    metadatas=[
        {"dept": "HR", "year": 2024},
        {"dept": "IT", "year": 2024},
        {"dept": "Legal", "year": 2023}
    ],
    ids=["id1", "id2", "id3"]
)

# 3. Query with a Metadata Filter
results = collection.query(
    query_embeddings=[[0.1, 0.1]],
    n_results=5,
    where={"dept": "HR"} # <--- This is our Pre-filter
)

print(f"Results for HR filter: {results['documents']}")

# 4. Query with a logical operator ($and)
results_alt = collection.query(
    query_embeddings=[[0.1, 0.1]],
    n_results=5,
    where={
        "$and": [
            {"year": {"$gte": 2024}},
            {"dept": {"$ne": "Legal"}}
        ]
    }
)
print(f"Results for complex filter: {results_alt['documents']}")

6. The Cost of Metadata (Cardinality)

While metadata is powerful, adding too many attributes can slow down your database.

High Cardinality Attributes

A "High Cardinality" attribute is one where almost every row has a unique value (like a timestamp or a unique_request_id). If the database has to manage a separate index for 10 million unique IDs + the vector index, it can consume massive amounts of disk and memory.

Best Practice: Only index metadata fields that you actually plan to filter by. If you only want to display a piece of info (like the author's name) but never search/filter by it, store it in the metadata but don't define it as an indexed field in your database configuration.
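Before deciding which fields to index, it can help to measure how many distinct values each field actually has. The sketch below computes a simple cardinality ratio (distinct values divided by rows containing the field) over a sample of metadata. The `cardinality_report` helper and the sample rows are assumptions for illustration, and the code assumes metadata values are hashable scalars.

```python
from collections import defaultdict


def cardinality_report(metadatas):
    """Map each field to (distinct values / rows containing the field)."""
    distinct = defaultdict(set)
    present = defaultdict(int)
    for meta in metadatas:
        for field_name, value in meta.items():
            distinct[field_name].add(value)
            present[field_name] += 1
    return {name: len(distinct[name]) / present[name] for name in distinct}


rows = [
    {"dept": "HR", "year": 2024, "request_id": "r-001"},
    {"dept": "IT", "year": 2024, "request_id": "r-002"},
    {"dept": "HR", "year": 2023, "request_id": "r-003"},
    {"dept": "IT", "year": 2024, "request_id": "r-004"},
]

report = cardinality_report(rows)
for name, ratio in sorted(report.items(), key=lambda pair: pair[1]):
    print(f"{name}: {ratio:.2f}")
```

A ratio close to 1.0, as for `request_id` here, marks a high-cardinality field: a candidate for storing without indexing unless you genuinely need to filter by it.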

Summary

Metadata is the "Glue" that makes vector search safe and business-relevant.

Metadata adds context and structure to unstructured vectors.
Pre-filtering is mandatory for reliable search results at scale.
Multi-tenancy depends on metadata filtering to ensure data isolation.
Range, set-membership, and Boolean operators let you combine structured conditions with vector similarity.

Module 3 Wrap-up

You have completed the core theory of Vector Search Concepts. You understand Neighbors, ANN search, Indexing (HNSW/IVF), Recall tuning, and Metadata.

In Module 4: Vector Database Architecture, we will look at how these engines are actually built. We will explore the storage layers, query partitions, and the hardware that makes everything run smoothly.

Exercise: Schema Design

You are building a "Real Estate AI" search engine. A user says: "Find me a house like this one [Photo], but it must have 3 bedrooms, be in San Francisco, and cost less than $2M."

Identify which parts of this query are Vector-based.
Identify which parts are Metadata-based (Logical filters).
Design a JSON metadata schema for a "House" object that supports this query.

Think about which fields should be indexed for filtering and which should just be "blobs" for display.

Congratulations on finishing Module 3! On to Module 4.
