
Metadata Storage: Managing Structured Data in a Vector World
Explore the internal key-value stores that power metadata filtering. Learn how vector databases index non-vector data and the performance implications of deep metadata schemas.
Metadata Storage: The Hidden Database
A common misconception is that a vector database is just "a big list of vectors." In reality, most vector databases are Dual Databases. They contain:
- The Vector Engine: For high-dimensional math.
- The Metadata Store: For structured filtering and storage.
In this lesson, we will explore the secondary storage engine that lives inside your vector database. We will learn how systems like RocksDB or BadgerDB provide lightning-fast metadata filtering, and how "High Cardinality" metadata can slow your system to a crawl.
1. Why Vectors Need a Separate Storage Engine
Vector indexes (like HNSW) are terrible at storing strings and numbers. They are optimized for distance math, not for WHERE user_id = 'abc'.
To solve this, vector databases embed a standard Key-Value Store or a Document Store alongside the vector engine.
- When you add a document, the Vector Engine indexes the embedding.
- Simultaneously, the Metadata Store indexes the JSON metadata dictionary.
The ID Link
The connection between the two is the ID. When you query, the database finds the ID via the vector search and then "looks up" the metadata from the KV store using that ID.
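The ID link can be sketched with two plain dictionaries standing in for the two engines. Everything here (the stores, the `vector_search` stub) is an illustrative stand-in, not a real database API:

```python
# Two toy "engines" linked by a shared document ID.
vector_index = {
    "doc_1": [0.1, 0.9],
    "doc_2": [0.8, 0.2],
}
metadata_store = {  # the embedded KV store
    "doc_1": {"user_id": "abc", "price": 12},
    "doc_2": {"user_id": "xyz", "price": 75},
}

def vector_search(query, top_k=1):
    """Return IDs of the nearest vectors (brute-force squared Euclidean)."""
    def dist(vec):
        return sum((a - b) ** 2 for a, b in zip(query, vec))
    ranked = sorted(vector_index, key=lambda doc_id: dist(vector_index[doc_id]))
    return ranked[:top_k]

# 1. The vector engine finds the IDs...
hits = vector_search([0.0, 1.0], top_k=1)
# 2. ...then each ID is "looked up" in the metadata store.
results = [{"id": doc_id, **metadata_store[doc_id]} for doc_id in hits]
print(results)  # doc_1 is closest to [0.0, 1.0]
```

A real engine replaces the brute-force loop with an ANN index, but the ID handoff between the two stores works the same way.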
2. Common Metadata Engines
Most vector databases, managed or open source, use battle-tested KV stores as their metadata backbone:
- RocksDB: An LSM-tree based store used by Milvus and TiDB for its write speed and SSD optimization.
- BadgerDB: A similar LSM-tree KV store, popular in Go-based systems.
- SQLite: Used by local-first DBs like Chroma for simplicity.
- Elasticsearch/Lucene: Used by OpenSearch to handle complex text filters alongside vectors.
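To make the embedded-store idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout is invented for illustration; it is not Chroma's (or any other database's) actual schema:

```python
import json
import sqlite3

# An in-memory SQLite DB standing in for the embedded metadata store.
# The schema is illustrative, not any real vector DB's internal layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (doc_id TEXT PRIMARY KEY, payload TEXT)")

docs = {
    "doc_1": {"department_id": 4, "is_secret": False},
    "doc_2": {"department_id": 7, "is_secret": True},
}
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?)",
    [(doc_id, json.dumps(meta)) for doc_id, meta in docs.items()],
)

# The vector engine returns an ID; the embedded SQL store resolves it.
row = conn.execute(
    "SELECT payload FROM metadata WHERE doc_id = ?", ("doc_2",)
).fetchone()
print(json.loads(row[0]))
```

The same pattern applies to RocksDB or BadgerDB: the key is the document ID, and the value is the serialized metadata payload.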
3. Metadata Indexing: Inverted Indexes for Logic
How does a vector database find "all documents where price < 50" without scanning every row? It builds an Inverted Index (just like we discussed in Module 1 for keywords).
For example, a range index on the Price attribute maps value buckets to document IDs:
- Price 0–10: IDs [1, 44, 98]
- Price 10–50: IDs [2, 5, 88]
When you run a filtered search, the database retrieves the Bitset (a list of 1s and 0s) representing the matching IDs and uses it as a mask for the vector search.
graph TD
A[Filter: 'Color=Red'] --> B[Metadata Index]
B --> C[Bitset: 1011001]
D[Query Vector] --> E[Vector Index]
E --> F[Unfiltered Results]
C & F --> G[Intersect]
G --> H[Final Filtered Results]
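The intersect step in the diagram can be sketched with Python integers as bitsets (real engines typically use compressed bitmap structures such as roaring bitmaps; the data here is made up):

```python
# Documents 0..6 with a color attribute each.
colors = ["red", "blue", "red", "red", "blue", "blue", "red"]

# Build the inverted-index entry for 'Color=Red' as a bitset:
# bit i set => document i matches the filter.
red_mask = 0
for i, color in enumerate(colors):
    if color == "red":
        red_mask |= 1 << i

# Pretend the vector index returned these candidate doc IDs, best first.
candidates = [5, 0, 3, 1, 6]

# Intersect: keep only candidates whose bit is set in the filter mask.
filtered = [doc for doc in candidates if red_mask & (1 << doc)]
print(f"mask={red_mask:07b}, filtered={filtered}")
```

Because the mask is a single machine-level bit pattern, the intersection costs one AND per candidate, which is why bitset filtering stays fast even over millions of documents.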
4. The Cardinality Problem
As an AI engineer, you must be careful about what you put in your metadata.
What is Cardinality?
Cardinality is the "uniqueness" of a field.
- Low Cardinality: is_active (only 2 possible values).
- High Cardinality: timestamp_ms or unique_user_guid (millions of values).
The Performance Impact:
If you index a high-cardinality field like timestamp, the Metadata Store has to maintain a massive index that might grow larger than the Vector Index itself. This consumes the RAM that should be used for your vector searches, leading to "Paging" and massive slowdowns.
5. Python Concept: Simulating Metadata Performance
Let's see why filtering the metadata separately from the vector search is important for memory management.
# A conceptual example of a Metadata Index
import sys

# High cardinality (worst case): one index entry per unique value.
high_card = {f"id_{i}": {"timestamp": i} for i in range(1_000_000)}

# Low cardinality (best case): two buckets covering all one million IDs.
low_card = {
    "active": [f"id_{i}" for i in range(500_000)],
    "inactive": [f"id_{i}" for i in range(500_000, 1_000_000)],
}

# sys.getsizeof() only counts the container itself, so we also add the
# size of the nested values for a fairer comparison.
size_h = sys.getsizeof(high_card) + sum(sys.getsizeof(v) for v in high_card.values())
size_l = sys.getsizeof(low_card) + sum(sys.getsizeof(v) for v in low_card.values())

print(f"High Cardinality Index Size: {size_h / 1024 / 1024:.2f} MB")
print(f"Low Cardinality Index Size: {size_l / 1024 / 1024:.2f} MB")

# Search simulation:
# Finding 'active' is O(1) in the low-cardinality dict.
# A timestamp range scan is O(log n) with a sorted index, but that index
# holds one entry per document instead of one entry per category.
6. Best Practices for Metadata Schemas
- Only index what you filter by: If you just need to display the "Author Name" in the UI, store it in the metadata but don't tell the database to "Index" that field.
- Normalize Strings: Instead of storing full text in metadata, store Category IDs.
- Use Booleans for Speed: Bitwise logic on booleans is the fastest metadata operation.
- Be Wary of Large Payloads: If you store a 20-page transcript in the metadata of every vector, your database exports will be massive and slow. Better to store a URL or a foreign key to a standard SQL database.
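The practices above can be summarized in a single upsert payload. The field split and the separate `indexed_fields` declaration are hypothetical, since every database declares indexed fields differently:

```python
# A hypothetical upsert payload applying the schema rules above.
# How you mark a field as "indexed" varies per database.
doc = {
    "id": "memo_42",
    "vector": [0.12, 0.56, 0.91],
    "metadata": {
        # Indexed: used in WHERE-style filters.
        "department_id": 7,       # normalized ID, not the string "Engineering"
        "is_secret": False,       # boolean -> cheapest possible filter
        # Stored only: displayed in the UI, never filtered on.
        "author_name": "A. Rivera",
        # Large payloads stay outside: store a pointer, not the full text.
        "transcript_url": "https://files.example.com/memo_42.txt",
    },
}
# Declare only the filterable fields for indexing.
indexed_fields = ["department_id", "is_secret"]
print(indexed_fields)
```

Everything else stays retrievable by ID but never bloats the metadata indexes.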
Summary and Key Takeaways
Metadata storage is the unsung hero of the vector database.
- Dual-engine architecture: Vector math and Boolean logic live together but run separately.
- Metadata Stores use KV engines like RocksDB or SQLite.
- Cardinality is everything: High-cardinality metadata is a silent performance killer.
- Index selectively: Only index fields used in WHERE clauses (filters).
In the next lesson, we wrap up Module 4 by looking at Scaling and Sharding, exploring how we distribute this dual-engine architecture across multiple physical servers.
Exercise: Schema Optimization
You are building a "Corporate Memo Search." Each memo has:
- memo_id (UUID)
- content_vector
- author_name (String)
- department_id (Integer)
- is_secret (Boolean)
- word_count (Integer)
- Which of these fields should be Indexed in the metadata store?
- Which should just be Stored (for display only)?
- If you frequently search by "Memos written in the last hour," should you index the created_at timestamp? What is the risk?