
Metadata Storage: Managing Structured Data in a Vector World
Explore the internal key-value stores that power metadata filtering. Learn how vector databases index non-vector data and the performance implications of deep metadata schemas.
Metadata Storage: The Hidden Database
A common misconception is that a vector database is just "a big list of vectors." In reality, most vector databases are Dual Databases. They contain:
- The Vector Engine: For high-dimensional math.
- The Metadata Store: For structured filtering and storage.
In this lesson, we will explore the secondary storage engine that lives inside your vector database. We will learn how systems like RocksDB or BadgerDB provide lightning-fast metadata filtering, and how "High Cardinality" metadata can slow your system to a crawl.
1. Why Vectors Need a Separate Storage Engine
Vector indexes (like HNSW) are terrible at storing strings and numbers. They are optimized for distance math, not for WHERE user_id = 'abc'.
To solve this, vector databases embed a standard Key-Value Store or a Document Store alongside the vector engine.
- When you add a document, the Vector Engine indexes the embedding.
- Simultaneously, the Metadata Store indexes the JSON metadata dictionary.
The ID Link
The connection between the two is the ID. When you query, the database finds the ID via the vector search and then "looks up" the metadata from the KV store using that ID.
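The ID link can be sketched with two plain dictionaries standing in for the two engines. Everything here (the stores, the `vector_search` stub) is an illustrative stand-in, not a real database API:

```python
# Two toy "engines" linked by a shared document ID.
vector_index = {
    "doc_1": [0.1, 0.9],
    "doc_2": [0.8, 0.2],
}
metadata_store = {  # the embedded KV store
    "doc_1": {"user_id": "abc", "price": 12},
    "doc_2": {"user_id": "xyz", "price": 75},
}

def vector_search(query, top_k=1):
    """Return IDs of the nearest vectors (brute-force squared Euclidean)."""
    def dist(vec):
        return sum((a - b) ** 2 for a, b in zip(query, vec))
    ranked = sorted(vector_index, key=lambda doc_id: dist(vector_index[doc_id]))
    return ranked[:top_k]

# 1. The vector engine finds the IDs...
hits = vector_search([0.0, 1.0], top_k=1)
# 2. ...then each ID is "looked up" in the metadata store.
results = [{"id": doc_id, **metadata_store[doc_id]} for doc_id in hits]
print(results)  # doc_1 is closest to [0.0, 1.0]
```

A real engine replaces the brute-force loop with an ANN index, but the ID handoff between the two stores works the same way.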
2. Common Metadata Engines
Most vector databases, managed or open source, use battle-tested KV stores as their metadata backbone:
- RocksDB: An LSM-tree based store used by Milvus and TiDB for its write speed and SSD optimization.
- BadgerDB: A similar LSM-tree KV store, popular in Go-based systems.
- SQLite: Used by local-first DBs like Chroma for simplicity.
- Elasticsearch/Lucene: Used by OpenSearch to handle complex text filters alongside vectors.
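To make the embedded-store idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout is invented for illustration; it is not Chroma's (or any other database's) actual schema:

```python
import json
import sqlite3

# An in-memory SQLite DB standing in for the embedded metadata store.
# The schema is illustrative, not any real vector DB's internal layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (doc_id TEXT PRIMARY KEY, payload TEXT)")

docs = {
    "doc_1": {"department_id": 4, "is_secret": False},
    "doc_2": {"department_id": 7, "is_secret": True},
}
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?)",
    [(doc_id, json.dumps(meta)) for doc_id, meta in docs.items()],
)

# The vector engine returns an ID; the embedded SQL store resolves it.
row = conn.execute(
    "SELECT payload FROM metadata WHERE doc_id = ?", ("doc_2",)
).fetchone()
print(json.loads(row[0]))
```

The same pattern applies to RocksDB or BadgerDB: the key is the document ID, and the value is the serialized metadata payload.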
3. Metadata Indexing: Inverted Indexes for Logic
How does a vector database find "all documents where price < 50" without scanning every row? It builds an Inverted Index (just like we discussed in Module 1 for keywords).
For example, a range index on the Price attribute maps value buckets to document IDs:
- Price 0–10: IDs [1, 44, 98]
- Price 10–50: IDs [2, 5, 88]
When you run a filtered search, the database retrieves the Bitset (a list of 1s and 0s) representing the matching IDs and uses it as a mask for the vector search.
graph TD
A[Filter: 'Color=Red'] --> B[Metadata Index]
B --> C[Bitset: 1011001]
D[Query Vector] --> E[Vector Index]
E --> F[Unfiltered Results]
C & F --> G[Intersect]
G --> H[Final Filtered Results]
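The intersect step in the diagram can be sketched with Python integers as bitsets (real engines typically use compressed bitmap structures such as roaring bitmaps; the data here is made up):

```python
# Documents 0..6 with a color attribute each.
colors = ["red", "blue", "red", "red", "blue", "blue", "red"]

# Build the inverted-index entry for 'Color=Red' as a bitset:
# bit i set => document i matches the filter.
red_mask = 0
for i, color in enumerate(colors):
    if color == "red":
        red_mask |= 1 << i

# Pretend the vector index returned these candidate doc IDs, best first.
candidates = [5, 0, 3, 1, 6]

# Intersect: keep only candidates whose bit is set in the filter mask.
filtered = [doc for doc in candidates if red_mask & (1 << doc)]
print(f"mask={red_mask:07b}, filtered={filtered}")
```

Because the mask is a single machine-level bit pattern, the intersection costs one AND per candidate, which is why bitset filtering stays fast even over millions of documents.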
4. The Cardinality Problem
As an AI engineer, you must be careful about what you put in your metadata.
What is Cardinality?
Cardinality is the "uniqueness" of a field.
- Low Cardinality: is_active (only 2 possible values).
- High Cardinality: timestamp_ms or unique_user_guid (millions of values).
The Performance Impact:
If you index a high-cardinality field like timestamp, the Metadata Store has to maintain a massive index that might grow larger than the Vector Index itself. This consumes the RAM that should be used for your vector searches, leading to "Paging" and massive slowdowns.
5. Python Concept: Simulating Metadata Performance
Let's see why filtering the metadata separately from the vector search is important for memory management.
# A conceptual example of a Metadata Index
import sys

# High cardinality (worst case): one index entry per unique value.
high_card = {f"id_{i}": {"timestamp": i} for i in range(1_000_000)}

# Low cardinality (best case): two buckets covering all one million IDs.
low_card = {
    "active": [f"id_{i}" for i in range(500_000)],
    "inactive": [f"id_{i}" for i in range(500_000, 1_000_000)],
}

# sys.getsizeof() only counts the container itself, so we also add the
# size of the nested values for a fairer comparison.
size_h = sys.getsizeof(high_card) + sum(sys.getsizeof(v) for v in high_card.values())
size_l = sys.getsizeof(low_card) + sum(sys.getsizeof(v) for v in low_card.values())

print(f"High Cardinality Index Size: {size_h / 1024 / 1024:.2f} MB")
print(f"Low Cardinality Index Size: {size_l / 1024 / 1024:.2f} MB")

# Search simulation:
# Finding 'active' is O(1) in the low-cardinality dict.
# A timestamp range scan is O(log n) with a sorted index, but that index
# holds one entry per document instead of one entry per category.
6. Best Practices for Metadata Schemas
- Only index what you filter by: If you just need to display the "Author Name" in the UI, store it in the metadata but don't tell the database to "Index" that field.
- Normalize Strings: Instead of storing full text in metadata, store Category IDs.
- Use Booleans for Speed: Bitwise logic on booleans is the fastest metadata operation.
- Be Wary of Large Payloads: If you store a 20-page transcript in the metadata of every vector, your database exports will be massive and slow. Better to store a URL or a foreign key to a standard SQL database.
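The practices above can be summarized in a single upsert payload. The field split and the separate `indexed_fields` declaration are hypothetical, since every database declares indexed fields differently:

```python
# A hypothetical upsert payload applying the schema rules above.
# How you mark a field as "indexed" varies per database.
doc = {
    "id": "memo_42",
    "vector": [0.12, 0.56, 0.91],
    "metadata": {
        # Indexed: used in WHERE-style filters.
        "department_id": 7,       # normalized ID, not the string "Engineering"
        "is_secret": False,       # boolean -> cheapest possible filter
        # Stored only: displayed in the UI, never filtered on.
        "author_name": "A. Rivera",
        # Large payloads stay outside: store a pointer, not the full text.
        "transcript_url": "https://files.example.com/memo_42.txt",
    },
}
# Declare only the filterable fields for indexing.
indexed_fields = ["department_id", "is_secret"]
print(indexed_fields)
```

Everything else stays retrievable by ID but never bloats the metadata indexes.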
Summary and Key Takeaways
Metadata storage is the unsung hero of the vector database.
- Dual-engine architecture: Vector math and Boolean logic live together but run separately.
- Metadata Stores use KV engines like RocksDB or SQLite.
- Cardinality is everything: High-cardinality metadata is a silent performance killer.
- Index selectively: Only index fields used in WHERE clauses (filters).
In the next lesson, we wrap up Module 4 by looking at Scaling and Sharding, exploring how we distribute this dual-engine architecture across multiple physical servers.
Exercise: Schema Optimization
You are building a "Corporate Memo Search." Each memo has:
- memo_id (UUID)
- content_vector
- author_name (String)
- department_id (Integer)
- is_secret (Boolean)
- word_count (Integer)
- Which of these fields should be Indexed in the metadata store?
- Which should just be Stored (for display only)?
- If you frequently search by "Memos written in the last hour," should you index the created_at timestamp? What is the risk?