
Metadata Schemas for RAG
Design robust metadata schemas to enhance filtering, retrieval, and traceability in multimodal RAG systems.
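Metadata is the "secret sauce" of high-performance RAG. Vector search finds semantically similar chunks, while metadata filtering adds precision (restricting results to the right scope), security (enforcing access rules), and traceability (linking answers back to their sources).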
The Role of Metadata
In a production system, you rarely want a "naive" search. You usually want to search with constraints:
- "Find info in the 2024 reports only."
- "Show me results from the Legal department."
- "Only include documents with High confidence scores."
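Each of the constraints above amounts to a predicate over chunk metadata that is applied alongside the similarity search. The following is a minimal sketch in plain Python, assuming an in-memory list of chunk dicts; the field names `year`, `department`, and `confidence` and the helper `filter_chunks` are hypothetical, and a real vector database would normally accept such conditions through its own query filter instead.

```python
from typing import Any, Dict, List, Optional

# Hypothetical chunk records; the field names are illustrative assumptions.
CHUNKS: List[Dict[str, Any]] = [
    {"chunk_id": "c1", "year": 2024, "department": "Legal", "confidence": 0.95},
    {"chunk_id": "c2", "year": 2023, "department": "Legal", "confidence": 0.99},
    {"chunk_id": "c3", "year": 2024, "department": "Finance", "confidence": 0.97},
    {"chunk_id": "c4", "year": 2024, "department": "Legal", "confidence": 0.40},
]


def filter_chunks(
    chunks: List[Dict[str, Any]],
    year: Optional[int] = None,
    department: Optional[str] = None,
    min_confidence: float = 0.0,
) -> List[Dict[str, Any]]:
    """Keep only the chunks whose metadata satisfies every given constraint."""
    selected = []
    for chunk in chunks:
        if year is not None and chunk.get("year") != year:
            continue
        if department is not None and chunk.get("department") != department:
            continue
        if chunk.get("confidence", 0.0) < min_confidence:
            continue
        selected.append(chunk)
    return selected


# "2024 reports only", "Legal department", "high confidence" combined.
matches = filter_chunks(CHUNKS, year=2024, department="Legal", min_confidence=0.9)
print([c["chunk_id"] for c in matches])  # ['c1']
```

Omitting a constraint (leaving it as `None`) disables it, so the same function covers both a filtered and an unfiltered search.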
Standard Metadata Fields
Every chunk in your vector database should have a set of standard metadata fields.
Core Fields
- document_id: Unique identifier for the source document.
- chunk_id: Unique identifier for this specific piece of text.
- source_uri: Path or URL to the original file.
- page_number: If applicable, for easy citation.
Semantic Fields
- topic: The primary subject matter (e.g., "Finance").
- keywords: A list of tags extracted from the chunk.
- summary: A short description of the chunk content.
Operational Fields
- ingestion_timestamp: When the data was processed.
- model_version: The version of the embedding model used.
- access_control_list: User groups permitted to see this result.
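To make the three groups of fields concrete, here is a minimal sketch that collects them in a standard-library dataclass. The defaults, the `make_chunk_id` helper, and the example values (bucket path, model name, group name) are assumptions made for illustration, not part of any particular vector database's API.

```python
import hashlib
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ChunkMetadata:
    # Core fields
    document_id: str
    chunk_id: str
    source_uri: str
    page_number: Optional[int] = None  # only where the source is paginated
    # Semantic fields
    topic: Optional[str] = None
    keywords: List[str] = field(default_factory=list)
    summary: Optional[str] = None
    # Operational fields
    ingestion_timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    model_version: str = "unknown"
    access_control_list: List[str] = field(default_factory=list)


def make_chunk_id(document_id: str, chunk_index: int) -> str:
    """Hypothetical helper: derive a stable chunk ID from document ID and position."""
    digest = hashlib.sha256(f"{document_id}:{chunk_index}".encode("utf-8")).hexdigest()
    return f"{document_id}-{digest[:12]}"


meta = ChunkMetadata(
    document_id="report-2024",
    chunk_id=make_chunk_id("report-2024", 0),
    source_uri="s3://example-bucket/reports/2024.pdf",  # placeholder location
    page_number=3,
    topic="Finance",
    keywords=["revenue", "forecast"],
    model_version="embedder-v1",  # placeholder model name
    access_control_list=["finance-team"],
)

# A plain dict that can be stored next to the vector.
record = asdict(meta)
print(sorted(record))
```

Deriving `chunk_id` from the document ID and chunk position keeps IDs stable across re-ingestion, which makes it easier to overwrite stale chunks instead of duplicating them.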
Designing a Multimodal Schema
When dealing with images or audio, your schema grows:
{
  "doc_id": "vid_01",
  "modality": "video_frame",
  "timestamp": "00:04:15",
  "ocr_text": "Slide 5: Q3 Projections",
  "visual_description": "A bar chart showing revenue growth",
  "confidence_score": 0.98
}
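Because each modality carries different fields, it can help to check at ingestion time that a record has what its modality needs. The sketch below assumes a hypothetical mapping from modality to required fields (`REQUIRED_BY_MODALITY`); only the `video_frame` entry mirrors the record above, and the other entries are illustrative guesses to adapt to your own data.

```python
from typing import Any, Dict, List

# Fields every record is assumed to carry, regardless of modality.
COMMON_FIELDS = ["doc_id", "modality"]

# Assumed mapping from modality to the extra fields it must provide.
REQUIRED_BY_MODALITY: Dict[str, List[str]] = {
    "text": ["page"],
    "image": ["visual_description"],
    "video_frame": ["timestamp", "visual_description"],
    "audio": ["timestamp", "transcript"],
}


def missing_fields(record: Dict[str, Any]) -> List[str]:
    """Return the required fields that are absent or empty in a metadata record."""
    modality = record.get("modality")
    if modality not in REQUIRED_BY_MODALITY:
        raise ValueError(f"Unknown modality: {modality!r}")
    required = COMMON_FIELDS + REQUIRED_BY_MODALITY[modality]
    return [name for name in required if record.get(name) in (None, "")]


frame = {
    "doc_id": "vid_01",
    "modality": "video_frame",
    "timestamp": "00:04:15",
    "ocr_text": "Slide 5: Q3 Projections",
    "visual_description": "A bar chart showing revenue growth",
    "confidence_score": 0.98,
}

print(missing_fields(frame))  # []
print(missing_fields({"doc_id": "a1", "modality": "audio", "timestamp": "00:00:10"}))
# ['transcript']
```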
Implementation with Pydantic
In Python, define your schema using Pydantic to ensure type safety:
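from typing import List, Optional

from pydantic import BaseModel, Field


class RAGMetadata(BaseModel):
    """Metadata attached to every chunk stored in the vector database."""

    source: str  # path or URL of the original file
    page: int  # page number used for citations
    modality: str = "text"  # e.g. "text", "image", "video_frame"
    department: Optional[str] = None  # owning team, used for filtering
    tags: List[str] = Field(default_factory=list)  # free-form keywords
    is_confidential: bool = False  # flag for access-controlled content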
Metadata for Traceability
Citations are the main way users build trust in a RAG system. Your schema must include enough information to reconstruct a link or reference to the original source.
| Field | Purpose | Example |
|---|---|---|
| author | Attribution | "John Doe" |
| date_published | Freshness check | "2023-11-01" |
| version | Handling updates | "v2.1" |
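As an illustration of reconstructing a reference from these fields, the sketch below assembles a citation string from a metadata dict. The function `format_citation`, the output layout, and the example URL are assumptions; `source_uri` and `page_number` are borrowed from the core fields listed earlier.

```python
from typing import Any, Dict


def format_citation(meta: Dict[str, Any]) -> str:
    """Build a human-readable citation from whichever fields are present."""
    lead = meta.get("author") or "Unknown author"
    date_published = meta.get("date_published")
    if date_published:
        lead += f" ({date_published})"

    segments = [lead, meta.get("source_uri") or "unknown source"]
    if meta.get("version"):
        segments.append(meta["version"])
    if meta.get("page_number") is not None:
        segments.append(f"p. {meta['page_number']}")
    return ", ".join(segments)


citation = format_citation(
    {
        "author": "John Doe",
        "date_published": "2023-11-01",
        "version": "v2.1",
        "source_uri": "https://example.com/handbook.pdf",  # placeholder URL
        "page_number": 12,
    }
)
print(citation)
# John Doe (2023-11-01), https://example.com/handbook.pdf, v2.1, p. 12
```

Missing fields are skipped instead of raising an error, so partially populated metadata still yields a usable, if less specific, citation.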
Exercises
- Define a metadata schema for a RAG system that indexes technical documentation.
- What fields would you add to track the "freshness" of the information?
- How would you store access permissions (e.g., "Admin", "User") in the metadata?