Handling Versioned Data in Vector Search

In a simple RAG (Retrieval-Augmented Generation) system, we assume there is one "Truth" for every document. In production, this is rarely true. You might have:

Product V1 Documentation and Product V2 Documentation.
Amended Legal Contracts.
Drafts vs. Published versions of the same article.

If you simply index everything, your vector search will return both the old and new versions together, causing the LLM to hallucinate or provide outdated information.

In this lesson, we will explore the concepts of Semantic Versioning for vectors, the "Active Flag" pattern, and how to use metadata to ensure that users only see the version of "Truth" that applies to them.

1. The Duplicate Meaning Problem

If you have two documents that are 95% identical (but the new one fixes a critical security bug), their vectors will be extremely close in the vector space.

When a user asks "How do I secure my app?", the vector search will likely return both versions.

V1: "Use insecure method X." (Score: 0.98)
V2: "Use secure method Y." (Score: 0.97)

The LLM is now confused. It sees two conflicting commands.

2. Strategy 1: The "Active" Flag (The Standard)

The simplest and most effective way to handle versions is to use a boolean metadata flag called is_active.

How it works:

When you insert a new version of a document, you give it a unique ID (e.g., doc_123_v2).
You set is_active: true on the new version.
You find the old version (doc_123_v1) and set is_active: false.
The Query: You always include filter={"is_active": True} in your queries.

Benefit: You keep the history of the document for audits, but the "Live" AI system only sees the latest truth.

3. Strategy 2: Version-Specific Namespaces

If you have major releases (e.g., "Documentation for Software 2024" vs. "Documentation for Software 2025"), you should use Namespaces (Module 6).

Namespace: v2024
Namespace: v2025

The Query: The user chooses their software version in the UI, and your Python code directs the vector search to the corresponding namespace. This provides Physical Isolation between versions.

4. Strategy 3: Deterministic ID Overwriting

If you truly don't need the history, use the same ID (e.g., doc_123). When you upsert the new version, the old one is deleted and replaced (Module 8, Lesson 2).

Risk: If your ingestion script fails halfway through, you might lose data or have a "Version Mismatch" across your index. Overwriting is simple but "Destructive."

5. Python Example: Implementing the "Active Version" Pattern

Let's build a helper function that manages the transition between versions in Chroma.

import chromadb

client = chromadb.PersistentClient(path="./versioned_db")
collection = client.get_or_create_collection("docs")

def upsert_new_version(doc_id, content, version_num):
    # 1. De-activate all previous versions of this doc_id
    # Note: We use metadata matching to find them
    collection.update(
        where={"base_id": doc_id},
        ids=None, # Update by filter (If the DB supports it)
        metadatas={"is_active": False}
    )
    
    # 2. Add the new version
    unique_id = f"{doc_id}_v{version_num}"
    collection.add(
        documents=[content],
        ids=[unique_id],
        metadatas={
            "base_id": doc_id,
            "version": version_num,
            "is_active": True
        }
    )
    print(f"Updated {doc_id} to Version {version_num}")

# Usage
upsert_new_version("policy_001", "This is old policy", 1)
upsert_new_version("policy_001", "This is NEW policy", 2)

# 3. Query only the ACTIVE ones
results = collection.query(
    query_texts=["policy"],
    where={"is_active": True}
)

6. The "Latest" Metadata Pattern

If you need to support multiple versions but want a "Default," you can use a metadata tag called is_latest.

When a user asks a general question, you filter by is_latest: True. But if the user specifically asks "How did this policy work in 2022?", you can filter by version: 2022 to "Travel back in time" within your search engine.

Summary and Key Takeaways

Versioning is how we manage the "Age" of AI knowledge.

Avoid Duplicates: Don't let two versions of the same fact confuse your LLM.
The "Active" Flag is the industry standard for safe transitions.
Namespaces are best for major, distinct versions (e.g., year-over-year data).
Deteriministic IDs are for simplicity, but carry the risk of data loss.

In the next lesson, we wrap up Module 8 with Re-indexing Strategies, exploring what to do when your whole database needs a "Fresh Start."

Exercise: Version Control Logic

You are building a "Patient Record" system.

Doctors add new notes every day.
A note can be "Amended" by the hospital.

Should an "Amended" note overwrite the original, or should both coexist with an is_active flag? (Hint: Think about legal liability).
If you want the AI to summarize the "Last 3 versions" of a patient's history, how would you structure your metadata query?
Design a Namespace strategy for a company that has different policies for "Full-time Employees" and "Contractors."

Handling Versioned Data: Managing the Evolution of Knowledge