Collection and Namespace Design

In the previous lessons, we used a single collection called "my_first_collection." But in a production application—like a multi-department AI assistant—you might have millions of documents across dozens of topics. How do you organize them?

In Chroma (and most vector databases), the Collection is the primary unit of organization. It is the equivalent of a "Table" in SQL or a "Folder" in a file system.

In this lesson, we will explore schema design for vector data. We will discuss when to split data into separate collections versus keeping it in one collection with metadata filters, and how to name your collections for enterprise scale.

1. What is a Collection?

A collection is a logical grouping of vectors that share the same Embedding Model and the same Distance Metric.

Crucial Constraint: You cannot mix vectors of different dimensions in the same collection.

If you have 384D vectors from a local model and 1536D vectors from OpenAI, they must be stored in separate collections.
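The constraint above can be enforced in application code before anything is written. The following is a minimal sketch, not part of Chroma: `check_dimensions` and the `COLLECTION_DIMS` mapping are hypothetical names introduced here for illustration.

```python
def check_dimensions(expected_dim, embeddings):
    """Raise ValueError if any embedding does not have expected_dim components."""
    for i, vec in enumerate(embeddings):
        if len(vec) != expected_dim:
            raise ValueError(
                f"embedding {i} has {len(vec)} dimensions, expected {expected_dim}"
            )
    return len(embeddings)


# Hypothetical mapping from collection name to the dimension its model produces.
COLLECTION_DIMS = {"docs_local_384": 384, "docs_openai_1536": 1536}

local_batch = [[0.0] * 384, [0.1] * 384]
checked = check_dimensions(COLLECTION_DIMS["docs_local_384"], local_batch)
print(f"{checked} embeddings are safe to add")
```

Running a check like this before each write turns a dimension mismatch into an early, readable error in your own code.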

2. One Big Collection vs. Many Small Collections

This is the "Architect's Dilemma" of vector search.

Option A: The "Monolith" (One Big Collection)

You put every document from every department (HR, Finance, IT) into one collection. You use metadata tags (dept: "HR") to separate them.

Pros: Simple to manage; can search "everything" easily.
Cons: Every query runs against the full index (for example, 10M vectors instead of 1M), and each filtered query pays the cost of metadata filtering.

Option B: The "Sharded" (Many Small Collections)

You create a separate collection for finance_docs, hr_docs, etc.

Pros: Faster search, because each index is smaller; data is physically separated, which helps with security.
Cons: Harder to search across departments; more separate index files to manage.

graph TD
    subgraph Monolith
    A[Global Collection] --> B[HR Metadata]
    A --> C[IT Metadata]
    end
    subgraph Sharded
    D[HR Collection]
    E[IT Collection]
    end
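The main cost of the sharded design is cross-department search: you have to query each collection and combine the results yourself. The sketch below shows only the combining step, under the assumption that every collection uses the same embedding model and distance metric (otherwise the distances are not comparable). `merge_shard_results` and the sample hits are hypothetical; in practice the per-collection hits would come from each collection's query results.

```python
import heapq


def merge_shard_results(results_by_collection, k):
    """Merge per-collection hits into one global top-k, smallest distance first.

    results_by_collection maps a collection name to a list of
    (distance, doc_id) pairs.
    """
    tagged = (
        (distance, doc_id, name)
        for name, hits in results_by_collection.items()
        for distance, doc_id in hits
    )
    return heapq.nsmallest(k, tagged)


# Made-up hits for illustration only.
hits = {
    "hr_docs": [(0.12, "hr-7"), (0.40, "hr-2")],
    "finance_docs": [(0.09, "fin-3"), (0.35, "fin-9")],
}
top = merge_shard_results(hits, k=3)
for distance, doc_id, collection_name in top:
    print(f"{distance:.2f}  {doc_id}  ({collection_name})")
```

With the monolith design this step disappears, because a single query already returns a globally ranked list.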

3. Best Practices for Namespace Design

To avoid "Collection Sprawl," implement a naming convention early.

Pattern: [environment].[domain].[version]

prod.user_profiles.v1
staging.company_handbook.v2
test.temp_index_20260105
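A convention only helps if it is applied consistently, so it can be worth building names through one helper. The sketch below is an illustration, not a Chroma API. The rules in the regular expression (3 to 63 characters, lowercase letter or digit at both ends, only letters, digits, dots, dashes, and underscores in between, no consecutive dots) are an assumption based on Chroma's documented naming restrictions; check them against the version you run.

```python
import re

# Assumed rule set: 3-63 characters, lowercase alphanumeric at both ends,
# and only [a-z0-9._-] in between.
_NAME_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def build_collection_name(environment, domain, version):
    """Build '[environment].[domain].[version]' and reject invalid results."""
    name = f"{environment}.{domain}.{version}".lower()
    # An empty part produces '..', which the assumed rules do not allow.
    if not _NAME_RE.fullmatch(name) or ".." in name:
        raise ValueError(f"invalid collection name: {name!r}")
    return name


print(build_collection_name("prod", "user_profiles", "v1"))
print(build_collection_name("staging", "company_handbook", "v2"))
```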

Why Versioning in the Name?

Embedding models change. If you use OpenAI v2 embeddings today and want to switch to v3 tomorrow, you cannot "update" the existing collection, because vectors produced by different models are not comparable with each other. You must create a new collection, re-index the data, and update your code to point to the new collection name.
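One way to make the final step (pointing code at the new collection) cheap is to keep the active version in a single lookup rather than hard-coding full names. This is a minimal sketch of that idea; `ACTIVE_VERSIONS` and `resolve_collection` are hypothetical names, and in a real system the mapping would more likely live in configuration than in code.

```python
# Hypothetical lookup: logical domain -> version currently served.
ACTIVE_VERSIONS = {"company_handbook": "v1"}


def resolve_collection(environment, domain, versions):
    """Return the full collection name for the currently active version."""
    return f"{environment}.{domain}.{versions[domain]}"


before = resolve_collection("prod", "company_handbook", ACTIVE_VERSIONS)

# After the v2 collection has been fully re-indexed, cut over in one place.
ACTIVE_VERSIONS["company_handbook"] = "v2"
after = resolve_collection("prod", "company_handbook", ACTIVE_VERSIONS)

print(before, "->", after)
```

Because the old collection still exists under its own name, rolling back is a matter of changing the lookup value again.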

4. Multi-tenancy Patterns

If you are a SaaS provider with 1,000 corporate clients, how do you store their vectors?

Category 1: One Collection per Tenant (e.g., client_acme, client_globex).
- Best for: Strict security; if clients have different embedding requirements.
Category 2: Shared Collection with Metadata Filter (WHERE org_id = 'acme').
- Best for: Scaling to thousands of small clients; significantly lower infrastructure cost (only one HNSW index).
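The two categories can also be combined: most tenants share one collection, while a few with strict requirements get their own. The sketch below shows one possible routing function. `tenant_target`, the `tenants_shared` collection name, and the `org_id` metadata key are assumptions made for illustration.

```python
def tenant_target(org_id, isolated_orgs, shared_collection="tenants_shared"):
    """Return (collection_name, where_filter) for a tenant.

    Tenants listed in isolated_orgs get a dedicated collection and no filter.
    All other tenants use the shared collection, restricted by org_id.
    """
    if org_id in isolated_orgs:
        return f"client_{org_id}", None
    return shared_collection, {"org_id": org_id}


isolated = {"acme"}
print(tenant_target("acme", isolated))
print(tenant_target("globex", isolated))
```

The returned filter has the same shape as a `where` argument, so the calling code can pass it through to the query unchanged.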

5. Python Example: Dynamic Collection Management

Here is how you can build a wrapper to manage multiple "Project-based" collections dynamically in Chroma.

import chromadb

class VectorOrganizer:
    def __init__(self, path):
        self.client = chromadb.PersistentClient(path=path)
        
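    def get_project_collection(self, project_id, model_ver="v1"):
        # Pattern: proj_[project_id]_[model_ver]
        name = f"proj_{project_id}_{model_ver}"
        # Returns the existing collection, or creates it on first use
        return self.client.get_or_create_collection(name=name)

    def list_all_projects(self):
        # Some Chroma releases return plain names here instead of Collection objects
        return [getattr(c, "name", c) for c in self.client.list_collections()]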

# Usage
manager = VectorOrganizer("./enterprise_db")
marketing_docs = manager.get_project_collection("marketing")
finance_docs = manager.get_project_collection("finance")

print(f"Active Collections: {manager.list_all_projects()}")

6. Metadata as a "Sub-Namespace"

Sometimes you want a collection to behave like it's sharded, but keep it in one index. You can achieve this by using a reserved metadata key like namespace.

collection.query(
    query_texts=["quarterly report"],
    where={"namespace": "finance_dept"}
)

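This acts as a "Virtual Collection." The query still runs against the single shared index, but only documents whose namespace matches are returned. You get the result isolation of a sharded design without maintaining separate indexes, at the cost of the metadata filtering overhead noted earlier.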
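A virtual collection only holds if every query includes the namespace filter, so it helps to build the filter in one place. The sketch below is a hypothetical helper (`namespaced_where` is not a Chroma function). It relies on Chroma's `$and` operator for combining several metadata conditions in a `where` clause.

```python
def namespaced_where(namespace, extra=None):
    """Build a where filter that always includes the namespace condition."""
    clauses = [{"namespace": namespace}]
    for key, value in (extra or {}).items():
        clauses.append({key: value})
    # A single condition is passed as-is; several are combined with $and.
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}


print(namespaced_where("finance_dept"))
print(namespaced_where("finance_dept", {"year": 2024}))
```

The result can be passed as the `where` argument of `collection.query`, as in the call shown above.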

Summary and Key Takeaways

Organization is the key to maintainable AI systems.

Match your models: One collection = One embedding model dimension.
Shard by Domain: Create separate collections for radically different data types (e.g., Code vs. Documentation).
Use Versioning: Always include a version tag in your collection names to handle model upgrades.
Metadata for Tenants: Use metadata filtering to separate users within a large shared collection to save resources.

In the next lesson, we wrap up Module 5 with a Hands-on Exercise, where you will build a complete, local semantic search index from raw text files.

Exercise: Schema Planning

You are building an AI tool for a Global Law Firm. They have:

500,000 Legal Cases.
50,000 Employee Profiles.
10,000 Internal HR Memos.

Should you combine Case files and HR Memos into one collection? Why or why not?
If the Law Firm uses English and French, would you use one collection or two? (Hint: Think about the embedding model).
Propose a naming convention for their collections.
