Namespace and Collection Design

In Chroma, a collection is conceptually similar to a table in SQL or a folder on your desktop: it groups a set of embeddings together with their documents and metadata. How you organize your documents into collections affects the speed, security, and scalability of your RAG system.

One Big Collection vs. Many Small Ones

The "One Big Collection" Approach

Pros: Simplified management; cross-topic search works out-of-the-box.
Cons: Slower indexing; higher latency if you have millions of records; harder to isolate data.

The "Siloed" Approach (Many Small Collections)

Pros: Fast search (smaller search space); easier to delete or re-index specific sets of data; strong data isolation.
Cons: Cross-collection search requires multiple queries; higher administrative overhead.

Common Design Patterns

By Document Modality: Store all images in one collection and all transcripts in another, then query both and combine the results during retrieval.
By Department/Tenant: Each company or department (HR, Legal, Finance) gets its own collection, so a query for one tenant cannot return another tenant's documents.
By Time: Create one collection per period, such as per month or per quarter (e.g., docs_2023_Q4), for data whose relevance is strictly temporal.
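
The time-based pattern can be sketched as a small routing helper. This is a minimal sketch, assuming quarterly collections named after the docs_2023_Q4 example; the function names and the "docs" prefix are hypothetical and not part of Chroma.

```python
from datetime import date


def quarterly_collection_name(day: date, prefix: str = "docs") -> str:
    """Return the name of the quarterly collection that covers `day`."""
    quarter = (day.month - 1) // 3 + 1
    return f"{prefix}_{day.year}_Q{quarter}"


def collections_in_range(start: date, end: date, prefix: str = "docs") -> list[str]:
    """List the quarterly collection names covering start..end, oldest first."""
    if start > end:
        raise ValueError("start must not be after end")
    names = []
    year, quarter = start.year, (start.month - 1) // 3 + 1
    end_key = (end.year, (end.month - 1) // 3 + 1)
    # Walk quarter by quarter until we pass the quarter containing `end`.
    while (year, quarter) <= end_key:
        names.append(f"{prefix}_{year}_Q{quarter}")
        quarter += 1
        if quarter > 4:
            year, quarter = year + 1, 1
    return names
```

A query limited to a date range would then only need to touch the collections returned by collections_in_range, and an expired period can be dropped by deleting a single collection.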

Naming Conventions

Always use a consistent naming schema for collections:

<tenant_id>_<modality>_<version>
Example: acme_inc_text_v2
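
One way to keep the schema consistent is to build and parse names in a single place. The sketch below is an assumption-laden illustration: build_collection_name and parse_collection_name are hypothetical helpers, and the validation is a conservative check (3 to 63 characters, alphanumeric first and last character, no consecutive dots) rather than the exact rules of any particular Chroma version.

```python
import re

# Conservative pattern (assumed rules): alphanumeric first and last character,
# with only alphanumerics, dots, dashes, or underscores in between.
_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9._-]*[a-zA-Z0-9]$")


def build_collection_name(tenant_id: str, modality: str, version: int) -> str:
    """Compose <tenant_id>_<modality>_v<version> and sanity-check the result."""
    name = f"{tenant_id}_{modality}_v{version}".lower()
    if not 3 <= len(name) <= 63:
        raise ValueError(f"name length must be 3-63 characters: {name!r}")
    if ".." in name or not _NAME_PATTERN.match(name):
        raise ValueError(f"name contains unsupported characters: {name!r}")
    return name


def parse_collection_name(name: str) -> tuple[str, str, int]:
    """Split a name back into (tenant_id, modality, version)."""
    # rsplit keeps underscores inside the tenant id, e.g. "acme_inc".
    tenant_id, modality, version = name.rsplit("_", 2)
    if not (version.startswith("v") and version[1:].isdigit()):
        raise ValueError(f"missing version suffix: {name!r}")
    return tenant_id, modality, int(version[1:])
```

Because parsing splits from the right, the tenant id may contain underscores but the modality may not.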

Implementation Scenario

If you have a multimodal RAG system for a university, you might design it like this:

import chromadb

client = chromadb.Client()  # in-memory client, for illustration

# Create specialized collections
lectures = client.create_collection("lecture_transcripts")
slides = client.create_collection("lecture_slides")
assignments = client.create_collection("assignments")

At query time, you can decide whether to search one or all of them.
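
Searching several collections means issuing one query per collection and merging the hits. The following is a minimal sketch: search_collections is a hypothetical helper, and it assumes each object behaves like a Chroma collection (a name attribute and a query method that returns "ids", "documents", and "distances" as one inner list per query text).

```python
from typing import Any, Iterable


def search_collections(
    collections: Iterable[Any], query_text: str, k: int = 5
) -> list[dict]:
    """Query each collection and return the k closest hits overall."""
    hits = []
    for collection in collections:
        # query() returns one inner list per query text; we send one text.
        result = collection.query(query_texts=[query_text], n_results=k)
        for doc_id, document, distance in zip(
            result["ids"][0], result["documents"][0], result["distances"][0]
        ):
            hits.append(
                {
                    "collection": collection.name,
                    "id": doc_id,
                    "document": document,
                    "distance": distance,
                }
            )
    # Smaller distance means a closer match.
    hits.sort(key=lambda hit: hit["distance"])
    return hits[:k]
```

Sorting by raw distance is only meaningful when the collections share an embedding model and distance metric. If they do not (for example, slides embedded with an image model), it is safer to keep a fixed number of hits per collection instead of ranking them against each other.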

Handling Re-Indexing

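If you change your embedding model, you must rebuild your collections: vectors produced by different models are not comparable and may differ in dimensionality, so old and new embeddings cannot be mixed in one collection. By using versioned names (e.g., v1 to v2), you can perform a "Blue/Green" deployment: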

Index your data into a new v2 collection.
Test the new RAG quality.
Switch your application's production pointer to v2.
Delete the old v1 collection.
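
The last two steps can be sketched as a single promotion function. This is an illustrative sketch, not a prescribed workflow: promote_collection, the pointer dictionary (standing in for wherever your application stores the active collection name), and the passes_quality_check callback are all hypothetical. It assumes a client object with get_collection and delete_collection methods, as Chroma's client provides.

```python
from typing import Any, Callable


def promote_collection(
    client: Any,
    pointer: dict,
    new_name: str,
    passes_quality_check: Callable[[Any], bool],
) -> bool:
    """Switch pointer["active"] to new_name if it passes, then drop the old one."""
    candidate = client.get_collection(new_name)
    if not passes_quality_check(candidate):
        return False  # keep serving from the current collection
    old_name = pointer.get("active")
    pointer["active"] = new_name  # switch the production pointer
    if old_name and old_name != new_name:
        client.delete_collection(old_name)  # remove the old version
    return True
```

In practice you may prefer to delay the delete until the new version has served production traffic for a while, since the old collection is your rollback path.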

Exercises

Why might searching two small collections be faster than searching one large one?
How would you design a collection strategy for a news app that adds 5,000 articles every day?
What are the security risks of putting "Public" and "Sensitive" data in the same collection?