
Batch vs Interactive Workloads
Optimize your infrastructure for real-time user chat vs large-scale automated data processing.
Not all RAG systems are chatbots. Depending on your use case, you may need to process data in real time, in massive background batches, or both.
Interactive Workloads (Real-Time)
- Goal: Low Latency.
- Example: A customer support agent asking a question.
- Reqs: Direct API access to Chroma, streaming LLM outputs.
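The interactive path can be sketched in plain Python. The `fake_llm_stream` generator below is a hypothetical stand-in for a real streaming LLM call (and the retrieval step is collapsed into a `context` argument); the point is that tokens reach the user as they are produced instead of after the full answer is ready.

```python
from typing import Iterator

def fake_llm_stream(prompt: str) -> Iterator[str]:
    # Hypothetical stand-in for a streaming LLM API: yields tokens
    # one at a time instead of blocking until generation finishes.
    for token in ["Low", " latency", " matters", "."]:
        yield token

def answer_query(question: str, context: str) -> Iterator[str]:
    # Interactive path: build a prompt from retrieved context,
    # then stream the answer so the user sees output immediately.
    prompt = f"Context: {context}\n\nQuestion: {question}"
    yield from fake_llm_stream(prompt)

# The caller renders tokens as they arrive; joining them here
# just demonstrates the full answer.
answer = "".join(answer_query("Why stream?", "Streaming cuts perceived latency."))
```

In production the context would come from a Chroma query, but the streaming structure stays the same.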
Batch Workloads (Background)
- Goal: High Throughput & Cost Efficiency.
- Example: Analyzing 10,000 past support tickets once a week to find common issues.
- Reqs: Worker queues (like Celery or RabbitMQ), separate database instances to avoid slowing down production.
Implementation: The Task Queue
For RAG, use a task queue to handle migrations or large-scale re-indexing.
```python
from celery import Celery

# Assumes a Redis broker; swap in RabbitMQ or another broker as needed.
app = Celery("rag_tasks", broker="redis://localhost:6379/0")

@app.task
def ingest_massive_folder(folder_path):
    # Process 10,000 files in the background,
    # then swap in the production index only when finished.
    ...
```
Resource Isolation
Never run a massive batch ingestion (which can saturate CPU/GPU) on the same machine that is serving interactive user queries. Use separate database replicas or cloud clusters so a background job can never degrade production latency.
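One lightweight way to enforce this separation is to route clients to different endpoints by workload type. The hostnames below are hypothetical placeholders, a minimal sketch rather than a recommended topology:

```python
# Hypothetical endpoints: the batch pipeline writes to a staging
# instance, while interactive queries hit a read replica.
ENDPOINTS = {
    "interactive": "http://chroma-replica.internal:8000",  # serves user queries
    "batch": "http://chroma-staging.internal:8000",        # absorbs heavy ingestion
}

def endpoint_for(workload: str) -> str:
    # Fail loudly rather than silently sending batch traffic
    # to the production replica.
    if workload not in ENDPOINTS:
        raise ValueError(f"unknown workload: {workload}")
    return ENDPOINTS[workload]
```

With this in place, a misconfigured worker raises immediately instead of quietly overloading the serving instance.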
| Metric | Interactive | Batch |
|---|---|---|
| Latency | Critical (sub-second) | Tolerant (minutes to hours) |
| Cost | On-demand, pay-per-use | Spot instances (cheapest) |
| Scaling | Spiky, user-driven | Steady, predictable load |
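To see why spot instances win for batch work even though they can be interrupted, here is a back-of-the-envelope calculation. The hourly prices are illustrative assumptions, not real quotes:

```python
# Illustrative (made-up) hourly prices for the same GPU instance class.
ON_DEMAND_PRICE = 3.00   # $/hour, guaranteed capacity
SPOT_PRICE = 0.90        # $/hour, can be reclaimed by the provider

def batch_job_cost(hours: float, price: float, retry_overhead: float = 0.0) -> float:
    # Spot interruptions force some re-work; model it as a
    # fractional overhead on total compute hours.
    return hours * (1 + retry_overhead) * price

on_demand = batch_job_cost(100, ON_DEMAND_PRICE)
spot = batch_job_cost(100, SPOT_PRICE, retry_overhead=0.15)  # assume 15% redone work
# Even after paying for retries, spot comes out roughly 3x cheaper
# for interruption-tolerant jobs like weekly re-indexing.
```

An interactive service could not accept this trade: an interruption mid-request is a failed user query, not a cheap retry.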
Exercises
- Why should you use "Spot Instances" for batch RAG ingestion?
- What is a "Message Queue," and how does it help with system stability?
- Design a system that handles 1,000 users chatting and a background job transcribing 500 hours of video.