
Batch vs Interactive Workloads
Optimize your infrastructure for real-time user chat vs large-scale automated data processing.
Not all RAG systems are chatbots. Depending on your use case, you may need to process data in real time, in massive background batches, or both.
Interactive Workloads (Real-Time)
- Goal: Low Latency.
- Example: A customer support agent asking a question.
- Reqs: Direct API access to Chroma, streaming LLM outputs.
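The interactive path can be sketched in plain Python. The `fake_llm_stream` generator below is a hypothetical stand-in for a real streaming LLM call (and the retrieval step is collapsed into a `context` argument); the point is that tokens reach the user as they are produced instead of after the full answer is ready.

```python
from typing import Iterator

def fake_llm_stream(prompt: str) -> Iterator[str]:
    # Hypothetical stand-in for a streaming LLM API: yields tokens
    # one at a time instead of blocking until generation finishes.
    for token in ["Low", " latency", " matters", "."]:
        yield token

def answer_query(question: str, context: str) -> Iterator[str]:
    # Interactive path: build a prompt from retrieved context,
    # then stream the answer so the user sees output immediately.
    prompt = f"Context: {context}\n\nQuestion: {question}"
    yield from fake_llm_stream(prompt)

# The caller renders tokens as they arrive; joining them here
# just demonstrates the full answer.
answer = "".join(answer_query("Why stream?", "Streaming cuts perceived latency."))
```

In production the context would come from a Chroma query, but the streaming structure stays the same.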
Batch Workloads (Background)
- Goal: High Throughput & Cost Efficiency.
- Example: Analyzing 10,000 past support tickets once a week to find common issues.
- Reqs: Worker queues (like Celery or RabbitMQ), separate database instances to avoid slowing down production.
Implementation: The Task Queue
For RAG, use a task queue to handle migrations or large-scale re-indexing.
```python
from celery import Celery

# Assumes a Redis broker; swap in RabbitMQ or another broker as needed.
app = Celery("rag_tasks", broker="redis://localhost:6379/0")

@app.task
def ingest_massive_folder(folder_path):
    # Process 10,000 files in the background,
    # then swap in the production index only when finished.
    ...
```
Resource Isolation
Never run a massive batch ingestion (which can saturate CPU/GPU) on the same machine that is serving interactive user queries. Use separate database replicas or cloud clusters so a background job can never degrade production latency.
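One lightweight way to enforce this separation is to route clients to different endpoints by workload type. The hostnames below are hypothetical placeholders, a minimal sketch rather than a recommended topology:

```python
# Hypothetical endpoints: the batch pipeline writes to a staging
# instance, while interactive queries hit a read replica.
ENDPOINTS = {
    "interactive": "http://chroma-replica.internal:8000",  # serves user queries
    "batch": "http://chroma-staging.internal:8000",        # absorbs heavy ingestion
}

def endpoint_for(workload: str) -> str:
    # Fail loudly rather than silently sending batch traffic
    # to the production replica.
    if workload not in ENDPOINTS:
        raise ValueError(f"unknown workload: {workload}")
    return ENDPOINTS[workload]
```

With this in place, a misconfigured worker raises immediately instead of quietly overloading the serving instance.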
| Metric | Interactive | Batch |
|---|---|---|
| Latency | Critical (sub-second) | Tolerant (minutes to hours) |
| Cost | On-demand, pay-per-use | Spot instances (cheapest) |
| Scaling | Spiky, user-driven | Steady, predictable load |
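To see why spot instances win for batch work even though they can be interrupted, here is a back-of-the-envelope calculation. The hourly prices are illustrative assumptions, not real quotes:

```python
# Illustrative (made-up) hourly prices for the same GPU instance class.
ON_DEMAND_PRICE = 3.00   # $/hour, guaranteed capacity
SPOT_PRICE = 0.90        # $/hour, can be reclaimed by the provider

def batch_job_cost(hours: float, price: float, retry_overhead: float = 0.0) -> float:
    # Spot interruptions force some re-work; model it as a
    # fractional overhead on total compute hours.
    return hours * (1 + retry_overhead) * price

on_demand = batch_job_cost(100, ON_DEMAND_PRICE)
spot = batch_job_cost(100, SPOT_PRICE, retry_overhead=0.15)  # assume 15% redone work
# Even after paying for retries, spot comes out roughly 3x cheaper
# for interruption-tolerant jobs like weekly re-indexing.
```

An interactive service could not accept this trade: an interruption mid-request is a failed user query, not a cheap retry.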
Exercises
- Why should you use "Spot Instances" for batch RAG ingestion?
- What is a "Message Queue," and how does it help with system stability?
- Design a system that handles 1,000 users chatting and a background job transcribing 500 hours of video.