
Batch vs Streaming Ingestion
Compare batch and streaming ingestion patterns for RAG systems and learn when to use each.
Choose the right ingestion pattern based on your data update frequency and latency requirements.
Batch Ingestion
Process large volumes of data at scheduled intervals.
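# Batch ingestion example
# get_new_files_since_last_run, process_file, embed_and_store and
# mark_ingestion_complete are placeholders for your own pipeline.
def batch_ingest_daily():
    files = get_new_files_since_last_run()
    for file in files:
        content = process_file(file)
        embed_and_store(content)
    # Record the checkpoint only after every file has been stored
    mark_ingestion_complete()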
When to use:
- Initial system setup
- Periodic updates (daily, weekly)
- Large historical datasets
- Cost optimization (off-peak processing)
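The batch example above depends on knowing which files are new since the last run. One way to track that is a small checkpoint file, sketched below under some assumptions: file modification times are a reliable signal of change, the checkpoint lives outside the data directory, and `ingest` is a hypothetical callback standing in for the parse, embed and store steps. The names `load_checkpoint`, `get_new_files_since` and `run_batch` are illustrative, not part of any library.

```python
import json
import time
from pathlib import Path


def load_checkpoint(state_path: Path) -> float:
    """Return the timestamp of the last successful run, or 0.0 on first run."""
    if state_path.exists():
        return json.loads(state_path.read_text())["last_run"]
    return 0.0


def get_new_files_since(data_dir: Path, last_run: float) -> list[Path]:
    """List files under data_dir modified after the last successful run."""
    return sorted(
        p for p in data_dir.rglob("*")
        if p.is_file() and p.stat().st_mtime > last_run
    )


def run_batch(data_dir: Path, state_path: Path, ingest) -> int:
    """Ingest new files, then advance the checkpoint. Returns files processed."""
    # Captured before scanning so files that arrive mid-run are picked up next time
    started = time.time()
    files = get_new_files_since(data_dir, load_checkpoint(state_path))
    for path in files:
        ingest(path)  # hypothetical callback: parse, embed, store
    # Only reached if every file succeeded, so a failed run is retried in full
    state_path.write_text(json.dumps({"last_run": started}))
    return len(files)
```

Because the checkpoint advances only after the loop finishes, a crash partway through means the next run reprocesses the same files, so the store step should tolerate repeats.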
Streaming Ingestion
Process data in real time, as it arrives.
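# Streaming ingestion example (requires the watchdog package)
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class DocumentHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            process_and_index(event.src_path)  # placeholder for your pipeline

observer = Observer()
observer.schedule(DocumentHandler(), path='/data', recursive=True)
observer.start()
try:
    # Keep the main thread alive so the observer thread keeps running
    while observer.is_alive():
        observer.join(timeout=1)
finally:
    observer.stop()
    observer.join()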
When to use:
- Real-time requirements (<1 min latency)
- Continuous data feeds
- Event-driven systems
- Live document collaboration
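With continuous feeds, embedding and indexing can be slower than the rate at which events arrive. A common way to handle this is to have the event callback only enqueue work and let a background worker do the slow part. The sketch below uses only the standard library; `start_worker`, `handle` and the `None` stop token are hypothetical names chosen for illustration.

```python
import queue
import threading


def start_worker(events: queue.Queue, handle, stop_token=None) -> threading.Thread:
    """Start a background thread that indexes queued items one at a time."""

    def loop():
        while True:
            item = events.get()  # blocks until an event is available
            if item is stop_token:
                break
            try:
                handle(item)  # hypothetical callback: parse, embed, store
            except Exception as exc:
                # One bad document should not stop the stream
                print(f"failed to index {item}: {exc}")

    worker = threading.Thread(target=loop, daemon=True)
    worker.start()
    return worker
```

In the `DocumentHandler` above, `on_created` would then call `events.put(event.src_path)` instead of indexing inline, which keeps the callback fast.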
Comparison
| Aspect | Batch | Streaming |
|---|---|---|
| Latency | Hours to days | Seconds to minutes |
| Complexity | Simple | Complex |
| Cost | Lower | Higher |
| Consistency | Easier | Harder |
| Use case | Periodic updates | Real-time needs |
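As a rough illustration of how the table translates into a decision, the sketch below encodes it as a small helper. The 60-second cutoff mirrors the "<1 min latency" guideline for streaming; the `has_backlog` parameter and the `choose_mode` name are assumptions introduced for this example, and real systems will weigh cost and complexity as well.

```python
from enum import Enum


class Mode(Enum):
    BATCH = "batch"
    STREAMING = "streaming"
    HYBRID = "hybrid"


def choose_mode(max_staleness_seconds: float, has_backlog: bool = False) -> Mode:
    """Pick an ingestion pattern from the freshness requirement.

    max_staleness_seconds: how old indexed content is allowed to be.
    has_backlog: whether a large historical dataset still needs loading.
    """
    needs_realtime = max_staleness_seconds < 60  # assumed cutoff: under 1 minute
    if needs_realtime and has_backlog:
        return Mode.HYBRID
    if needs_realtime:
        return Mode.STREAMING
    return Mode.BATCH
```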
Hybrid Approach
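# Combine both patterns (requires the schedule package)
import time
import schedule
def hybrid_ingestion():
    # Streaming for recent docs (stream_watcher is a placeholder)
    stream_watcher.start()
    # Batch for historical docs, run nightly at 02:00
    schedule.every().day.at("02:00").do(batch_process_old_docs)
    # schedule only runs jobs when polled, so loop and check periodically
    while True:
        schedule.run_pending()
        time.sleep(60)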
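When both paths run, the same document can reach the index twice: once from the stream and again from the nightly batch. One way to make that harmless is an idempotent upsert keyed by a content hash, sketched below. `DedupIndex` is a hypothetical in-memory stand-in; a real system would keep the hashes alongside the vectors in its store, and `embed_and_store` is an assumed callback.

```python
import hashlib


class DedupIndex:
    """Skips re-embedding documents whose content has not changed."""

    def __init__(self):
        self._hashes = {}  # doc_id -> SHA-256 of the last indexed content

    def upsert(self, doc_id: str, text: str, embed_and_store) -> bool:
        """Index the document if new or changed. Returns True if work was done."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self._hashes.get(doc_id) == digest:
            return False  # already indexed with identical content
        embed_and_store(doc_id, text)  # hypothetical callback
        self._hashes[doc_id] = digest
        return True
```

Both the streaming handler and the batch job would call `upsert`, so whichever path sees a document first does the embedding and the other becomes a no-op.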
Next: File system ingestion.