Batch vs Streaming Ingestion

Compare batch and streaming ingestion patterns for RAG systems and learn when to use each.

Choose the right ingestion pattern based on your data update frequency and latency requirements.

Batch Ingestion

Process large volumes of data at scheduled intervals.

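# Batch ingestion example
# The helper functions are placeholders for your own pipeline.
def batch_ingest_daily():
    # Only pick up files added or changed since the previous run
    files = get_new_files_since_last_run()

    for file in files:
        content = process_file(file)   # e.g. parse and chunk the document
        embed_and_store(content)       # embed and write to the vector store

    # Record the checkpoint only after every file has been processed
    mark_ingestion_complete()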

When to use:

  • Initial system setup
  • Periodic updates (daily, weekly)
  • Large historical datasets
  • Cost optimization (off-peak processing)
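
The batch example leaves `get_new_files_since_last_run()` and `mark_ingestion_complete()` undefined. One possible way to fill them in is a modification-time watermark kept in a small state file. The sketch below is an illustration under that assumption, not the only approach; the names `load_checkpoint`, `get_new_files`, `batch_ingest`, and the `ingest` callback are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable


def load_checkpoint(state_file: Path) -> float:
    """Return the mtime watermark of the last successful run (0.0 if none)."""
    if state_file.exists():
        return json.loads(state_file.read_text())["last_mtime"]
    return 0.0


def get_new_files(data_dir: Path, since: float) -> list[Path]:
    """Return files modified strictly after the watermark, oldest first."""
    files = [
        p for p in data_dir.rglob("*")
        if p.is_file() and p.stat().st_mtime > since
    ]
    return sorted(files, key=lambda p: p.stat().st_mtime)


def batch_ingest(data_dir: Path, state_file: Path,
                 ingest: Callable[[Path], None]) -> int:
    """Ingest files changed since the last run; return how many were handled."""
    since = load_checkpoint(state_file)
    new_files = get_new_files(data_dir, since)

    watermark = since
    for path in new_files:
        ingest(path)  # hypothetical callback: parse, embed, store
        watermark = max(watermark, path.stat().st_mtime)

    # Written only after the loop finishes, so a crash mid-run causes the
    # whole batch to be retried on the next run rather than silently skipped.
    state_file.write_text(json.dumps({"last_mtime": watermark}))
    return len(new_files)
```

Because a failed run is retried from the old watermark, `ingest` should be safe to call twice on the same file. Keep the state file outside `data_dir` so it is not picked up as a document.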

Streaming Ingestion

Process each document in real time, as soon as it arrives.

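# Streaming ingestion example
import time

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class DocumentHandler(FileSystemEventHandler):
    def on_created(self, event):
        # Fires when a file appears; the file may still be being written
        if not event.is_directory:
            process_and_index(event.src_path)

observer = Observer()
observer.schedule(DocumentHandler(), path='/data', recursive=True)
observer.start()

# Keep the main thread alive; the observer runs in a background thread
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()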

When to use:

  • Real-time requirements (<1 min latency)
  • Continuous data feeds
  • Event-driven systems
  • Live document collaboration
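
In a streaming setup, embedding is usually much slower than event delivery, so it can help to have the event handler only enqueue paths and let a separate worker index them in small batches. The following is a minimal sketch of that idea using only the standard library; `drain_batch`, `run_worker`, and the `index_batch` callback are hypothetical names, and the batch size and timeout are arbitrary.

```python
import queue
import threading
from typing import Callable


def drain_batch(events: queue.Queue, max_items: int, timeout: float) -> list:
    """Wait up to `timeout` seconds for one item, then take whatever else
    is immediately available, up to `max_items` in total."""
    batch = []
    try:
        batch.append(events.get(timeout=timeout))
        while len(batch) < max_items:
            batch.append(events.get_nowait())
    except queue.Empty:
        pass
    return batch


def run_worker(events: queue.Queue,
               index_batch: Callable[[list], None],
               stop: threading.Event,
               max_items: int = 32,
               timeout: float = 0.5) -> None:
    """Index queued paths in micro-batches until `stop` is set and the
    queue has been drained."""
    while not stop.is_set() or not events.empty():
        batch = drain_batch(events, max_items, timeout)
        if batch:
            # Drop repeated paths within the batch, keeping first-seen order
            index_batch(list(dict.fromkeys(batch)))
```

With this in place, a handler's `on_created` would call `events.put(event.src_path)` instead of indexing directly, and `run_worker` would run in its own `threading.Thread`.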

Comparison

Aspect        Batch             Streaming
Latency       Hours to days     Seconds to minutes
Complexity    Simple            Complex
Cost          Lower             Higher
Consistency   Easier            Harder
Use case      Periodic updates  Real-time needs

Hybrid Approach

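# Combine both patterns
import time
import schedule  # third-party: pip install schedule

def hybrid_ingestion():
    # Streaming for recent docs (stream_watcher: e.g. a watchdog Observer)
    stream_watcher.start()

    # Batch for historical docs, every night at 02:00 local time
    schedule.every().day.at("02:00").do(batch_process_old_docs)

    while True:  # schedule only runs jobs when polled
        schedule.run_pending()
        time.sleep(60)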
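
When both paths run against the same store, the same document can arrive twice: once from the watcher and again from the nightly batch. One way to keep that harmless is to make indexing idempotent by remembering a content hash per document. The sketch below assumes an in-memory dictionary as a stand-in for the real store; `IdempotentIndex` and `_embed_and_store` are hypothetical names, not part of any library.

```python
import hashlib
import threading


class IdempotentIndex:
    """Skips re-embedding when a document's content has not changed."""

    def __init__(self) -> None:
        self._hashes: dict[str, str] = {}  # doc_id -> SHA-256 of content
        self._lock = threading.Lock()      # streaming and batch may overlap
        self.embed_calls = 0               # counter for illustration only

    def upsert(self, doc_id: str, content: str) -> bool:
        """Index the document; return False if it was already up to date."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        with self._lock:
            if self._hashes.get(doc_id) == digest:
                return False
            self._embed_and_store(doc_id, content)
            self._hashes[doc_id] = digest
            return True

    def _embed_and_store(self, doc_id: str, content: str) -> None:
        # Placeholder: a real pipeline would embed `content` and write the
        # vectors to the store under `doc_id`.
        self.embed_calls += 1


index = IdempotentIndex()
index.upsert("report.txt", "first draft")   # from the streaming path
index.upsert("report.txt", "first draft")   # same content from the batch path
```

The second call returns `False` and does no embedding work, so overlapping runs cost little beyond hashing the content.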

Next: File system ingestion.
