
The Online vs Offline Pipeline: Split Intelligence
Build for performance. Learn how to architect your system into two distinct halves: the Offline 'Construction' pipeline and the Online 'Retrieval' engine.
A Graph RAG system is a factory: one half builds the product, the other half ships it. If you try to build your graph at the same moment a user is asking a question, your system will be too slow to use. Professional architectures use the "Split Intelligence" model: an Offline Pipeline to build the knowledge and an Online Engine to serve it.
In this lesson, we will look at the System Architecture of a production Graph RAG application. We will learn how to design the Asynchronous Ingestion Worker (the Offline side) and the Low-Latency Retrieval API (the Online side). We will see how these two halves communicate and why this separation is the secret to scaling to millions of documents.
1. The Offline Pipeline: Building the World
The objective: Move data from S3, Slack, or SQL into the Knowledge Graph.
- Workers: A fleet of Python containers (using Celery/RabbitMQ) that process documents 24/7 (see the config sketch after this list).
- Steps: Text Extraction -> AI Entity Mapping -> Graph Writing.
- Performance: Throughput matters more than latency. It's okay if it takes 1 minute to "Learn" a document.
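To make the worker fleet concrete, here is a minimal sketch of the Celery setup, assuming a local RabbitMQ broker. The app name, broker URL, and tuning values are illustrative placeholders, not a prescribed configuration:

```python
# offline_worker.py: a sketch of the ingestion fleet's Celery app.
from celery import Celery

# Assumption: RabbitMQ runs locally; point this at your real broker.
app = Celery("ingestion", broker="amqp://guest@localhost//")

# Tune for throughput, not latency: acknowledge a task only after it
# succeeds, and let each worker prefetch a few documents at a time.
app.conf.task_acks_late = True
app.conf.worker_prefetch_multiplier = 4
```

Because the broker queues the work, you can add or remove worker containers freely; the backlog simply drains faster or slower.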
2. The Online Engine: Serving the Question
The objective: Return a verified answer in < 3 seconds.
- API: A FastAPI or Go server that receives the request.
- Retrieval: Performs the Vector search + Cypher traversal (fast; see the query sketch after this list).
- Synthesis: Performs the LLM reasoning (The bottleneck).
- Performance: Latency is king. Every millisecond of the graph query counts.
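To ground the retrieval step, here is a sketch using the official neo4j Python driver and Neo4j 5.x's `db.index.vector.queryNodes` procedure. The index name `chunk_embeddings`, the `Chunk`/`Entity` labels, and the `MENTIONS` relationship are assumptions about your schema:

```python
# Hypothetical online retrieval: one round trip covers both the vector
# lookup and a 1-hop traversal. Every millisecond counts on this path.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

CONTEXT_QUERY = """
CALL db.index.vector.queryNodes('chunk_embeddings', 5, $embedding)
YIELD node AS chunk, score
MATCH (chunk)-[:MENTIONS]->(e:Entity)-[r]-(n)
RETURN chunk.text AS text, e.name AS entity,
       type(r) AS rel, n.name AS neighbor, score
ORDER BY score DESC LIMIT 25
"""

def fetch_context(embedding: list[float]) -> list[dict]:
    with driver.session() as session:
        return [record.data() for record in session.run(CONTEXT_QUERY, embedding=embedding)]
```

Packing the vector lookup and the traversal into a single query avoids a second network hop, which is exactly the kind of saving the online path lives on.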
3. The "Sync" Bridge
How does the Online engine know when the Offline pipeline has added new facts?
- Direct Notification: The worker pings the API (Good for real-time).
- Shared State: They both look at the same Neo4j cluster (Standard).
- Indexing: The Vector index must be updated after the graph nodes are created, or the new facts will never be discoverable (sketched below).
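Here is a sketch of that indexing step under the "Shared State" pattern, assuming a Neo4j 5.x native vector index created once at setup time. The `Chunk` label, the embedding property, and the `embed()` helper are hypothetical:

```python
# Hypothetical sync step: write the embedding onto the new node so the
# pre-created vector index picks it up and the Online API can find it.
def index_new_chunk(session, chunk_id: str, text: str):
    embedding = embed(text)  # placeholder for your embedding model call
    session.run(
        """
        MATCH (c:Chunk {id: $id})
        CALL db.create.setNodeVectorProperty(c, 'embedding', $embedding)
        """,
        id=chunk_id,
        embedding=embedding,
    )
```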
```mermaid
graph LR
    D[Raw Docs] -->|Trigger| W[Offline Workers]
    W -->|Ingest| KG[(Knowledge Graph)]
    U[User] -->|Question| API[Online API]
    API -->|Fetch| KG
    KG -->|Context| API
    API -->|Answer| U
    style W fill:#f4b400,color:#fff
    style API fill:#34A853,color:#fff
```
4. Implementation: A Basic Pipeline Orchestrator (Pseudo-Code)
The two halves live in separate services, so they cannot share one `app` object. The helper functions (fetch_text, extract_triplets_with_llm, and friends) are placeholders for the steps described above.

```python
# worker.py: OFFLINE WORKER (Celery). Heavy, slow, throughput-oriented.
from celery import Celery

app = Celery("ingestion", broker="amqp://guest@localhost//")

@app.task
def process_new_document(doc_id: str):
    content = fetch_text(doc_id)                   # pull raw text (S3, Slack, SQL)
    triplets = extract_triplets_with_llm(content)  # slow AI step: entity mapping
    write_to_neo4j(triplets)                       # persist nodes and edges
    reindex_vector_store()                         # make new nodes discoverable
```

```python
# api.py: ONLINE API (FastAPI). No heavy lifting, latency-oriented.
from fastapi import FastAPI

api = FastAPI()

@api.get("/ask")
async def ask_question(q: str):
    node_id = find_start_node(q)             # vector search for an entry point
    context = run_cypher_traversal(node_id)  # fast graph hop
    return synthesize_answer(context, q)     # LLM reasoning: the bottleneck
```
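Wiring the halves together is just a matter of enqueuing work. A hypothetical trigger (an S3 event handler, a Slack webhook, a nightly cron) needs a single line:

```python
# Fire-and-forget: the caller returns instantly; a free worker picks it up.
process_new_document.delay("doc-123")
```

The Online API never makes this call and never waits on it; it only ever reads the graph.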
5. Summary and Exercises
Separation of Concerns is the foundation of Reliability.
- Offline workers handle the heavy, slow task of AI extraction.
- Online APIs focus on high-speed retrieval and synthesis.
- Parallelism: You can scale workers to process 10,000 docs per hour without slowing down a single user query.
- Stability: If the Ingestion worker crashes, your users can still query the existing knowledge.
Exercises
- Architecture Choice: If you have a "Breaking News" graph, should you use a "Batch" offline pipeline (runs every hour) or a "Stream" offline pipeline (runs every 10 seconds)?
- The "Scale" Strategy: If your API is slow, should you add more Workers or more API instances?
- Visualization: Draw a box for "Offline" and "Online." Draw an arrow representing the "Flow of Knowledge" between them.
In the next lesson, we will look at memory tiers: Cold/Warm/Hot Graph Architectures.