LLM Pipelines for Fact Extraction: The Intelligence Engine

Harness the power of Large Language Models for graph construction. Learn how to design robust prompts and recursive pipelines that extract multi-layered facts, properties, and hidden relationships.

While classic NLP (Lesson 1) is great for structure, it lacks Wisdom. It can't distinguish between a "Sarcastic" relationship and a "Factual" one. It can't infer that "Sudeep" and "The main developer" are the same person based on a complex narrative arc. For this, we need the LLM Pipeline.

In this lesson, we will move past simple search and into Informational Mining. We will learn how to build recursive LLM pipelines that read a document multiple times—once for entities, once for relationships, and once for metadata. We will explore the "Step-by-Step" prompting strategy that ensures high-fidelity graph construction.


1. The Multi-Pass Extraction Strategy

Giving an LLM a 20-page document and saying "Generate the graph" is a recipe for failure. The model will hallucinate, skip details, and lose the schema.

The 3-Pass Method:

  1. Entity Pass: "Extract all unique PERSON, ORG, and PROJECT entities."
  2. Relationship Pass: "Given the list of entities from Pass 1, find all logical connections between them."
  3. Attribute Pass: "For each entity and relationship, find the associated metadata (dates, budgets, descriptions)."

By breaking the task down, you let the LLM focus its attention, and its limited context window, on one specific job at a time.


2. Structured Output: The JSON/XML Constraint

If an LLM returns a conversational answer, your code can't use it. You must enforce a strict schema.

  • Tools: Use "JSON Mode" (OpenAI) or "Constrained Generation" (Guidance, Outlines).
  • The Schema Definition: You must provide the exact expected JSON keys in your system prompt.
{
  "entity": "Project Valkyrie",
  "type": "Project",
  "relationships": [
    {"target": "Sudeep", "type": "LEADS", "confidence": 0.98}
  ]
}
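Even with JSON mode enabled, it pays to validate the reply before it touches your database. A minimal validator for the schema above (the key names come from this lesson's example; the checks themselves are illustrative):

```python
import json

REQUIRED_KEYS = {"entity", "type", "relationships"}

def parse_extraction(raw: str) -> dict:
    """Parse an LLM reply and sanity-check it against the expected schema."""
    data = json.loads(raw)  # raises ValueError if the reply isn't JSON at all
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"LLM reply missing keys: {missing}")
    for rel in data["relationships"]:
        # Every relationship must name a target and a type; confidence is
        # optional, but if present it must be a probability in [0, 1].
        if not {"target", "type"} <= rel.keys():
            raise ValueError(f"Malformed relationship: {rel}")
        if not 0.0 <= rel.get("confidence", 1.0) <= 1.0:
            raise ValueError(f"Bad confidence score in: {rel}")
    return data
```

Rejecting a malformed reply (and retrying the call) is far cheaper than cleaning a corrupted graph later.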

3. Handling "Hidden" Relationships (Inference)

The true power of an LLM in a graph pipeline is its ability to find the Implicit.

  • Text: "Since the new policy started, the team's velocity has decreased."
  • LLM Extraction: (Policy) -[:NEGATIVELY_IMPACTS]-> (Team Velocity)

A classic NLP model would miss this because no relationship verb like "impacts" appears in the text. The LLM infers the relationship from the narrative.

The full multi-pass pipeline, end to end:

graph TD
    Raw[Document Chunk] --> P1[Pass 1: Entity Extractor]
    P1 -->|List| P2[Pass 2: Relation Matcher]
    P2 -->|Triplets| P3[Pass 3: Metadata Enricher]
    P3 --> Final[Structured Cypher Export]
    
    style P2 fill:#4285F4,color:#fff
    style Final fill:#34A853,color:#fff

4. The "Sliding Window" Problem

What happens when a fact starts in Chunk A and ends in Chunk B?

  • Overlap: When you chunk your documents, overlap them by 15-20%.
  • Context Loading: When extracting from Chunk B, give the LLM the list of "Recently Extracted Entities" from Chunk A. This allows the model to "Reconnect" the thread across boundaries.
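A simple word-based chunker with overlap might look like this. It is a sketch: the window sizes are illustrative, and production systems typically split on tokens or sentences rather than raw words:

```python
def chunk_with_overlap(text: str, size: int = 200, overlap: int = 40) -> list:
    """Split text into word windows that share `overlap` words with their
    predecessor, so a fact spanning a boundary appears whole in at least
    one chunk. Default overlap is 20% of the window size."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the final window already reaches the end of the text
    return chunks
```

When you feed these chunks to the extractor, prepend the entity list from the previous chunk to each prompt so the model can reconnect the thread.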

5. Implementation: A Recursive Extraction Loop with Python

Let's look at how we can link two extraction steps together.

def extract_graph_from_text(text):
    """Two-pass extraction: identify entities, then connect them.

    `llm_call` is a placeholder for whatever model client you use.
    """
    # STEP 1: Identification
    entities = llm_call(f"List all people and companies in: {text}")

    # STEP 2: Connection
    # We pass the list of entities BACK into the prompt, so the model
    # reuses the exact names from Step 1 instead of inventing new ones.
    graph_data = llm_call(f"""
        Use these entities: {entities}.
        Find their connections from this text: {text}.
        Format as (A)-[B]->(C).
    """)

    return graph_data

# This "stateful" extraction prevents the LLM from making
# up new names for the same person in Step 2.

6. Summary and Exercises

LLMs are the "Smarter Researchers" in your graph factory.

  • Multi-pass processing increases accuracy.
  • Structured output (JSON) is mandatory for database ingestion.
  • Inference allows you to capture relationships that aren't explicitly stated.
  • Context memory between chunks prevents graph fragmentation.

Exercises

  1. Prompt Design: Write a strict system prompt for an LLM that extracts relationships from a medical report. Include 3 specific rules (e.g., "Do not extract family members as medical entities").
  2. Inference Challenge: Look at a weather report. What is an "Implicit" relationship between "Hurricane" and "Flight Cancellations" that an LLM would find but a keyword search would miss?
  3. The Overlap Test: If you have two chunks: "Sudeep started the..." and "...work on Monday," why is 50-token overlap better than 0-token overlap?

In the next lesson, we will look at a different kind of source: Handling Tabular and Structured Data.
