
Natural Language Processing (NLP) for Ingestion: The First Filter
Master the classic NLP techniques required for high-speed graph construction. Learn how tokenization, part-of-speech tagging, and dependency parsing act as the first line of defense for your Knowledge Graph.
To build a graph, you must first "Read" the data. But reading for a graph is different from reading for a vector. We aren't just looking for "Keywords"; we are looking for Syntax. We need to know who the Subject is, what the Verb is, and who the Object is.
In this lesson, we will revisit the classic NLP tools that power the first stage of the ingestion pipeline. We will explore Tokenization, Dependency Parsing, and Coreference Resolution. We will see why these non-LLM techniques are still essential for pre-processing large volumes of data before they ever hit an expensive Large Language Model.
1. Dependency Parsing: The Structure of a Sentence
A Knowledge Graph triplet (S) -> [P] -> (O) is essentially a Syntactic Path.
- Sentence: "The red car belongs to Sudeep."
- Dependency Map:
(car: Noun/Subject) -> [belongs: Verb/Predicate] -> (Sudeep: Noun/Object)
By using an NLP library (like spaCy or Stanza), you can automatically identify these linguistic "Triplets" without an LLM. This is significantly cheaper and faster for the first pass of an ingestion pipeline.
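As a minimal sketch of that first pass (assuming the small en_core_web_sm model is installed via python -m spacy download en_core_web_sm; extract_triplets is an illustrative helper, not a production extractor):

import spacy

nlp = spacy.load("en_core_web_sm")

def extract_triplets(text):
    doc = nlp(text)
    triplets = []
    for token in doc:
        if token.dep_ == "nsubj":      # token is the grammatical subject
            verb = token.head          # its head is the governing verb
            # Naively scan the verb's subtree for a direct or prepositional object
            for child in verb.subtree:
                if child.dep_ in ("dobj", "pobj"):
                    triplets.append((token.text, verb.lemma_, child.text))
    return triplets

print(extract_triplets("The red car belongs to Sudeep."))
# Likely output (model-dependent): [('car', 'belong', 'Sudeep')]

Walking the verb's subtree for dobj/pobj is a deliberately naive heuristic; real pipelines add handling for passives, conjunctions, and clausal complements.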
2. Coreference Resolution: Solving the "He/She/It" Problem
In a 10-page document, the name "Project Titan" might only be mentioned once. After that, the author uses words like "It," "The project," or "The initiative."
A Vector RAG system treats "It" as a meaningless pronoun.
A Graph RAG system uses Coreference Resolution to map all those "Its" back to the Project Titan node ID.
Without this step, your graph will have thousands of isolated facts about an entity named "It" that is connected to nothing.
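A full coreference model is beyond this lesson, but the idea can be sketched with a deliberately naive heuristic: link each pronoun to the nearest preceding named entity. This resolve_pronouns helper is an illustration only; a production pipeline would use a dedicated coreference component instead.

import spacy

nlp = spacy.load("en_core_web_sm")

def resolve_pronouns(text):
    doc = nlp(text)
    mapping = []
    for token in doc:
        if token.pos_ == "PRON":
            # Find the nearest named-entity mention that precedes this pronoun
            prior = [ent for ent in doc.ents if ent.end <= token.i]
            if prior:
                mapping.append((token.text, prior[-1].text))
    return mapping

print(resolve_pronouns("Sudeep launched Project Titan in May. It was delayed twice."))
# Rough expectation (depends on the NER model): [('It', 'Project Titan')]

Note the failure modes: whether "Project Titan" is even detected depends on the NER model, and "nearest preceding entity" breaks on nested or competing mentions. That is exactly what trained coreference models solve.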
3. The "Entity Salience" Score
Not every word is an entity. NLP models can assign a Salience Score to a word, based on how "Central" it is to the sentence or document.
- High Salience: "The President" (The main subject).
- Low Salience: "Tuesday" (Just a background detail).
By filtering for high-salience entities, you prevent your graph from becoming a "Noisy Haystack."
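There is no single standard salience formula; as an assumption-laden sketch, here is a toy score combining mention frequency with a first-sentence boost (the salience_scores helper and the 2.0 boost factor are invented for illustration):

import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

def salience_scores(text):
    doc = nlp(text)
    first_sentence = next(doc.sents).text
    counts = Counter(ent.text for ent in doc.ents)
    # Frequency, boosted if the entity is introduced in the first sentence
    return {
        ent: freq * (2.0 if ent in first_sentence else 1.0)
        for ent, freq in counts.most_common()
    }

print(salience_scores(
    "Sudeep presented the Graph RAG course on Tuesday. Sudeep then took questions."
))
# Entities and scores depend on the NER model; 'Sudeep' should outrank 'Tuesday'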
graph TD
Raw[Raw Sentence] --> T[Tokenizer]
T --> P[POS Tagger]
P --> D[Dependency Parser]
D -->|Export| T1((Subject))
D -->|Export| T2((Verb))
D -->|Export| T3((Object))
style T1 fill:#4285F4,color:#fff
style T3 fill:#34A853,color:#fff
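This diagram maps directly onto spaCy's pipeline architecture. Assuming en_core_web_sm is installed, you can inspect the stages yourself:

import spacy

nlp = spacy.load("en_core_web_sm")
# Tokenization always runs first and is not listed as a component;
# 'tagger' is the POS Tagger and 'parser' is the Dependency Parser.
print(nlp.pipe_names)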
4. Why We Still Need Classic NLP in an LLM World
You might ask: "Why not just give the raw text to Gemini and ask for the graph?"
- Cost: Processing 1TB of text through a top-tier LLM is prohibitively expensive (a back-of-envelope sketch follows this list).
- Latency: A classic NLP pipeline running on CPU is often 100x faster than an LLM call.
- Accuracy in Cleaning: LLMs can sometimes "Clean" text so aggressively that they change its meaning. Classic NLP keeps the structure as "Ground Truth" while removing the fluff.
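To make the cost point concrete, here is a back-of-envelope sketch. Every number in it is an assumption (bytes per token, pricing); plug in your own provider's rates:

# All numbers here are illustrative assumptions, not real provider pricing.
BYTES_PER_TOKEN = 4                        # rough average for English text
TOKENS_IN_1TB = 10**12 // BYTES_PER_TOKEN  # ~250 billion tokens
PRICE_PER_MILLION_TOKENS = 1.00            # assumed $ per 1M input tokens

llm_cost = TOKENS_IN_1TB / 1_000_000 * PRICE_PER_MILLION_TOKENS
print(f"LLM pass over 1TB of text: ~${llm_cost:,.0f}")  # ~$250,000 at these assumptions

Even if the per-token price is off by an order of magnitude, the conclusion holds: a cheap syntactic first pass that discards low-value text pays for itself.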
5. Implementation: Exploring Dependencies with spaCy
Let's see how we can programmatically identify the "Skeleton" of a sentence.
import spacy

# Load the transformer-based model for better accuracy
nlp = spacy.load("en_core_web_trf")

text = "Sudeep led the development of the new Graph RAG course."

def parse_sentence(text):
    doc = nlp(text)
    for token in doc:
        # Print each token's dependency label and its syntactic head
        print(f"Token: {token.text:12} | Tag: {token.dep_:10} | Head: {token.head.text}")

# RUN
parse_sentence(text)

# LOGIC:
# Sudeep is 'nsubj' (Subject)
# led is the 'ROOT' (Verb)
# development is 'dobj' (Direct Object)
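To see the tree behind this output rather than a flat list, spaCy bundles the displaCy visualizer (the same tool used in Exercise 2 below):

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")  # any installed English model works
doc = nlp("Sudeep led the development of the new Graph RAG course.")
# Serves an interactive dependency tree at http://localhost:5000;
# in a Jupyter notebook, use displacy.render(doc, style="dep") instead.
displacy.serve(doc, style="dep")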
6. Summary and Exercises
NLP is the "Filter" that prepares raw data for the graph store.
- Dependency Parsing extracts the raw (S, V, O) structure.
- Coreference Resolution unifies pronouns and descriptors ("He", "The project") with their entities ("Sudeep").
- Salience helps you ignore the background noise.
- Hybrid Pipelines use NLP for speed and LLMs for reasoning.
Exercises
- Coreference Challenge: Write a 3-sentence story about a cat. How many times did you use the word "it" or "the animal"? How would a graph represent this without coreference resolution?
- Dependency Visualization: Go to Explosion's displaCy demo. Type in a complex sentence. Can you identify the "Verb" that would become the Edge Type?
- Speed Test: If an LLM takes 5 seconds to process a paragraph and spaCy takes 0.01 seconds, how long does each take to process a 100,000-page archive?
In the next lesson, we will look at the higher-level engine: LLM Pipelines for Fact Extraction.