Handling Tabular and Structured Data: CSV to Graph

Not all knowledge comes from PDFs. In fact, some of the most "Truthful" data inside a company lives in Tables—CSVs, SQL Databases, and Excel sheets. However, tables are the opposite of graphs; they are flat, compartmentalized, and rigid.

To build a comprehensive Graph RAG system, you must learn how to "Pivot" these tables into a network. In this lesson, we will explore the Mapping Pattern. We will learn how to turn a database Row into a Node and a Foreign Key into an Edge. We will also learn why structured data is the perfect "Verifiable Foundation" for your Knowledge Graph.

1. The Column-to-Node Mapping Pattern

How do you turn a spreadsheet into a graph?

Rule 1: The ID Column becomes the Node. If you have a table Employees, the employee_id becomes the primary node label.

Rule 2: The Foreign Key becomes the Edge. If the Employees table has a column dept_id, that creates an edge to the Department node.

Rule 3: Non-linking columns become Properties. salary, hire_date, and first_name stay inside the node as properties.

2. Denormalization for Graph Speed

In a SQL database, you might have 5 "Linker Tables" to represent a complex relationship.

Employee -> JobHistory -> JobRole -> Permission.

In a Graph, we flatten this for the AI. We create a direct edge from Employee to Permission with a property origin: JobHistory. This allows the AI agent to answer "What can Sudeep do?" in 1 hop instead of 4 joins.

3. Merging Structured and Unstructured Knowledge

This is where Graph RAG truly shines. You can "Ground" a loose text document in a rigid SQL fact.

SQL Source: (Sudeep) -[:SALARY]-> ($100k)
PDF Source: "Sudeep is doing a great job."
Unified Graph: (Sudeep {perf: 'Great'}) -[:SALARY]-> ($100k)

By linking the person mention in the PDF to the unique ID in the SQL database, you give your AI agent Superpowers. It can now answer: "What is the salary of the person the manager praised in the June Report?"

graph LR
    subgraph "SQL Table: Employees"
    R1[Row: Sudeep]
    R2[DeptID: 101]
    end
    
    subgraph "The Graph Result"
    N1((Sudeep)) -- WORKS_IN --> N2((Dept: 101))
    N1 -- HAS_PROPERTY --> P1[Salary: 100k]
    end
    
    R1 --> N1
    R2 --> N2
    
    style N1 fill:#4285F4,color:#fff
    style N2 fill:#34A853,color:#fff

4. Implementation: A CSV-to-Cypher Ingester in Python

Let's write a script that generates the code to build a graph from a simple CSV.

import csv

# Mock CSV data
# id,name,manager_id
data = [
    ["1", "Alice", "None"],
    ["2", "Bob", "1"],
    ["3", "Charlie", "1"]
]

def generate_ingestion_script(rows):
    script = []
    for uid, name, mid in rows:
        # Create the Person node
        script.append(f"MERGE (p:Person {{id: '{uid}'}}) SET p.name = '{name}'")
        
        # Create the Relationship if manager exists
        if mid != "None":
            script.append(f"""
                MATCH (p:Person {{id: '{uid}'}})
                MATCH (m:Person {{id: '{mid}'}})
                MERGE (p)-[:REPORTS_TO]->(m)
            """)
    return script

# This script turns the flat CSV into a hierarchical graph tree.
for line in generate_ingestion_script(data):
    print(line)

5. Summary and Exercises

Structured data is the "Truth" that ground your graph.

IDs become nodes.
Foreign Keys become edges.
Denormalization optimizes the graph for AI retrieval (fewer hops).
Linking is the secret to combining SQL-certainty with NLP-intuition.

Exercises

Pivot Task: Take a table of "Song, Artist, Album, Year." How many node labels and relationship types would you create? What stays as a property?
Join vs. Edge: Why is searching for "Sudeep's Laptop" in a graph faster than a SQL JOIN between 3 tables (Employee, Assignment, Laptop)?
Data Quality: If a CSV has an empty row for "Manager," what logic should your ingestion script use to avoid creating an edge to a "Null" node?

In the next lesson, we will look at moving data: Ingestion from APIs and Live Streams.