
Measuring Hallucination: The Multi-Hop Reality Check
Detect the invisible lies. Learn how multi-hop reasoning increases the risk of 'Imaginary Links' and how to build automated checks to verify every step of the AI's logical chain.
Hallucination is the "Final Boss" of RAG. In standard vector RAG, a hallucination is usually a factual error ("He was born in 1980" instead of 1985). But in Graph RAG, we face a more subtle and dangerous lie: The Hallucinated Relationship.
This happens when an LLM correctly finds Node A and Node C, but invents the link between them, skipping or misrepresenting the intermediate Node B. Because the result looks logical, it is very hard for a human to catch. In this lesson, we will learn how to detect these "Logical Fabrications" using Automated Path Verification and Triplet Extraction Cross-Referencing.
1. Why Multi-Hop Reasoning Increases High-Stakes Lies
When an LLM performs a 3-hop reasoning chain, it is essentially "Connecting the Dots" in its mind.
- Node A: Sudeep
- Node C: Project Titan
- Hallucination: "Sudeep is the CREATOR of Project Titan."
- Graph Reality: Sudeep is just a MEMBER, and the Creator is Jane.
The LLM "smooths" the relationship to make it sound more impressive or direct. This is Relationship Infidelity.
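To make this concrete, here is a tiny sketch contrasting the edges the graph actually contains with the shortcut the model invents; the triples follow the Sudeep / Project Titan example above and are purely illustrative.

```python
# The graph's real edges vs. the model's invented shortcut.
# Triples are (subject, predicate, object); names follow the example above.
graph_reality = [
    ("Sudeep", "MEMBER_OF", "Project Titan"),
    ("Jane", "CREATOR_OF", "Project Titan"),
]

hallucinated_claim = ("Sudeep", "CREATOR_OF", "Project Titan")

# The claim reuses two real endpoints, which is why it sounds plausible,
# but the connecting edge does not exist anywhere in the graph.
print(hallucinated_claim in graph_reality)  # -> False
```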
2. Strategy: The Triplet-Audit Loop
To detect this, we use a post-processing step:
- Extract Triplets from Answer: We ask a separate (small) LLM: "Extract all Subject-Predicate-Object facts from this AI answer."
- Verify against Graph: For every extracted fact (S, P, O), we perform a direct query against the Graph Database: `MATCH (S)-[r]->(O) RETURN type(r)`.
- Conflict Score: If the AI says `[Sudeep] -[:CREATOR]-> [Titan]` and the DB says "Relationship not found," we flag it as a Hallucination (see the sketch below).
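Below is a minimal sketch of the Triplet-Audit Loop, assuming a Neo4j backend reached through the official `neo4j` Python driver (v5 API), nodes keyed by a `name` property, and a hypothetical `extract_triplets()` helper standing in for the small extraction LLM. The connection details and names are placeholders, not a fixed implementation.

```python
# Sketch of the Triplet-Audit Loop against a Neo4j knowledge graph.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def extract_triplets(answer_text: str) -> list[tuple[str, str, str]]:
    """Placeholder for the LLM call returning (subject, predicate, object) facts."""
    raise NotImplementedError

def relationship_exists(tx, subject: str, predicate: str, obj: str) -> bool:
    # Fetch every edge type between the two named nodes and check whether
    # the claimed predicate is among them.
    record = tx.run(
        "MATCH (s {name: $subject})-[r]->(o {name: $obj}) "
        "RETURN collect(type(r)) AS rel_types",
        subject=subject,
        obj=obj,
    ).single()
    return record is not None and predicate in record["rel_types"]

def audit_answer(answer_text: str) -> list[dict]:
    """Return a flag for every claim that has no supporting edge in the graph."""
    flags = []
    with driver.session() as session:
        for subject, predicate, obj in extract_triplets(answer_text):
            grounded = session.execute_read(relationship_exists, subject, predicate, obj)
            if not grounded:
                flags.append({"claim": (subject, predicate, obj), "status": "HALLUCINATION"})
    return flags
```

In practice the extractor's predicate vocabulary rarely matches your edge types exactly, so normalize both sides (for example, uppercase with underscores) before comparing.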
3. The "Evidence Score" Metric
For every answer your bot gives, you should display an Evidence Score.
- 100%: Every factual claim in the answer is backed by a direct edge in the retrieved subgraph.
- 50%: Half the facts are from the graph; the other half are "Common Sense" from the LLM's weights.
RAG Pro Tip: In financial or medical RAG, you should block any answer with an Evidence Score below 90%.
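As a sketch of how that threshold might be enforced, the snippet below gates the final response on its score; the 90% cutoff, the function name, and the refusal wording are illustrative choices, not fixed requirements.

```python
# Sketch: gate the final response on its Evidence Score before showing it to the user.
# The 0.90 cutoff is an example for high-stakes (financial / medical) domains.
EVIDENCE_THRESHOLD = 0.90

def gate_answer(answer_text: str, evidence_score: float) -> str:
    """Return the answer annotated with its score, or a refusal if it is under-grounded."""
    if evidence_score < EVIDENCE_THRESHOLD:
        return (
            "I could not verify enough of this answer against the knowledge graph "
            f"(Evidence Score: {evidence_score:.0%}), so I am withholding it."
        )
    return f"{answer_text}\n\n(Evidence Score: {evidence_score:.0%})"

# Example: a half-grounded answer gets blocked under the 90% policy.
print(gate_answer("Sudeep is the creator of Project Titan.", 0.50))
```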
```mermaid
graph TD
    A[AI Answer] --> E[Triplet Extractor]
    E -->|Triplet: S-P-O| V[Graph Verifier]
    V -->|Query| DB[(Knowledge Graph)]
    DB -->|Exists?| RES[Resolution]
    RES -->|No| H[Hallucination Flag]
    RES -->|Yes| G[Grounded Fact]
    style H fill:#f44336,color:#fff
    style G fill:#34A853,color:#fff
```
4. Implementation: Verifying a Chain in Python
```python
def verify_answer(answer_text, retrieved_subgraph):
    # 1. Get the claims the AI made (as Subject-Predicate-Object triplets)
    claims = llm.extract_claims(answer_text)
    if not claims:
        return 0.0  # nothing verifiable, so nothing is grounded

    # 2. Check each claim against the evidence we actually retrieved
    verified_count = 0
    for claim in claims:
        # Does this exact relationship exist in our graph evidence?
        if is_in_graph(claim, retrieved_subgraph):
            verified_count += 1

    # 3. The Evidence (Reliability) Score: fraction of claims backed by the graph
    return verified_count / len(claims)
```
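One simple way to back the `is_in_graph()` helper is to represent the retrieved subgraph as a set of (subject, predicate, object) triples, which turns verification into a membership test. The sketch below uses illustrative names, not data from a real graph.

```python
# Sketch: the retrieved subgraph as a set of (subject, predicate, object) triples,
# so `is_in_graph` becomes a simple membership test.
def is_in_graph(claim, retrieved_subgraph):
    return tuple(claim) in retrieved_subgraph

subgraph = {
    ("Sudeep", "MEMBER_OF", "Project Titan"),
    ("Jane", "CREATOR_OF", "Project Titan"),
}

claims = [
    ("Sudeep", "MEMBER_OF", "Project Titan"),   # grounded in the graph
    ("Sudeep", "CREATOR_OF", "Project Titan"),  # the hallucinated relationship
]

verified = sum(1 for claim in claims if is_in_graph(claim, subgraph))
print(f"Evidence Score: {verified / len(claims):.0%}")  # -> Evidence Score: 50%
```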
5. Summary and Exercises
Hallucination in graphs is a failure of Connectivity Integrity.
- Relationship Infidelity occurs when the AI "Simplifies" a path.
- Triplet Extraction allows for programmatic "Fact Checking."
- Evidence Scores provide transparency to the end user.
- Multi-Step verification is necessary whenever the reasoning exceeds 1 hop.
Exercises
- Lie Detection: An AI says "Dr. Smith treated the patient with Aspirin." You look at the graph and see `[Dr. Smith] -[:WORKS_IN]-> [Hospital] <-[:TREATED_IN]- [Patient]`. Did the AI hallucinate a relationship? Why?
- Threshold Setting: If your bot is for "Movie Trivia," what is an acceptable Hallucination Rate? What if the bot is for "Prescription Drug Interactions"?
- Visualization: Draw a 3-hop path that is "True." Now, write a "Hallucinated" version of that path that sounds believable but changes the core relationship.
In the next lesson, we will look at technical benchmarks: End-to-End Performance Benchmarking.