Cleansing and Conflict Resolution: The Filter of Truth

Resolve contradictions in your Knowledge Graph. Learn how to handle conflicting facts from multiple sources and build a consensus-based retrieval system for high-integrity AI.

In a perfect world, all your data sources agree. In the real world, Source A says "Sudeep is in London," Source B says "Sudeep is in Tokyo," and Source C says "Sudeep left the company." If you put all three in your graph without a strategy, your AI agent will be hopelessly confused—and a confused agent is an agent that hallucinates.

In this lesson, we will learn how to build a Consensus Engine for your Knowledge Graph. We will explore Fact Probabilities, Source Authority, and Conflict Management. We will see how to handle the "He Said/She Said" problem in data and why "The Most Recent Fact" isn't always the "True Fact."


1. The Conflict Types

  1. Direct Contradiction: (Node)-[:STAYS_IN]->(London) vs (Node)-[:STAYS_IN]->(Tokyo) (see the detection sketch below).
  2. Attribute Staleness: Source A has 2023 salary data; Source B has 2024 salary data.
  3. Entity Confusion: Source A thinks JS-101 is "JavaScript," Source B thinks it is "Job Sheet 101."
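
Here is a minimal sketch of spotting Direct Contradictions programmatically. The fact tuples and the find_direct_contradictions helper are illustrative assumptions, not a real ingestion format: two facts conflict when they share a subject and predicate but disagree on the object.

from collections import defaultdict

# Illustrative facts as (subject, predicate, object, source) tuples.
facts = [
    ("Sudeep", "STAYS_IN", "London", "HR"),
    ("Sudeep", "STAYS_IN", "Tokyo", "Slack"),
    ("Sudeep", "WORKS_ON", "Graph", "Jira"),
]

def find_direct_contradictions(facts):
    # Group claims by (subject, predicate); any group with more than
    # one distinct object is a Direct Contradiction.
    groups = defaultdict(set)
    for subj, pred, obj, source in facts:
        groups[(subj, pred)].add((obj, source))
    return {key: claims for key, claims in groups.items()
            if len({obj for obj, _ in claims}) > 1}

print(find_direct_contradictions(facts))
# e.g. {('Sudeep', 'STAYS_IN'): {('London', 'HR'), ('Tokyo', 'Slack')}}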

2. Strategy 1: Authority Ranking (Source Weighting)

Not all sources are equal. You should assign an Authority Score to your connectors.

  • Source: HR Database -> Authority: 1.0 (The ultimate truth).
  • Source: Slack Channel -> Authority: 0.4 (Maybe rumors).
  • Source: Web Scraping -> Authority: 0.1 (Unverified).

Resolution Rule: If HR says "London" and Slack says "Tokyo," the HR fact Overwrites the Slack fact.


3. Strategy 2: Temporal Priority (The "Last Update" Wins)

This is the simplest resolution logic.

  • "Whatever fact has the newest timestamp property is the truth."

Danger: What if the Newest fact is a typo or a malicious injection? This is why Temporal Priority should usually be used within the same Authority level.
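
A minimal sketch of that rule, assuming simple fact dicts with authority and timestamp fields (both names are illustrative): the timestamp tie-break only applies when the two facts sit in the same authority tier.

from datetime import datetime, timezone

def resolve_temporal(current, incoming):
    # Different authority tiers: fall back to Authority Ranking.
    if incoming["authority"] != current["authority"]:
        return current if current["authority"] > incoming["authority"] else incoming
    # Same tier: the newer timestamp wins.
    return incoming if incoming["timestamp"] > current["timestamp"] else current

current = {"value": "London", "authority": 0.4,
           "timestamp": datetime(2024, 1, 10, tzinfo=timezone.utc)}
incoming = {"value": "Tokyo", "authority": 0.4,
            "timestamp": datetime(2024, 3, 2, tzinfo=timezone.utc)}

print(resolve_temporal(current, incoming)["value"])  # Tokyo (newer, same tier)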


4. Strategy 3: Multi-Truth Representation

Sometimes, there is no "True" fact.

  • Query: "Is Sudeep working on the Graph project?"
  • Graph: Source A says YES, Source B says NO.

Solution: Don't resolve. Store both.

(Sudeep) -[:MENTIONED_AS_MEMBER {source: 'Slack', confidence: 0.6}]-> (Graph)
(Sudeep) -[:NOT_IN_ROSTER {source: 'HR', confidence: 1.0}]-> (Graph)

AI Outcome: The agent can now tell the user: "According to the HR records, Sudeep is not on the team, but there are discussions in Slack indicating he might be contributing." This is Transparent AI.
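
A minimal sketch of multi-truth storage, assuming a flat list of edge records (the field names and the render_conflict_context helper are illustrative): instead of discarding the losing claim, keep every claim and surface the disagreement in the agent's prompt context.

# Keep every claim, tagged with its source and confidence.
edges = [
    {"subject": "Sudeep", "relation": "MENTIONED_AS_MEMBER", "object": "Graph",
     "source": "Slack", "confidence": 0.6},
    {"subject": "Sudeep", "relation": "NOT_IN_ROSTER", "object": "Graph",
     "source": "HR", "confidence": 1.0},
]

def render_conflict_context(edges):
    # Format conflicting claims as transparent context for the agent prompt,
    # highest-confidence source first.
    lines = ["The knowledge graph holds conflicting claims:"]
    for e in sorted(edges, key=lambda e: -e["confidence"]):
        lines.append(
            f"- {e['source']} (confidence {e['confidence']}): "
            f"{e['subject']} {e['relation']} {e['object']}"
        )
    return "\n".join(lines)

print(render_conflict_context(edges))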

The resolution flow, as a Mermaid diagram:

graph TD
    A[Source A: HR] -->|Fact: London| CR[Conflict Resolver]
    B[Source B: Slack] -->|Fact: Tokyo| CR
    CR -->|Auth Check| KG[(Knowledge Graph)]
    KG -->|Result| F[Final Fact: London]
    
    style A fill:#34A853,color:#fff
    style B fill:#f4b400,color:#fff
    style F fill:#4285F4,color:#fff

5. Implementation: A Conflict-Aware Ingester

Let's write a Python function that implements Source Authority.

source_authority = {
    "HR": 1.0,
    "Jira": 0.8,
    "Slack": 0.4
}

# Current state of the graph: each entity tracks the authority
# weight of the source that last wrote it.
current_graph = {
    "Sudeep": {"location": "London", "weight": 1.0}
}

def update_property(entity, key, val, source):
    # Unknown sources default to a low authority of 0.1.
    new_weight = source_authority.get(source, 0.1)
    current_weight = current_graph.get(entity, {}).get("weight", 0)

    if new_weight >= current_weight:
        # setdefault covers entities we have never seen before,
        # avoiding a KeyError on first insert.
        node = current_graph.setdefault(entity, {})
        node[key] = val
        node["weight"] = new_weight
        print(f"Updated {entity} {key} to {val} (via {source})")
    else:
        print(f"Rejected {val} from {source}: current authority {current_weight} > {new_weight}")

# TEST
update_property("Sudeep", "location", "Tokyo", "Slack")  # REJECTED (0.4 < 1.0)
update_property("Sudeep", "location", "Remote", "HR")    # UPDATED (1.0 >= 1.0)

6. Summary and Exercises

Conflict resolution is the "Immune System" of your Knowledge Graph.

  • Authority Ranking ensures high-fidelity sources win.
  • Temporal Priority handles sequential updates.
  • Multi-Truth stores conflicting perspectives for the LLM to analyze.
  • Transparency is better than False Certainty.

Exercises

  1. Authority Ranking: You have three sources: Wikipedia, A Peer-Reviewed Journal, and X (Twitter). Rank them from 1 to 10 for a graph about "Medical Statistics."
  2. Conflict Narrative: How would you prompt an LLM to explain that two departments disagree on a project's deadline?
  3. The Overwrite Risk: If the CEO makes a typo in an email ("Project starts in 1999" instead of "2009"), and the CEO source has a high authority, how does your system recover from high-authority errors?

In the next lesson, we will look at how to scale this entire process: Scaling Ingestion with Distributed Systems.
