Module 4 Lesson 5: Data Provenance & Integrity

Know your sources. Learn how to implement data lineage and integrity checks to ensure that your training data hasn't been tampered with or replaced.


If you don't know where your data came from, you can't trust what your AI says. Data Provenance (Lineage) and Integrity are the twin pillars of a secure data pipeline.

```mermaid
graph LR
    subgraph "Trust Chain"
    S[Source: Vendor/Sensor] -- "Sign & Hash" --> C[Ingestion: Verify Signature]
    C -- "Store as Immutable" --> D[Vault: WORM Storage]
    D -- "Verify Hash" --> T[Training Environment]
    T -- "Log Metadata" --> A[Lineage Audit Log]
    end
```

1. What is Data Provenance?

Provenance is the Accountability Trail for data.

  • "This image was captured by Camera A on 2024-01-01, stored in S3 Bucket B, and processed by Script C before being used for Training."
  • If you find a "Poisoned" sample, provenance tells you who uploaded it and when, allowing you to clear out other "Dirty" data from the same source.
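In practice, a lineage record can be as simple as structured metadata written to an append-only log at ingestion time. Here is a minimal sketch in Python; the `ProvenanceRecord` class and its field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass, asdict
import json

# Illustrative per-sample provenance record (not a standard schema).
@dataclass
class ProvenanceRecord:
    sample_id: str
    source: str        # e.g. a camera ID or vendor name
    collected_at: str  # ISO 8601 timestamp
    storage_uri: str   # where the raw sample lives
    processed_by: str  # script or pipeline step that transformed it

record = ProvenanceRecord(
    sample_id="img-0001",
    source="camera-a",
    collected_at="2024-01-01T00:00:00Z",
    storage_uri="s3://bucket-b/raw/img-0001.jpg",
    processed_by="resize_v2.py",
)

# One JSON line per sample, appended to a lineage log.
line = json.dumps(asdict(record))
print(line)
```

With records like this, tracing a poisoned sample back to its source (and finding its siblings) becomes a log query rather than guesswork.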

2. Ensuring Data Integrity

Integrity ensures that the data wasn't changed after it was collected.

  • Technique: Hashing (Checksums).
    • As soon as a dataset is finalized, calculate its SHA-256 hash.
    • Before training begins, re-calculate the hash. If they don't match, someone (or a piece of malware) has tampered with your data.
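The finalize-then-verify flow above can be sketched with Python's standard library. The `sha256_file` helper is illustrative (it streams the file in chunks so large datasets don't need to fit in memory), and the temporary file stands in for a real dataset archive:

```python
import hashlib
import os
import tempfile

def sha256_file(path, chunk_size=8192):
    """Stream a file through SHA-256 so large datasets fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# A throwaway file standing in for a finalized dataset archive.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"label,pixel\ncat,0.1\n")
    path = f.name

expected = sha256_file(path)   # recorded when the dataset is finalized
actual = sha256_file(path)     # recomputed just before training
assert actual == expected, "Dataset hash mismatch: possible tampering!"
os.unlink(path)
```

In a real pipeline, `expected` would be stored separately from the data itself (e.g. in a signed manifest), so an attacker who modifies the dataset cannot also quietly update the checksum.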

3. The "Chain of Custody" for ML

  1. Source Verification: Use digital signatures to verify that data from a vendor is actually from that vendor.
  2. Immutability: Store finalized training sets in "Write Once, Read Many" (WORM) storage.
  3. Audit Logging: Every time a data scientist "views" or "edits" the training data, it should be recorded in a tamper-proof log (like CloudTrail or a dedicated security log).
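The "tamper-proof log" in step 3 is often implemented as a hash chain: each entry's hash covers the previous entry's hash, so altering any earlier record invalidates every entry after it. A minimal sketch, where `chain_entry` and `verify` are illustrative helpers rather than any specific product's API:

```python
import hashlib
import json

def chain_entry(prev_hash, event):
    """Build a log entry whose hash covers the previous entry's hash."""
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return {"prev": prev_hash, "event": event,
            "hash": hashlib.sha256(payload.encode()).hexdigest()}

def verify(log, genesis="0" * 64):
    """Recompute every hash in order; any edit breaks the chain."""
    prev = genesis
    for e in log:
        payload = json.dumps({"prev": prev, "event": e["event"]}, sort_keys=True)
        if e["prev"] != prev or hashlib.sha256(payload.encode()).hexdigest() != e["hash"]:
            return False
        prev = e["hash"]
    return True

log = []
prev = "0" * 64  # genesis value
for event in ["alice viewed train_v1", "bob edited train_v1"]:
    entry = chain_entry(prev, event)
    log.append(entry)
    prev = entry["hash"]

assert verify(log)                        # untouched chain verifies
log[0]["event"] = "alice viewed nothing"  # simulate tampering
assert not verify(log)                    # any later check now fails
```

Managed services like CloudTrail apply the same principle (log file integrity validation via digests) so that even an attacker with write access cannot silently rewrite history.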

4. Modern Tools for Lineage

  • DVC (Data Version Control): Like Git but for large data files.
  • MLflow: Tracks entire experiments, including which data led to which version of the model.
  • OpenLineage: A standard for describing data movement between different systems (Spark, Airflow, SQL).

Exercise: The Investigator's Log

  1. You discover a "Backdoor" trigger in your model. You have the hash of the dataset used for training. What is your first step?
  2. Why is "Data Integrity" more difficult in a "Streaming" environment where the model is constantly learning from new user inputs?
  3. What is the danger of using "Anonymous" data from a community hub without a signed manifest?
  4. Research: What is "Software Bill of Materials" (SBOM) for data? (Often called a "Data Bill of Materials" or DBOM).

Summary

You have completed Module 4: Data Security and Data Poisoning. You now understand that data is a high-value security asset, how it can be poisoned with backdoors, why it might leak secrets, and how to build a "Trustworthy Pipeline" using provenance and integrity checks.

Next module: Module 5: Model-Level Attacks ("The Silent Steal").
