
Module 11 Lesson 2: Correlation vs. Causation
AI knows what follows what, but does it know why? In this lesson, we learn why LLMs are "Stochastic Parrots" when it comes to cause and effect.
One of the most important rules in science is: Correlation is not Causation.
- Correlation: Every time I see a rainbow, I see people with umbrellas. (They happen together).
- Causation: Rain causes the rainbow and causes people to use umbrellas. (One leads to the other).
LLMs are built entirely on Correlation. In this lesson, we explore why this makes them brilliant at predicting patterns but dangerous when used to determine cause and effect.
1. The "Bullseye" Problem: Frequent Failures
Because LLMs are trained on common data, they are biased toward "popular" mistakes.
- If 70% of the internet incorrectly believes that "shaving makes hair grow back thicker," the LLM will statistically correlate "Shaving" with "Thicker hair."
- The model doesn't check the biology; it just checks the frequency of the word pattern (see the sketch below).
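To make this concrete, here is a minimal sketch of a purely frequency-driven predictor (a toy word counter, not a real LLM; the four-line corpus is invented for illustration):

```python
from collections import Counter

# Toy "training data": the popular myth simply appears more often than the fact.
corpus = [
    "shaving makes hair grow back thicker",
    "shaving makes hair grow back thicker",
    "shaving makes hair grow back thicker",
    "shaving makes hair grow back blunter, not thicker",
]

prefix = "shaving makes hair grow back"

# Count which word most often follows the prefix -- no biology is consulted.
next_words = Counter(
    line[len(prefix):].split()[0]
    for line in corpus
    if line.startswith(prefix)
)

print(next_words.most_common(1))  # [('thicker', 3)] -- frequency wins, accuracy loses
```

The counter never asks whether the continuation is true; it only asks how often it has seen it.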
2. No Sense of "If / Then / Else"
In traditional software, we write explicit causal rules: If temperature > 100 °C then status = BOILING.
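A minimal sketch of such an explicit rule in Python (the function name and the extra branches are just illustrative):

```python
def water_status(temperature_c: float) -> str:
    """Explicit causal rule: the temperature alone determines the status."""
    if temperature_c > 100:
        return "BOILING"
    elif temperature_c > 0:
        return "LIQUID"
    else:
        return "FROZEN"

print(water_status(120))  # BOILING -- the rule fires because of the input, not because of word frequency
```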
In an LLM, there are no rules. There are only likely continuations. If the model sees the sentence "The sun rose because...", it will likely say "the earth rotated on its axis." Not because it understands planetary physics, but because that is the most common correlation in its library.
If you ask a confusing causal question: "Did the rooster's crowing cause the sun to rise?", a poorly-aligned model might say "Yes," simply because "Rooster" and "Sunrise" have a very high correlation in its training data.
```mermaid
graph LR
  subgraph "Correlation (What LLMs see)"
    A["Ice Cream Sales UP"] --- B["Drowning Incidents UP"]
  end
  subgraph "Causation (What Humans see)"
    C["SUMMER HEAT"] -- "causes" --> A
    C -- "causes" --> B
  end
```
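The diagram above can be reproduced with a tiny simulation (all numbers here are invented for illustration): temperature drives both series, and the two end up strongly correlated even though neither causes the other.

```python
import random
import statistics

random.seed(42)

# Hypothetical daily data: summer heat is the hidden common cause (the confounder).
temperature = [random.uniform(10, 35) for _ in range(365)]

# Both series depend on temperature plus independent noise; neither causes the other.
ice_cream_sales = [5.0 * t + random.gauss(0, 10) for t in temperature]
drownings = [0.3 * t + random.gauss(0, 1) for t in temperature]

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (statistics.pstdev(x) * statistics.pstdev(y))

print(round(pearson(ice_cream_sales, drownings), 2))  # roughly 0.85-0.9: high, yet not causal
```

A model that only sees the two series would happily predict one from the other; banning ice cream still would not prevent a single drowning, because only the confounder explains both.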
3. The "Stochastic Parrot" Debate
Critics often call LLMs "Stochastic Parrots." Just as a parrot can mimic the sound of a doorbell without knowing what a doorbell is, an LLM can mimic the sound of a logical argument without understanding the "Why" behind the logic.
As we move toward Agentic AI (where AI makes decisions), this lack of causal understanding is a major risk. An AI might "correlate" a server crash with a specific user logging in, and "decide" to ban that user, even if the real cause was a background software bug.
4. How to Bridge the Gap
To help LLMs with causation, we use:
- Structured Prompting: Forcing the model to list preconditions before drawing a conclusion.
- Formal Verification: Using external tools, such as a logic engine or a Python interpreter, to check that the AI's "correlated" math actually works in reality (see the sketch below).
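As a rough sketch of that second idea (the string in `llm_claim` is a made-up stand-in for whatever a real model returned): re-run the model's arithmetic in Python and only accept the claim if the two results agree.

```python
# Hypothetical model output: a plausible-sounding but wrong pattern-completion.
llm_claim = "17 * 24 = 418"

expression, _, claimed = llm_claim.partition("=")

# Re-compute the left-hand side ourselves. eval() is acceptable for this trusted
# toy string, but never eval untrusted model output in a real system.
actual = eval(expression, {"__builtins__": {}}, {})

if int(claimed.strip()) == actual:
    print("Claim verified.")
else:
    print(f"Claim rejected: {expression.strip()} is actually {actual}.")
```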
Lesson Exercise
Goal: Test Causal Logic.
- Ask an LLM: "If a helium balloon is floating inside my car and I accelerate forward, which way does the balloon move?"
- Watch the AI's response. This is a classic physics puzzle: the correct answer is 'Forward', because the accelerating car pushes the denser cabin air toward the rear, and the buoyant balloon gets squeezed the other way.
- Does the AI give you the "popular" wrong answer (it moves Backward) or the "correct" causal answer?
Observation: You'll see whether the AI is "echoing" the internet's intuition (most people guess Backward) or "calculating" the physics; a back-of-the-envelope version of the calculation is sketched below.
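For reference, here is that back-of-the-envelope sketch (the 3 m/s² acceleration is an arbitrary example value):

```python
import math

g = 9.81   # gravity, m/s^2
a = 3.0    # hypothetical forward acceleration of the car, m/s^2

# In the car's frame, "down" is gravity plus a rearward pseudo-acceleration.
# Dense air settles toward that effective "down" (the rear), so the
# lighter-than-air balloon drifts the other way: toward the front.
tilt_forward_deg = math.degrees(math.atan(a / g))

print(f"Helium balloon leans about {tilt_forward_deg:.0f} degrees toward the front.")
```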
Summary
In this lesson, we established:
- LLMs are masters of correlation but blind to causation.
- Frequency of data determines the model's "truth," leading to the repetition of popular myths.
- Without external logical grounding, LLMs can't distinguish between things that happen together and things that cause each other.
Next Lesson: We look at the "Long Game." We'll learn about Long-Horizon Planning and why AI agents often get "lost" in complex tasks.