Module 7 Lesson 6: Decision Trees and Random Forests

Branching out into AI logic. Learn how Decision Trees make choices like a human and how 'Random Forests' combine a hundred trees to create a genius-level model.

Linear Regression and Logistic Regression use math formulas to find a "line." But what if your data doesn't follow a line? What if it follows a set of rules, like: "If the weather is sunny AND I have a day off AND my friend is free, I will go to the beach"? That rule-based thinking is exactly how Decision Trees and Random Forests work.

Lesson Overview

In this lesson, we will cover:

  • Decision Trees: The simple "Flowchart" logic.
  • Overfitting: Why one tree is often too smart for its own good.
  • Random Forests: The Power of the "Ensemble."
  • The Voting System: How a forest reaches a conclusion.

1. What is a Decision Tree?

A Decision Tree is a flowchart-like structure where each "node" represents a choice. Example: Will a user cancel their subscription?

  1. Is the user active in the last 30 days? -> No -> CANCEL
  2. Active in the last 30 days? -> Yes -> Has the user contacted support? -> Yes -> CANCEL
  3. Active in the last 30 days? -> Yes -> Contacted support? -> No -> RETAIN

In scikit-learn, a tree like this is trained with DecisionTreeClassifier:

from sklearn.tree import DecisionTreeClassifier

# Simple example: [Age, Income]
X = [[25, 50000], [35, 70000], [45, 100000], [20, 20000]]
y = [0, 1, 1, 0] # 0 = No car, 1 = Has car

model = DecisionTreeClassifier()
model.fit(X, y)
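
To see the "flowchart" the tree actually learned, scikit-learn can print its if-then rules as plain text. Here is a small sketch continuing the toy example above (the new person's age and income, and the feature labels, are made up for illustration):

from sklearn.tree import export_text

# Predict for a new, made-up person: 30 years old, income 60,000
print(model.predict([[30, 60000]]))

# Print the tree's learned if-then rules
print(export_text(model, feature_names=["Age", "Income"]))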

2. The Danger of Overfitting

A single Decision Tree is very good at memorizing your training data. It will create a rule for every single detail. This is called Overfitting. The tree becomes so specialized to your data that it fails when it sees a new, real-world example.
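
You can see overfitting directly by comparing the tree's accuracy on the data it memorized against data it has never seen. Below is a rough sketch on a synthetic dataset (the dataset and the exact numbers are illustrative, not from this lesson):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A noisy, made-up dataset: 500 samples, 20 features, only 5 of which actually matter
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=42)  # no depth limit: the tree grows until it memorizes
tree.fit(X_train, y_train)

print("Accuracy on training data:", tree.score(X_train, y_train))  # typically near 100%
print("Accuracy on unseen data:  ", tree.score(X_test, y_test))    # noticeably lower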


3. Random Forests (The Wisdom of the Crowd)

To fix this overfitting problem, we use a Random Forest. Instead of relying on one very smart tree, we build 100 randomized trees.

  • Each tree is trained on a different random sample of the data (and considers only a random subset of features at each split).
  • Each tree is slightly "dumb" on its own.
  • But when it's time to predict, they all Vote on the final answer (see the voting sketch below).

The result is a model that is incredibly stable and accurate—often much better than a single tree or even a logistic regression.

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100) # 100 trees!
model.fit(X, y)
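
To make the voting concrete, you can ask every tree in the trained forest for its own answer and compare it with the forest's combined prediction. A rough sketch continuing the example above (note that scikit-learn actually averages the trees' probability estimates, a "soft" vote, but the idea is the same):

import numpy as np

sample = [[30, 60000]]  # a made-up new person: age 30, income 60,000

# Every tree in the forest gives its own prediction...
votes = np.array([tree.predict(sample)[0] for tree in model.estimators_]).astype(int)
print("Votes per class:", np.bincount(votes))  # how many trees said 0 vs. 1

# ...and the forest combines them into one final answer
print("Forest prediction:", model.predict(sample))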

4. Why Use Forests?

  1. Versatility: They work for both numbers (Regression) and categories (Classification).
  2. No Preprocessing: Unlike many models, they don't need feature scaling; they don't care whether your numbers are big or small.
  3. Insight: They can tell you which features were most important (e.g., "Income was more important than Age in this model"), as shown in the sketch below.
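
Here is a minimal sketch of pulling those importances out of the forest trained above (the feature names are labels we attach ourselves for readability):

# Importances sum to 1.0; a higher value means the feature was used more when splitting the data
for name, importance in zip(["Age", "Income"], model.feature_importances_):
    print(f"{name}: {importance:.2f}")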

Practice Exercise: The Forest Builder

  1. Create a dataset for predicting if a fruit is an Apple or an Orange based on Weight and Texture (1-10).
  2. Import RandomForestClassifier.
  3. Train (Fit) the model.
  4. Predict a fruit that is heavy (200g) but very smooth (Texture 2).
  5. Check what happens if you change the number of trees (n_estimators) from 10 to 500. (One possible starting point is sketched below.)
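
If you get stuck, one possible starting point is sketched below (the weights, texture scores, and 0/1 label encoding are all invented for this exercise; replace them with your own data):

from sklearn.ensemble import RandomForestClassifier

# [Weight in grams, Texture 1-10 (1 = very smooth, 10 = very bumpy)] -- invented sample data
X = [[150, 2], [140, 1], [160, 3], [170, 8], [180, 9], [155, 10]]
y = [0, 0, 0, 1, 1, 1]  # 0 = Apple, 1 = Orange (an assumed encoding)

model = RandomForestClassifier(n_estimators=10)  # step 5: try 10, then 500, and compare
model.fit(X, y)

print(model.predict([[200, 2]]))  # step 4: heavy (200g) but very smooth (Texture 2)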

Quick Knowledge Check

  1. What is the visual analogy of a Decision Tree?
  2. What is "Overfitting"?
  3. How does a Random Forest decide on a final prediction?
  4. What is the main benefit of using a Forest over a single Tree?

Key Takeaways

  • Decision Trees follow a simple "if-then" logic.
  • A single tree is prone to "overfitting" or memorizing.
  • Random Forests are an "Ensemble" method that combines many trees.
  • The "Wisdom of the Crowd" makes Random Forests one of the most reliable ML tools in existence.

What’s Next?

We’ve built many models now. But how do we know they are actually good? Is "90% accuracy" actually a good thing? In Lesson 7, we’ll learn how to evaluate our models like a data pro!
