Module 7 Lesson 5: Logistic Regression: Classification
Binary decisions made simple. Learn how to use Logistic Regression to categorize data into 'Yes or No' classes like Spam vs. No Spam or Pass vs. Fail.
Despite having "Regression" in its name, Logistic Regression is used for Classification. It doesn't predict a continuous number (such as 45.2); it predicts the probability that something belongs to a specific category (e.g., "There is a 95% chance this email is Spam").
Lesson Overview
In this lesson, we will cover:
- What is Classification?: Categorizing data into classes.
- The Sigmoid Function: Turning numbers into probabilities.
- Implementation: Building a model with LogisticRegression().
- Binary vs. Multi-class: Yes/No vs. Red/Blue/Green.
1. Regression vs. Classification
- Linear Regression: Predicting How Much? (House Price, Temperature).
- Logistic Regression: Predicting Which One? (Spam/No Spam, Pass/Fail, Malignant/Benign).
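To make the contrast concrete, here is a minimal sketch that fits a plain straight line to a small, made-up pass/fail dataset using NumPy's `np.polyfit`. The data values and variable names are assumptions chosen for illustration. The point is only that a straight line keeps going past 0 and 1, so its output cannot be read as a probability.

```python
import numpy as np

# Hypothetical data: hours studied and the outcome (0 = Fail, 1 = Pass)
hours = np.array([1, 2, 3, 5, 6, 7, 8])
passed = np.array([0, 0, 0, 1, 1, 1, 1])

# Fit a degree-1 polynomial (a straight line): passed ~ slope * hours + intercept
slope, intercept = np.polyfit(hours, passed, 1)

# Ask the line about a student who studied 0 hours and one who studied 12 hours
pred_0 = slope * 0 + intercept
pred_12 = slope * 12 + intercept

print(f"Line output for 0 hours:  {pred_0:.2f}")   # falls below 0
print(f"Line output for 12 hours: {pred_12:.2f}")  # rises above 1
```

Neither output is a valid probability, which is why classification needs a model whose output stays between 0 and 1.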
2. Coding the Model
Let's build a model that predicts whether a student will Pass or Fail an exam based on the hours they studied.
import numpy as np
from sklearn.linear_model import LogisticRegression
# 1. Prepare Data (Hours Studied)
X = np.array([[1], [2], [3], [5], [6], [7], [8]])
# 0 = Fail, 1 = Pass
y = np.array([0, 0, 0, 1, 1, 1, 1])
# 2. Instantiate
model = LogisticRegression()
# 3. Fit
model.fit(X, y)
# 4. Predict for a student who studied 4 hours
new_student = np.array([[4]])
prediction = model.predict(new_student)
probability = model.predict_proba(new_student)
print(f"Prediction (0=Fail, 1=Pass): {prediction[0]}")
print(f"Probability of Passing: {probability[0][1] * 100:.2f}%")
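As a rough check on what `predict_proba()` is doing, the sketch below rebuilds the probability by hand from the fitted model's `coef_` and `intercept_` attributes. It assumes the default binary `LogisticRegression` settings and reuses the same hours-studied data. The names `w`, `b`, `z`, `manual_prob`, and `manual_class` are illustrative, not part of Scikit-Learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same data as above: hours studied, 0 = Fail, 1 = Pass
X = np.array([[1], [2], [3], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 1, 1, 1, 1])
model = LogisticRegression().fit(X, y)

new_student = np.array([[4]])

# The model learns one weight per feature plus an intercept
w = model.coef_[0][0]
b = model.intercept_[0]

# Step 1: a linear score, exactly like a straight line
z = w * new_student[0][0] + b

# Step 2: squash the score into a probability with the sigmoid
manual_prob = 1 / (1 + np.exp(-z))

# Step 3: turn the probability into a class using a 0.5 cut-off
manual_class = int(manual_prob > 0.5)

sklearn_prob = model.predict_proba(new_student)[0][1]
sklearn_class = model.predict(new_student)[0]

print(f"Manual probability:  {manual_prob:.4f}")
print(f"sklearn probability: {sklearn_prob:.4f}")
print(f"Manual class: {manual_class}, sklearn class: {sklearn_class}")
```

The two probabilities should agree, which shows that the model is a linear score followed by a sigmoid and a threshold.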
3. The Math Magic: The Sigmoid Curve
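Linear Regression draws a straight line. Logistic Regression passes that line's output through the Sigmoid function, which bends it into an "S" shaped curve that always stays between 0 and 1.
- Large positive inputs (the top of the S) are pushed toward 1 (True).
- Large negative inputs (the bottom of the S) are pushed toward 0 (False).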
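The S-shaped curve can be written as sigmoid(z) = 1 / (1 + e^(-z)). The sketch below is a minimal, plain-Python version of that formula; the function name `sigmoid` and the sample inputs are illustrative choices. It is the simple textbook form and is not hardened against very large negative inputs.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Negative inputs land near 0, positive inputs land near 1,
# and an input of 0 sits exactly in the middle at 0.5.
for z in [-6, -2, 0, 2, 6]:
    print(f"z = {z:>2}  ->  sigmoid(z) = {sigmoid(z):.4f}")
```

Printing a handful of values like this is a quick way to see the flat bottom, the steep middle, and the flat top of the S.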
Practice Exercise: The Email Filter
- Imagine a dataset where X is the number of times the word "Win" appears in an email. y is 1 for Spam and 0 for Not Spam.
- Design a small dataset (5-10 rows) that shows a clear trend (more "wins" = more likely spam).
- Fit a LogisticRegression model.
- Predict the status of an email that has "Win" appearing 15 times!
Quick Knowledge Check
- Is Logistic Regression used for predicting numbers or categories?
- What is the name of the "S" shaped curve used in this model?
- What does predict_proba() return?
- Why wouldn't you use Linear Regression (a straight line) for classification? (Hint: A straight line could predict a value of -5 or 2, which makes no sense for a category!).
Key Takeaways
- Logistic Regression is a foundational algorithm for binary classification.
- It predicts a probability first, then assigns the final class (in Scikit-Learn's binary case, class 1 when that probability is above 0.5).
- It is used in medical diagnosis, credit scoring, and spam detection.
- The Scikit-Learn pattern remains the same: Import -> Instantiate -> Fit -> Predict.
What’s Next?
Logistic Regression is great for straightforward decisions. But what if the rules are more complex (e.g., "If it's sunny AND the temperature is > 20 AND it's a weekend...")? In Lesson 6, we’ll learn about Decision Trees!