Module 7 Lesson 8: Building a Spam Filter Project

AI in action. Build a fully functional SMS/Email spam filter using the Naive Bayes algorithm and learn how computers process human language.

In this lesson, we build a real-world application: a Spam Filter. This involves a sub-field of AI called Natural Language Processing (NLP). Since machines can't read words, we first have to turn our text into numbers using a technique called "Vectorization."

Project Overview

We will:

  1. Take a dataset of 5,000 SMS messages.
  2. Clean the text (lowercase, remove punctuation).
  3. Convert the text into a "Bag of Words" (Numbers).
  4. Train a Multinomial Naive Bayes model (a classic baseline for text classification).
  5. Test the filter on our own messages.

1. The Secrets of NLP: CountVectorizer

To turn text into numbers, we count how many times each word appears.

  • "Win money now" -> [win: 1, money: 1, now: 1, hey: 0, buddy: 0]
  • "Hey buddy" -> [win: 0, money: 0, now: 0, hey: 1, buddy: 1]

Scikit-Learn provides CountVectorizer to do this for us.


2. Implementing the Filter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# 1. Simple Data
emails = [
    "Win a free iPhone now!", # Spam
    "Are we still on for lunch?", # Ham (Not Spam)
    "URGENT: Your account has been compromised", # Spam
    "Hey, did you see the game last night?", # Ham
]
labels = [1, 0, 1, 0] # 1=Spam, 0=Ham

# 2. Create a Pipeline (Vectorize then Train)
model = make_pipeline(CountVectorizer(), MultinomialNB())

# 3. Fit
model.fit(emails, labels)

# 4. Predict
test_email = ["You have won a lottery! Click here"]
print(f"Prediction (1=Spam): {model.predict(test_email)[0]}")

3. Why Naive Bayes?

Naive Bayes is a "probabilistic" classifier. It learns how likely each word (like "free" or "win") is to appear in a Spam email versus a Ham email, then combines those per-word probabilities to score a whole message. It's incredibly fast, needs relatively little data, and variants of it are still used in production spam filters today!
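To see the probabilistic side in action, the same kind of pipeline can report its confidence with `predict_proba` instead of just a hard 0/1 label. A sketch reusing the toy dataset from section 2 (the test sentence is made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "Win a free iPhone now!",                    # Spam
    "Are we still on for lunch?",                # Ham
    "URGENT: Your account has been compromised", # Spam
    "Hey, did you see the game last night?",     # Ham
]
labels = [1, 0, 1, 0]  # 1=Spam, 0=Ham

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

# Probabilities come back in class order: [P(ham), P(spam)]
probs = model.predict_proba(["Win a free lottery now"])[0]
print(f"P(ham) = {probs[0]:.3f}, P(spam) = {probs[1]:.3f}")
```

Because "win", "free", and "now" all appeared in the Spam training messages, the model assigns this sentence a high spam probability, even though it has never seen the word "lottery".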


Practice Exercise: The Tone Analyzer

  1. Create a dataset of "Happy" vs. "Sad" sentences.
  2. Build a pipeline using CountVectorizer and MultinomialNB.
  3. Train it on your sentences.
  4. Predict the mood of a new sentence like: "What a beautiful day to be outside!"
  5. Check the accuracy using the metrics we learned in Lesson 7.
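One possible solution sketch for the exercise (the training sentences and labels here are made up; yours can be anything you like):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

sentences = [
    "What a great day, I feel wonderful",       # Happy
    "I love spending time outside with friends",# Happy
    "This is the worst morning ever",           # Sad
    "I feel so lonely and tired",               # Sad
]
moods = [1, 1, 0, 0]  # 1=Happy, 0=Sad

# Same recipe as the spam filter: vectorize, then train Naive Bayes
tone_model = make_pipeline(CountVectorizer(), MultinomialNB())
tone_model.fit(sentences, moods)

# Step 4: predict the mood of a new sentence
prediction = tone_model.predict(["What a beautiful day to be outside!"])[0]
print(f"Prediction (1=Happy): {prediction}")

# Step 5: sanity-check accuracy on the training sentences
train_acc = accuracy_score(moods, tone_model.predict(sentences))
print(f"Training accuracy: {train_acc:.2f}")
```

With a dataset this tiny, training accuracy is not very meaningful; once you have more sentences, split them into train and test sets as in Lesson 7 before measuring accuracy.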

Quick Knowledge Check

  1. What does NLP stand for?
  2. What is the purpose of CountVectorizer?
  3. Why do machines need text to be converted into numbers?
  4. Which algorithm is a classic choice for text classification?

Key Takeaways

  • Text classification is one of the most common uses of AI.
  • Vectorization turns human language into machine-readable math.
  • Pipelines simplify the process of combining data prep and modeling.
  • Naive Bayes is a fast and effective starting point for any NLP project.

What’s Next?

We’ve built a filter, but what happens if our filter is biased against certain people or cultures? In Lesson 9, we’ll discuss the most important topic in modern tech: AI Ethics and Bias!
