Module 7 Lesson 8: Building a Spam Filter Project
AI in action. Build a fully functional SMS/Email spam filter using the Naive Bayes algorithm and learn how computers process human language.
In this lesson, we build a real-world application: a Spam Filter. This involves a sub-field of AI called Natural Language Processing (NLP). Since machines can't read words, we first have to turn our text into numbers using a technique called "Vectorization."
Project Overview
We will:
- Take a dataset of 5,000 SMS messages.
- Clean the text (lowercase, remove punctuation).
- Convert the text into a "Bag of Words" (Numbers).
- Train a Multinomial Naive Bayes model (a classic choice for text classification).
- Test the filter on our own messages.
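The cleaning step in the list above (lowercasing and stripping punctuation) can be sketched as a small helper. This is illustrative only; `clean_text` is a name invented here, and as we'll see, `CountVectorizer` handles lowercasing for us automatically.

```python
import string

def clean_text(message):
    """Lowercase a message and strip punctuation."""
    message = message.lower()
    return message.translate(str.maketrans("", "", string.punctuation))

print(clean_text("Win a FREE iPhone now!!!"))  # -> win a free iphone now
```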
1. The Secrets of NLP: CountVectorizer
To turn text into numbers, we count how many times each word appears.
- "Win money now" -> [Win: 1, Money: 1, Now: 1]
- "Hey buddy" -> [Win: 0, Money: 0, Hey: 1, Buddy: 1]
Scikit-Learn provides CountVectorizer to do this for us.
2. Implementing the Filter
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# 1. Simple data
emails = [
    "Win a free iPhone now!",                     # Spam
    "Are we still on for lunch?",                 # Ham (not spam)
    "URGENT: Your account has been compromised",  # Spam
    "Hey, did you see the game last night?",      # Ham
]
labels = [1, 0, 1, 0]  # 1 = Spam, 0 = Ham

# 2. Create a pipeline (vectorize, then train)
model = make_pipeline(CountVectorizer(), MultinomialNB())

# 3. Fit
model.fit(emails, labels)

# 4. Predict
test_email = ["You have won a lottery! Click here"]
print(f"Prediction (1=Spam): {model.predict(test_email)[0]}")
```
3. Why Naive Bayes?
Naive Bayes is a "Probabilistic" classifier. It calculates the probability that a word (like "FREE" or "WIN") appears in a Spam email vs. a Ham email. It's incredibly fast, requires very little data, and is still used by many email providers today!
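Because it is probabilistic, the model can tell us *how confident* it is, not just its verdict. Scikit-learn pipelines expose this through `predict_proba`; here is a minimal sketch reusing the four training messages from section 2:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "Win a free iPhone now!",                     # Spam
    "Are we still on for lunch?",                 # Ham
    "URGENT: Your account has been compromised",  # Spam
    "Hey, did you see the game last night?",      # Ham
]
labels = [1, 0, 1, 0]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

# Probabilities come back as [P(ham), P(spam)] for each message
proba = model.predict_proba(["Win a free iPhone now"])[0]
print(f"P(ham) = {proba[0]:.2f}, P(spam) = {proba[1]:.2f}")
```

With a message made entirely of "spammy" words, the spam probability dominates; a real filter would be trained on thousands of messages, so these toy probabilities are only illustrative.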
Practice Exercise: The Tone Analyzer
- Create a dataset of "Happy" vs. "Sad" sentences.
- Build a pipeline using CountVectorizer and MultinomialNB.
- Train it on your sentences.
- Predict the mood of a new sentence like: "What a beautiful day to be outside!"
- Check the accuracy using the metrics we learned in Lesson 7.
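One possible solution sketch for the exercise (the happy/sad sentences below are invented for illustration; yours can be anything you like):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy dataset: 1 = Happy, 0 = Sad
sentences = [
    "I love this sunny weather",      # Happy
    "What a wonderful surprise",      # Happy
    "I feel terrible and miserable",  # Sad
    "This rainy day makes me gloomy", # Sad
]
moods = [1, 1, 0, 0]

tone_model = make_pipeline(CountVectorizer(), MultinomialNB())
tone_model.fit(sentences, moods)

prediction = tone_model.predict(["What a beautiful day to be outside!"])[0]
print("Happy" if prediction == 1 else "Sad")
```

For the accuracy check from Lesson 7, split your sentences into training and test sets and compare predictions against the true labels.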
Quick Knowledge Check
- What does NLP stand for?
- What is the purpose of CountVectorizer?
- Why do machines need text to be converted into numbers?
- Which algorithm is a classic choice for text classification?
Key Takeaways
- Text classification is one of the most common uses of AI.
- Vectorization turns human language into machine-readable math.
- Pipelines simplify the process of combining data prep and modeling.
- Naive Bayes is a fast and effective starting point for any NLP project.
What’s Next?
We’ve built a filter, but what happens if our filter is biased against certain people or cultures? In Lesson 9, we’ll discuss the most important topic in modern tech: AI Ethics and Bias!