Module 6 Lesson 10: Exploratory Data Analysis (EDA) Project
The data detective's guide. Follow a step-by-step Exploratory Data Analysis (EDA) to find trends, handle outliers, and visualize insights in a real-world dataset.
You’ve learned the tools—NumPy for the math, Pandas for the tables, and Seaborn for the pictures. In this lesson, we bring them all together for an Exploratory Data Analysis (EDA). EDA is the "detective work" you do at the start of any data project to understand what the data is telling you.
The Objective
We will analyze a dataset of Global Movie Sales to answer three questions:
- Which movie genre makes the most money?
- Is there a relationship between the budget and the profit?
- How have movie ratings changed over the decades?
Step 1: Loading and Peeking
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("movies.csv")
print(df.head())
df.info()  # info() prints its summary itself and returns None, so no print() is needed
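Beyond head() and info(), a quick count of missing values per column is a useful part of the peeking phase. Here is a minimal sketch on a tiny made-up table standing in for movies.csv; the column names and every value in it are assumptions for illustration only, not real movie data.

```python
import pandas as pd

# Toy stand-in for movies.csv. Column names and values are invented for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genre": ["Action", "Drama", "Action", "Comedy"],
    "budget": [100.0, 20.0, None, 35.0],
    "gross": [250.0, 15.0, 80.0, None],
})

# How big is the table? (rows, columns)
print(toy.shape)

# Summary statistics for the numeric columns
print(toy.describe())

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Columns with a high missing count are the ones to think about first in the cleaning step.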
Step 2: Cleaning the Mess
Real data almost always has missing values and duplicate rows, so we deal with both before analyzing anything.
# Drop rows where we don't have a Budget or Gross revenue
df = df.dropna(subset=["budget", "gross"])
# Remove duplicates
df = df.drop_duplicates()
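It is worth checking how many rows a cleaning step actually removes, and dropping rows is not the only option. The sketch below uses a small made-up table (all values are assumptions) to count rows before and after cleaning, and shows filling a missing budget with the median as one possible alternative.

```python
import pandas as pd

# Toy data, invented for illustration: one duplicate row and two rows with gaps.
toy = pd.DataFrame({
    "title": ["A", "B", "B", "C", "D"],
    "budget": [100.0, 20.0, 20.0, None, 35.0],
    "gross": [250.0, 15.0, 15.0, 80.0, None],
})

# Option 1: drop incomplete rows, then drop exact duplicates
rows_before = len(toy)
cleaned = toy.dropna(subset=["budget", "gross"]).drop_duplicates()
rows_after = len(cleaned)
print(f"Dropped {rows_before - rows_after} of {rows_before} rows")

# Option 2: keep the rows and fill missing budgets with the median budget
filled_budget = toy["budget"].fillna(toy["budget"].median())
print(filled_budget)
```

Which option is appropriate depends on the question: filling values keeps more rows, but it also puts estimated numbers into the analysis.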
Step 3: Finding Insights (Grouping)
# What's the average profit per genre?
df["profit"] = df["gross"] - df["budget"]
genre_profit = df.groupby("genre")["profit"].mean().sort_values(ascending=False)
print(genre_profit)
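A mean can be pulled upward by a single blockbuster, so it can help to look at the median and the number of movies per genre alongside it. This is a minimal sketch on made-up numbers (the genres and profits below are assumptions, not results from the dataset):

```python
import pandas as pd

# Toy profits, invented for illustration. One Action movie is an extreme outlier.
toy = pd.DataFrame({
    "genre": ["Action", "Action", "Action", "Drama", "Drama"],
    "profit": [10.0, 20.0, 600.0, 30.0, 50.0],
})

# Compare mean, median, and group size for each genre
summary = (
    toy.groupby("genre")["profit"]
    .agg(["mean", "median", "count"])
    .sort_values("median", ascending=False)
)
print(summary)
```

In this toy table Action has the higher mean but Drama has the higher median, which is the kind of gap that points to outliers worth a closer look.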
Step 4: Visualizing the Story
sns.set_theme(style="whitegrid")
# 1. Bar plot for Genre Profit
plt.figure(figsize=(10, 6))
sns.barplot(x=genre_profit.index, y=genre_profit.values)
plt.xticks(rotation=45)
plt.title("Most Profitable Movie Genres")
plt.show()
# 2. Scatter plot for Budget vs Profit
sns.scatterplot(data=df, x="budget", y="profit", alpha=0.5)
plt.title("Does a bigger budget mean bigger profit?")
plt.show()
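A scatter plot shows the shape of a relationship; a correlation coefficient puts a rough number on it. The sketch below uses pandas' Series.corr (Pearson correlation by default) on made-up values, and also counts how many movies lost money. The numbers are assumptions for illustration only.

```python
import pandas as pd

# Toy budget/profit pairs, invented for illustration.
toy = pd.DataFrame({
    "budget": [10.0, 20.0, 30.0, 40.0, 50.0],
    "profit": [5.0, 15.0, -10.0, 40.0, 30.0],
})

# Pearson correlation: +1 is a perfect upward line, 0 is no linear relationship
r = toy["budget"].corr(toy["profit"])
print(f"Correlation between budget and profit: {r:.2f}")

# How many movies ended up with a loss?
n_losses = int((toy["profit"] < 0).sum())
print(f"Movies with a loss: {n_losses}")
```

A moderate positive correlation can still hide individual big-budget losses, so it is best read together with the plot rather than instead of it.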
Summary of Findings
- Insight 1: Animation and Action movies tend to have the highest average profit.
- Insight 2: Higher budgets generally go with higher profits, but many high-budget movies still end in massive losses (the "Busts").
- Insight 3: Movie ratings have become more "clumped" in recent years compared to the wild variability of the 1980s.
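The third question, how ratings changed over the decades, can be checked with one more grouping. This is a minimal sketch that assumes the dataset has "year" and "rating" columns (hypothetical names, not confirmed by the steps above) and uses made-up values; the standard deviation per decade measures how spread out or "clumped" the ratings are.

```python
import pandas as pd

# Toy data, invented for illustration. "year" and "rating" are assumed column names.
toy = pd.DataFrame({
    "year": [1982, 1985, 1989, 2011, 2014, 2018],
    "rating": [3.0, 9.0, 6.0, 6.5, 7.0, 6.0],
})

# Turn each year into its decade, e.g. 1985 -> 1980
toy["decade"] = (toy["year"] // 10) * 10

# Average rating and spread (standard deviation) per decade
by_decade = toy.groupby("decade")["rating"].agg(["mean", "std", "count"])
print(by_decade)
```

A smaller std in a decade means its ratings sit closer together. On real data, check the count column too, since a decade with only a few movies gives an unreliable spread.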
Practice Exercise: Your Own EDA
- Pick a dataset that interests you (Sports, Weather, Stocks, or Music).
- Perform a "Peeking" phase (info(), describe()).
- Fix at least one missing value.
- Create a Bar Chart comparing two categories.
- Write down 1 "Surprise" insight you found in the data.
Key Takeaways
- EDA is a cycle: Peek -> Clean -> Group -> Visualize.
- Calculating new columns (like Profit) often provides the best insights.
- Visualization is the best way to double-check your mathematical findings.
- Always document your insights as you find them!
What’s Next?
You’ve completed your first full analysis! In Lesson 11, we’ll look at Hands-on Projects that take these skills to the next level—including analyzing your own Spotify or Netflix data!