Module 6 Lesson 6: Data Cleaning and Manipulation
The secret to accurate analysis. Learn how to handle missing data, drop duplicates, and transform your columns using powerful Pandas techniques.
Module 6 Lesson 6: Data Cleaning and Manipulation
In the industry, they say: "Garbage In, Garbage Out." If your data is messy, your analysis will be wrong. Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. In this lesson, we’ll learn the Pandas toolkit for scrubbing your data clean.
Lesson Overview
In this lesson, we will cover:
- Handling Missing Values:
isna(),dropna(), andfillna(). - Removing Duplicates: Cleaning up repeated rows.
- Renaming Columns: Making your headers readable.
- Filtering Rows: Using conditions to keep only what matters.
1. Finding and Fixing Missing Data
Missing data (called NaN in Pandas) is the most common problem.
import pandas as pd
# 1. Identify where missing data is
print(df.isna().sum())
# 2. Option A: Drop the rows with missing values
clean_df = df.dropna()
# 3. Option B: Fill missing values with something else (like the average)
df["Age"] = df["Age"].fillna(df["Age"].mean())
2. Removing Duplicates
Sometimes data is entered twice by mistake.
# Check how many duplicates exist
print(df.duplicated().sum())
# Remove them
df = df.drop_duplicates()
3. Renaming and Transforming Columns
Headers like clmn_prc_1 are hard to read. Let's fix them!
# Rename specific columns
df = df.rename(columns={"clmn_prc_1": "Price", "usr_id": "User_ID"})
# Perform math on a whole column
df["Price_with_Tax"] = df["Price"] * 1.08
4. Why 80% of Data Science is Cleaning?
Data comes from many sources. One person might write "USA," another "U.S.A.", and another "United States." A sensor might stop working and record 0 or null. If you don't fix these inconsistencies, your charts will look like a mess.
Practice Exercise: The Messy Sales Sheet
- Create a DataFrame with columns:
Product,Price, andUnits_Sold. - Include at least one
NaN(usingnp.nanfrom the NumPy module) and one duplicate row. - Remove the duplicate row.
- Fill the missing
Pricevalue with a fixed value like9.99. - Rename
Units_SoldtoQuantity. - Add a new column
Total_Revenue(Price*Quantity).
Quick Knowledge Check
- What does
NaNstand for in Pandas? - What is the difference between
dropna()andfillna()? - Why would you want to use the
inplace=Trueparameter (Self-research tip!)? - How do you rename columns in a DataFrame?
Key Takeaways
isna()helps you find the holes in your data.- Deciding whether to drop or fill missing data depends on your goal.
drop_duplicates()ensures every row is unique.- Column transformations allow you to create new insights from raw data.
What’s Next?
Clean data is ready for the real fun! In Lesson 7, we’ll learn how to Group and Aggregate your data to find meaningful summaries, like "What was the total profit for each city?"