Module 6 Lesson 6: Data Cleaning and Manipulation

In the industry, they say: "Garbage In, Garbage Out." If your data is messy, your analysis will be wrong. Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. In this lesson, we’ll learn the Pandas toolkit for scrubbing your data clean.

Lesson Overview

In this lesson, we will cover:

Handling Missing Values: isna(), dropna(), and fillna().
Removing Duplicates: Cleaning up repeated rows.
Renaming Columns: Making your headers readable.
Filtering Rows: Using conditions to keep only what matters.

1. Finding and Fixing Missing Data

Missing data (called NaN in Pandas) is the most common problem.

import pandas as pd

# 1. Identify where missing data is
print(df.isna().sum())

# 2. Option A: Drop the rows with missing values
clean_df = df.dropna()

# 3. Option B: Fill missing values with something else (like the average)
df["Age"] = df["Age"].fillna(df["Age"].mean())

2. Removing Duplicates

Sometimes data is entered twice by mistake.

# Check how many duplicates exist
print(df.duplicated().sum())

# Remove them
df = df.drop_duplicates()

3. Renaming and Transforming Columns

Headers like clmn_prc_1 are hard to read. Let's fix them!

# Rename specific columns
df = df.rename(columns={"clmn_prc_1": "Price", "usr_id": "User_ID"})

# Perform math on a whole column
df["Price_with_Tax"] = df["Price"] * 1.08

4. Why 80% of Data Science is Cleaning?

Data comes from many sources. One person might write "USA," another "U.S.A.", and another "United States." A sensor might stop working and record 0 or null. If you don't fix these inconsistencies, your charts will look like a mess.

Practice Exercise: The Messy Sales Sheet

Create a DataFrame with columns: Product, Price, and Units_Sold.
Include at least one NaN (using np.nan from the NumPy module) and one duplicate row.
Remove the duplicate row.
Fill the missing Price value with a fixed value like 9.99.
Rename Units_Sold to Quantity.
Add a new column Total_Revenue (Price * Quantity).

Quick Knowledge Check

What does NaN stand for in Pandas?
What is the difference between dropna() and fillna()?
Why would you want to use the inplace=True parameter (Self-research tip!)?
How do you rename columns in a DataFrame?

Key Takeaways

isna() helps you find the holes in your data.
Deciding whether to drop or fill missing data depends on your goal.
drop_duplicates() ensures every row is unique.
Column transformations allow you to create new insights from raw data.

What’s Next?

Clean data is ready for the real fun! In Lesson 7, we’ll learn how to Group and Aggregate your data to find meaningful summaries, like "What was the total profit for each city?"