Module 6 Lesson 7: Grouping and Aggregation
Summarize your data. Learn how to use 'groupby' to find totals, averages, and counts across different categories in your dataset.
Module 6 Lesson 7: Grouping and Aggregation
Data is often grouped into categories. You might have a list of sales across 10 different countries. To find out which country is making the most profit, you need to Group the data by country and Aggregate (calculate) the total profit. This is where the groupby() method comes in.
Lesson Overview
In this lesson, we will cover:
- The "Split-Apply-Combine" Strategy: The philosophy of grouping.
- The
groupby()Method: Categorizing your data. - Aggregation Functions:
sum(),mean(),count(). - Multi-level Grouping: Grouping by more than one column.
1. Split, Apply, Combine
When you run a groupby, Pandas does three things:
- Split: It breaks the table into smaller tables (one for each group).
- Apply: It runs a math function (like
sum) on each small table. - Combine: It glues the results back together into a new summary table.
2. Using groupby()
Let's say you have a table of employees.
import pandas as pd
data = {
"Name": ["Alex", "Sara", "Tom", "May"],
"Dept": ["Sales", "Tech", "Sales", "Tech"],
"Salary": [50000, 70000, 45000, 80000]
}
df = pd.DataFrame(data)
# Group by 'Dept' and find the average salary
summary = df.groupby("Dept")["Salary"].mean()
print(summary)
3. Multiple Aggregations
Sometimes you want the average, the max, and the total all at once. We use the .agg() method for this.
results = df.groupby("Dept")["Salary"].agg(["mean", "max", "sum", "count"])
print(results)
4. Resetting the Index
When you group data, the "grouped" column becomes the new index. To turn it back into a regular table, we use reset_index().
clean_summary = summary.reset_index()
print(clean_summary)
Practice Exercise: The Store Statistics
- Create a DataFrame of 10 products with columns:
Category,Product_Name, andPrice. - Group the data by
Category. - Calculate the Average Price and the Total Count of products in each category.
- Print the category that has the most expensive average price.
Quick Knowledge Check
- What are the three steps of the "Split-Apply-Combine" strategy?
- Which function do you use to group data by a specific column?
- How do you calculate multiple statistics (like mean and sum) at once?
- What does
reset_index()do to a grouped DataFrame?
Key Takeaways
groupby()is how you summarize data by category.- You must provide an aggregation function (like
sumormean) or Pandas won't know what to do with the groups. - Aggregation handles thousands of rows and turns them into 5-10 meaningful summary rows.
- This is the same logic used in SQL
GROUP BYand Excel Pivot Tables.
What’s Next?
Summary tables are great, but the human brain understands pictures better than numbers. In Lesson 8, we’ll start Data Visualization and learn how to turn these summary tables into beautiful Charts and Graphs!