Module 6 Lesson 7: Grouping and Aggregation
·Data Science

Module 6 Lesson 7: Grouping and Aggregation

Summarize your data. Learn how to use 'groupby' to find totals, averages, and counts across different categories in your dataset.

Module 6 Lesson 7: Grouping and Aggregation

Data is often grouped into categories. You might have a list of sales across 10 different countries. To find out which country is making the most profit, you need to Group the data by country and Aggregate (calculate) the total profit. This is where the groupby() method comes in.

Lesson Overview

In this lesson, we will cover:

  • The "Split-Apply-Combine" Strategy: The philosophy of grouping.
  • The groupby() Method: Categorizing your data.
  • Aggregation Functions: sum(), mean(), count().
  • Multi-level Grouping: Grouping by more than one column.

1. Split, Apply, Combine

When you run a groupby, Pandas does three things:

  1. Split: It breaks the table into smaller tables (one for each group).
  2. Apply: It runs a math function (like sum) on each small table.
  3. Combine: It glues the results back together into a new summary table.

2. Using groupby()

Let's say you have a table of employees.

import pandas as pd

data = {
    "Name": ["Alex", "Sara", "Tom", "May"],
    "Dept": ["Sales", "Tech", "Sales", "Tech"],
    "Salary": [50000, 70000, 45000, 80000]
}
df = pd.DataFrame(data)

# Group by 'Dept' and find the average salary
summary = df.groupby("Dept")["Salary"].mean()
print(summary)

3. Multiple Aggregations

Sometimes you want the average, the max, and the total all at once. We use the .agg() method for this.

results = df.groupby("Dept")["Salary"].agg(["mean", "max", "sum", "count"])
print(results)

4. Resetting the Index

When you group data, the "grouped" column becomes the new index. To turn it back into a regular table, we use reset_index().

clean_summary = summary.reset_index()
print(clean_summary)

Practice Exercise: The Store Statistics

  1. Create a DataFrame of 10 products with columns: Category, Product_Name, and Price.
  2. Group the data by Category.
  3. Calculate the Average Price and the Total Count of products in each category.
  4. Print the category that has the most expensive average price.

Quick Knowledge Check

  1. What are the three steps of the "Split-Apply-Combine" strategy?
  2. Which function do you use to group data by a specific column?
  3. How do you calculate multiple statistics (like mean and sum) at once?
  4. What does reset_index() do to a grouped DataFrame?

Key Takeaways

  • groupby() is how you summarize data by category.
  • You must provide an aggregation function (like sum or mean) or Pandas won't know what to do with the groups.
  • Aggregation handles thousands of rows and turns them into 5-10 meaningful summary rows.
  • This is the same logic used in SQL GROUP BY and Excel Pivot Tables.

What’s Next?

Summary tables are great, but the human brain understands pictures better than numbers. In Lesson 8, we’ll start Data Visualization and learn how to turn these summary tables into beautiful Charts and Graphs!

Subscribe to our newsletter

Get the latest posts delivered right to your inbox.

Subscribe on LinkedIn