Module 6 Lesson 5: Loading Data with Pandas
Unlock your data. Learn how to import real-world information from CSV, Excel, and JSON files into your Python scripts for analysis.
In the last lesson, we hand-typed our data into dictionaries. In the real world, you’ll be dealing with thousands of rows stored in external files. Pandas makes "ingesting" this data incredibly easy with its read_* functions.
Lesson Overview
In this lesson, we will cover:
- The CSV Loader: pd.read_csv().
- Excel and JSON Support: Handling other popular formats.
- Data Exploration: info(), shape, and columns.
- Handling Large Files: Memory considerations.
1. The CSV Specialist
CSV is the bread-and-butter of data science. Pandas handles it much faster and more intelligently than our Module 5 code.
import pandas as pd
# Load a local file
df = pd.read_csv("world_population.csv")
# You can even load data directly from a URL!
url = "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv"
gdp_data = pd.read_csv(url)
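The overview lists memory considerations for large files. The sketch below shows three read_csv() parameters that can help: usecols, nrows, and chunksize. The sample data, column names, and chunk size are made up for illustration, and io.StringIO stands in for a real file so the example runs without one.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a large CSV file on disk.
csv_text = """country,year,population
Aland,2020,100
Aland,2021,110
Borduria,2020,200
Borduria,2021,220
"""

# Load only the columns you need, and only the first few rows, to save memory.
preview = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["country", "population"],
    nrows=2,
)
print(preview)

# Process the data in chunks of 2 rows instead of loading it all at once.
total_population = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_population += int(chunk["population"].sum())

print(total_population)
```

With a real file you would pass its path instead of io.StringIO(csv_text). A chunk size in the tens or hundreds of thousands of rows is a more typical starting point than 2.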
2. Reading Excel and JSON
Pandas can also handle Microsoft Excel files (.xlsx) and JSON files effortlessly.
# Install 'openpyxl' first if you use Excel files
# df_excel = pd.read_excel("sales_report.xlsx", sheet_name="2024")
df_json = pd.read_json("user_activity.json")
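If you don't have a JSON file handy, you can try read_json() on in-memory text. The sketch below assumes the JSON is a list of records (one object per row). The field names and values are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical JSON content: a list of records, one object per row.
json_text = '[{"user": "ana", "clicks": 12}, {"user": "ben", "clicks": 7}]'

# Wrap the text in StringIO so pandas treats it like an open file.
df_json = pd.read_json(io.StringIO(json_text))

print(df_json)
print(df_json.shape)
```

Each key in the records becomes a column, and each object becomes a row. JSON that is nested more deeply usually needs extra work before it fits a flat table.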
3. Investigating Your New DataFrame
Once you've loaded a file, the first thing you should always do is "inspect" it to see what you're working with.
# 1. See how many rows and columns (Rows, Cols)
print(df.shape)
# 2. See the column names
print(df.columns)
# 3. See the data types and missing values
df.info()  # info() prints its summary itself and returns None, so no print() is needed
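To see these inspection steps on data with gaps, here is a small self-contained sketch. The product table is hypothetical. It has two blank cells so that the missing-value counts are easy to spot.

```python
import io

import pandas as pd

# Hypothetical sample data with two blank cells (missing values).
csv_text = """name,price,stock
Widget,9.99,10
Gadget,,3
Gizmo,4.50,
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column names as a plain list
df.info()                # dtypes and non-null counts; prints directly
print(df.head(2))        # first 2 rows

# Count missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

A column whose non-null count in info() is lower than the total number of rows contains missing values. isna().sum() reports the same thing as a count per column.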
4. Why Use Pandas Over Standard Files?
- Automatic Headers: By default, Pandas treats the first row of a CSV file as the column names.
- Type Detection: It infers a type for each column from the values, so "Price" is usually read as a number and "Name" as a string. The guess can be wrong, so check it with info().
- Speed: Its default CSV parser is written in C, so it typically reads files much faster than a hand-written Python loop.
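Header and type detection can be checked on a tiny example. The sketch below uses made-up data with a hypothetical "Zip" column to show a case where the inferred type is not the one you want, and how the dtype parameter of read_csv() overrides it.

```python
import io

import pandas as pd

# Hypothetical sample data; the Zip values have leading zeros worth keeping.
csv_text = """Name,Price,Zip
Widget,9.99,02134
Gadget,4.50,00501
"""

# Default behaviour: first row becomes the header, types are inferred.
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.dtypes)
print(inferred["Zip"].tolist())  # read as integers, leading zeros are lost

# Override the guess for one column by asking for strings.
explicit = pd.read_csv(io.StringIO(csv_text), dtype={"Zip": str})
print(explicit["Zip"].tolist())
```

Inference works well for ordinary numeric and text columns. Identifiers that only look like numbers, such as postal codes or account numbers, are a common case where an explicit dtype is safer.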
Practice Exercise: The Dataset Explorer
- Find a small CSV file online (e.g., from Kaggle or any public data source).
- Load it into a Pandas DataFrame.
- Print the shape of the data.
- Print the info() summary.
- Use head(10) to see the first 10 rows.
- Identify one column that has missing values (look at the "Non-Null Count" in info()).
Quick Knowledge Check
- What function is used to load a CSV file in Pandas?
- How do you see the total number of rows and columns in a DataFrame?
- Which method gives you the data types of every column?
- True or False: Pandas can load data from a web URL.
Key Takeaways
- pd.read_csv() is one of the most common ways to get tabular data into Python.
- The info() method is a good first step in any data analysis.
- Pandas handles the "heavy lifting" of file reading and type conversion for you.
- You can load multiple files and formats into the same script.
What’s Next?
Real-world data is messy. It has missing numbers, misspelled words, and weird symbols. In Lesson 6, we’ll learn the art of Data Cleaning—fixing those mistakes before they ruin our analysis!