Capstone Project: The Automated Data Scientist
The Grand Finale. Apply everything from Modules 1-7 to build a fully automated system that cleans raw data, performs analysis, and chooses the best AI model.
Welcome to the Capstone Project. This is not a lesson; it is a challenge. You will build a system that acts like a junior data scientist: it takes a raw, messy dataset and runs the entire pipeline automatically.
1. Project Requirements
Your "Automated Data Scientist" must perform the following steps:
- Ingestion: Load any CSV file passed to it.
- Cleaning: Automatically identify and fill missing values.
- Analysis: Print basic stats and a Correlation Heatmap (Seaborn).
- Modeling: Train both a Linear Regression and a Random Forest model.
- Comparison: Print which model performed better using the metrics from Module 7.
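As one possible shape for the Analysis step, here is a minimal sketch that prints summary statistics and saves a correlation heatmap. The function name `analyze`, the default file name, and the choice of the non-interactive `Agg` backend are assumptions made for illustration, not requirements of the project.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render to files so no display is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def analyze(df: pd.DataFrame, out_path: str = "correlation_heatmap.png") -> pd.DataFrame:
    """Print basic stats and save a correlation heatmap of the numeric columns."""
    print(df.describe())

    # Correlation is only defined for numeric columns, so text columns are dropped here
    corr = df.select_dtypes(include="number").corr()

    fig, ax = plt.subplots(figsize=(6, 5))
    sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)  # free the figure so repeated calls do not accumulate memory
    return corr
```

Returning the correlation matrix as well as saving the image lets other parts of the system reuse the numbers without re-reading the chart.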
2. Architecture: The Modular Approach
To build this professional-level project, you should use the Object-Oriented skills from Module 4 and the Error Management from Module 5.
Recommended Structure:
- data_handler.py: A class for loading and cleaning.
- analysis_engine.py: Functions for math and plotting.
- model_builder.py: A class to handle the Scikit-Learn patterns.
- main.py: The entry point that ties everything together.
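To make the role of the entry point concrete, the following is a rough sketch of how main.py might read its inputs. The `--target` option (the column to predict) is an assumption, since the requirements do not say how the target is chosen, and the calls into the other modules are left as comments rather than invented.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Read the CSV path and the (hypothetical) target column from the command line."""
    parser = argparse.ArgumentParser(description="Automated Data Scientist")
    parser.add_argument("csv_path", help="path to the raw CSV file")
    parser.add_argument("--target", required=True, help="name of the column to predict")
    return parser.parse_args(argv)


def main(argv=None) -> None:
    args = parse_args(argv)
    # In the full project, this is where main.py would hand off to the other modules:
    #   data_handler.py    -> load and clean the CSV
    #   analysis_engine.py -> stats and heatmap
    #   model_builder.py   -> train and compare the models
    print(f"Would process {args.csv_path} and predict '{args.target}'")
```

In a real main.py, `main()` would be called under the usual `if __name__ == "__main__":` guard, so a run might look like `python main.py housing.csv --target price` (file and column names are made up).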
3. High-Level Logic Walkthrough
Step A: The Cleaner
Build a function that looks at every column. If it's a number, fill NaN with the mean. If it's text, fill NaN with "Unknown".
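A minimal sketch of that rule with pandas is shown below. The function name `clean` is illustrative, and the sketch assumes every non-numeric column holds plain text; datetime or categorical columns would need their own handling.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with NaN filled: column mean for numbers, "Unknown" for text."""
    cleaned = df.copy()  # leave the caller's DataFrame untouched
    for col in cleaned.columns:
        if is_numeric_dtype(cleaned[col]):
            cleaned[col] = cleaned[col].fillna(cleaned[col].mean())
        else:
            # Assumption: anything that is not numeric is treated as text
            cleaned[col] = cleaned[col].fillna("Unknown")
    return cleaned
```

Working on a copy keeps the raw data available, which is handy if you later want to report how many values were filled in.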
Step B: The Multi-Trainer
Create a loop that takes a list of models:
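    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestRegressor

    # X_train, X_test, y_train, y_test come from an earlier train/test split
    models = [LinearRegression(), RandomForestRegressor()]
    for m in models:
        m.fit(X_train, y_train)
        score = m.score(X_test, y_test)  # R^2 on the held-out test set
        print(f"{type(m).__name__} Score: {score}")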
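The loop prints one score per model, but the Comparison requirement also asks the system to say which model won. Below is a hedged sketch of one way to do that. It assumes R^2 and mean squared error are suitable stand-ins for the Module 7 metrics, uses a hypothetical helper named `compare_models`, and substitutes synthetic data from `make_regression` for the cleaned dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def compare_models(X, y, random_state=0):
    """Train both models and return (best_name, results), ranked by test-set R^2."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    models = [LinearRegression(), RandomForestRegressor(random_state=random_state)]
    results = {}
    for m in models:
        m.fit(X_train, y_train)
        pred = m.predict(X_test)
        results[type(m).__name__] = {
            "r2": r2_score(y_test, pred),
            "mse": mean_squared_error(y_test, pred),
        }
    # Higher R^2 is better, so the winner is the model with the maximum R^2
    best = max(results, key=lambda name: results[name]["r2"])
    return best, results


# Synthetic data stands in for the cleaned dataset (an assumption for this sketch)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
best, results = compare_models(X, y)
for name, metrics in results.items():
    print(f"{name}: R^2={metrics['r2']:.3f}  MSE={metrics['mse']:.1f}")
print(f"Best model: {best}")
```

Fixing `random_state` makes the split and the forest reproducible, so two runs on the same CSV report the same winner.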
4. Final Submission
A complete project should include:
- The Source Code: Well-commented Python files.
- A README: Explaining how to run the system and what libraries (pip install) are needed.
- Sample Output: A screenshot or text file showing the analysis and model comparison results.
Advice for Success
- Fail Gracefully: Use try-except blocks. If the CSV is broken, don't crash; tell the user why.
- Visualize: A picture is worth a thousand stats. Make sure your "Automated Scientist" saves at least one PNG chart.
- Keep it Modular: Don't write everything in one file. Break it down into components.
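As an illustration of failing gracefully, the sketch below wraps CSV loading in try-except and explains what went wrong instead of showing a traceback. The function name `load_csv` and the wording of the messages are assumptions. The exceptions caught are the ones `pd.read_csv` raises for a missing file, an empty file, and malformed rows.

```python
import pandas as pd


def load_csv(path: str):
    """Return a DataFrame, or None (after printing the reason) if loading fails."""
    try:
        return pd.read_csv(path)
    except FileNotFoundError:
        print(f"Could not find '{path}'. Check the file path.")
    except pd.errors.EmptyDataError:
        print(f"'{path}' is empty, so there is nothing to analyze.")
    except pd.errors.ParserError as exc:
        print(f"'{path}' is not a valid CSV: {exc}")
    return None
```

Returning `None` lets main.py stop early with a clear message. Raising a custom exception would be an equally reasonable design if you prefer the caller to decide how to report the failure.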
Course Wrap-up
You have completed the entire "Python from Basics to AI" curriculum. You now have a portfolio-ready project that proves you can:
- Write clean, professional Python code.
- Handle and clean real-world data at scale.
- Build and evaluate predictive AI models.
The future is yours. What will you build next?