
Source Control: Notebooks & Git
Notebooks are notoriously hard to version control. Learn patterns for nbdime, stripping outputs, and refactoring to Python scripts.
The "Refactor" Step
A common exam scenario: "Data Scientists are committing .ipynb files with large output plots to Git. The repo is huge and diffs are unreadable."
1. The Notebook Problem
`.ipynb` files are JSON documents that change every time you run them: execution counts increment and outputs (including large base64-encoded plots) are embedded inline, which bloats the repo and makes diffs unreadable. Best Practices:
- Clear Outputs: Before committing, run "Cell > All Output > Clear".
- Jupytext: A tool that syncs the `.ipynb` to a clean `.py` file automatically. You commit the `.py` file.
- Vertex AI Integration: Workbench has a Git extension built in, which supports visual diffing of notebooks.
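Clearing outputs can also be automated (this is essentially what tools like nbstripout do in a pre-commit hook). A minimal sketch using only the standard library, operating on the parsed `.ipynb` JSON — the tiny in-memory notebook is a made-up stand-in for a real file:

```python
import json

def strip_outputs(nb: dict) -> dict:
    """Blank out outputs and execution counts in every code cell of a
    parsed .ipynb document, so the committed JSON is run-independent."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb

if __name__ == "__main__":
    # Hypothetical notebook content standing in for experiment.ipynb.
    nb = {
        "nbformat": 4,
        "cells": [
            {"cell_type": "code", "execution_count": 7,
             "source": "1 + 1",
             "outputs": [{"output_type": "execute_result"}]},
            {"cell_type": "markdown", "source": "# Notes"},
        ],
    }
    cleaned = strip_outputs(nb)
    print(json.dumps(cleaned["cells"][0]["outputs"]))  # prints []
```

In a real repo you would read the file with `json.load`, strip it, and write it back before `git add` — or simply install nbstripout so this happens on every commit.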
2. Refactoring to Modules
Rule: Never deploy a notebook to production.
- Prototype: `experiment.ipynb`
- Extract: Move function `clean_data()` to `src/preprocessing.py`.
- Import: Update the notebook to `from src.preprocessing import clean_data`.
- Test: Write a unit test for `src/preprocessing.py`.
- Commit: Commit the `.py` files.
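The extract-and-test steps can be sketched in one self-contained file. The body of `clean_data()` here (dropping records with missing values, then de-duplicating) is a hypothetical stand-in for whatever logic the notebook actually prototyped:

```python
# src/preprocessing.py (sketched inline; the cleaning logic is an
# illustrative assumption, not the notebook's real code)
def clean_data(records: list[dict]) -> list[dict]:
    """Drop records containing None values, then de-duplicate,
    preserving the original order."""
    seen = set()
    cleaned = []
    for rec in records:
        if any(v is None for v in rec.values()):
            continue  # incomplete record: drop it
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            cleaned.append(rec)
    return cleaned

# tests/test_preprocessing.py (sketched inline)
def test_clean_data_drops_nulls_and_duplicates():
    raw = [
        {"id": 1, "x": 2.0},
        {"id": 1, "x": 2.0},   # exact duplicate
        {"id": 2, "x": None},  # missing value
    ]
    assert clean_data(raw) == [{"id": 1, "x": 2.0}]

if __name__ == "__main__":
    test_clean_data_drops_nulls_and_duplicates()
    print("ok")  # prints ok
```

Once the function lives in `src/preprocessing.py`, the notebook becomes a thin consumer (`from src.preprocessing import clean_data`), and only the tested `.py` files need to be committed and deployed.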
Knowledge Check
Why is it considered bad practice to commit notebooks with execution outputs to a Git repository in an MLOps workflow?