Source Control: Notebooks & Git

Notebooks are notoriously hard to version control. Learn patterns for diffing with nbdime, stripping outputs before commit, and refactoring notebook code into Python modules.

The "Refactor" Step

A common exam scenario: "Data Scientists are committing .ipynb files with large output plots to Git. The repo is huge and diffs are unreadable."


1. The Notebook Problem

Notebook files (.ipynb) are JSON documents that change every time you run them: execution counts increment and outputs get re-embedded, even when the code itself is unchanged. This bloats the repository and makes diffs unreadable. Best practices:

  1. Clear Outputs: Before committing, run "Cell > All Output > Clear".
  2. Jupytext: A tool that automatically syncs the .ipynb with a clean, paired .py file. You commit the .py file instead of (or alongside) the notebook.
  3. Vertex AI Integration: Workbench has a Git extension built in, which supports visual diffing of notebooks.
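The clear-outputs step can be sketched by manipulating the notebook JSON directly. This is illustrative only; in practice, tools like nbstripout or `jupyter nbconvert --clear-output` do this for you, often wired in as a pre-commit hook. The notebook fragment below is a made-up example:

```python
import json

def strip_outputs(nb: dict) -> dict:
    """Clear outputs and execution counts from a notebook's JSON dict.

    Committing the stripped JSON keeps diffs stable: re-running cells
    no longer changes the file.
    """
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop embedded plots/text
            cell["execution_count"] = None  # reset the run counter
    return nb

# A minimal notebook fragment as it might appear on disk:
raw = json.dumps({
    "cells": [
        {"cell_type": "code", "execution_count": 7,
         "source": ["print('hi')"], "outputs": [{"text": "hi\n"}]},
        {"cell_type": "markdown", "source": ["# Notes"]},
    ],
    "nbformat": 4, "nbformat_minor": 5,
})

clean = strip_outputs(json.loads(raw))
```

Markdown cells are untouched; only code cells carry outputs and execution counts in the notebook format.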

2. Refactoring to Modules

Rule: Never deploy a notebook to production.

  1. Prototype: experiment.ipynb
  2. Extract: Move function clean_data() to src/preprocessing.py.
  3. Import: Update notebook to from src.preprocessing import clean_data.
  4. Test: Write a unit test for src/preprocessing.py.
  5. Commit: Commit the .py files.
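The extract-and-test steps above can be sketched as follows. The module, function, and test names mirror the hypothetical `clean_data()` example from the list; the actual cleaning logic is an assumption for illustration:

```python
# src/preprocessing.py -- logic extracted from experiment.ipynb
def clean_data(rows: list[dict]) -> list[dict]:
    """Drop records containing missing values; lower-case the keys."""
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # skip incomplete records
        cleaned.append({key.lower(): value for key, value in row.items()})
    return cleaned


# tests/test_preprocessing.py -- unit test committed alongside the module
def test_clean_data_drops_missing():
    rows = [{"Age": 30, "City": "NYC"}, {"Age": None, "City": "LA"}]
    assert clean_data(rows) == [{"age": 30, "city": "NYC"}]
```

Once the function lives in `src/preprocessing.py`, the notebook shrinks to `from src.preprocessing import clean_data` plus exploration, and only the tested .py files ship to production.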

Knowledge Check

Why is it considered bad practice to commit notebooks with execution outputs to a Git repository in an MLOps workflow?
