Troubleshooting Training: Common Failures


Why did my job fail? Debugging OOM errors, NaN losses, and 'Permission Denied'.

The "Red Logs"

The exam loves troubleshooting questions: you see an error message and must identify the root cause.


1. Out of Memory (OOM)

  • Symptom: ResourceExhaustedError: OOM when allocating tensor...
  • Cause: The batch size is too large for the GPU VRAM.
  • Fix:
    1. Decrease Batch Size.
    2. Use Gradient Accumulation to simulate a larger effective batch (see the sketch after this list).
    3. Upgrade GPU (T4 -> A100).
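
If shrinking the batch alone hurts convergence, gradient accumulation trades compute for memory by summing gradients over several micro-batches before applying them. Below is a minimal TensorFlow sketch; the toy model, loss, and random placeholder data stand in for your own:

```python
import tensorflow as tf

# Minimal gradient-accumulation sketch: 4 micro-batches of 8 examples
# approximate one batch of 32 without holding 32 examples in VRAM at once.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
model.build(input_shape=(None, 20))  # build so trainable_variables exist
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

ACCUM_STEPS = 4
accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]

# Placeholder dataset: 128 random examples, micro-batches of 8.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([128, 20]),
     tf.random.uniform([128], maxval=10, dtype=tf.int32))
).batch(8)

for step, (x, y) in enumerate(dataset):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        # Scale the loss so the summed gradients match the large-batch average.
        loss = loss_fn(y, logits) / ACCUM_STEPS
    grads = tape.gradient(loss, model.trainable_variables)
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]

    # Apply once every ACCUM_STEPS micro-batches, then reset the buffers.
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
        accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]
```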

2. Loss is NaN (Not a Number)

  • Symptom: Loss: NaN after Step 100.
  • Cause: Gradient explosion. Gradient values grow until they overflow float32, and the resulting Inf/NaN propagates into the loss.
  • Fix:
    1. Clip Gradients (clipnorm=1.0; see the sketch after this list).
    2. Decrease Learning Rate.
    3. Check for dirty data (e.g., dividing by zero).
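
In Keras, both clipping and a lower learning rate can be set directly on the optimizer. A short sketch, assuming a simple model compiled with Adam:

```python
import tensorflow as tf

# Optional: raise an error at the first op that produces NaN/Inf, which
# helps track down dirty data (call once, near the start of the program).
tf.debugging.enable_check_numerics()

# Clip the global gradient norm to 1.0 and lower the learning rate.
optimizer = tf.keras.optimizers.Adam(
    learning_rate=1e-4,  # e.g. reduced from 1e-3
    clipnorm=1.0,        # rescale any gradient whose norm exceeds 1.0
)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer=optimizer, loss="mse")
```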

3. Permission Denied (403)

  • Symptom: google.api_core.exceptions.PermissionDenied: 403 Access Not Configured
  • Cause: The Service Account running the training job does not have permission to read the GCS bucket or write logs.
  • Fix: Grant the Storage Object Admin role (roles/storage.objectAdmin) on the bucket to the Compute Engine default service account (or the custom service account the job runs as).
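
To confirm what the job's identity can actually do, you can run a quick diagnostic with the google-cloud-storage client while authenticated as that service account; the bucket name below is a placeholder:

```python
from google.cloud import storage

# Diagnostic sketch: run this authenticated as the SAME service account the
# training job uses. Missing permissions here mean the role grant is absent.
client = storage.Client()
bucket = client.bucket("my-training-data-bucket")  # placeholder bucket name

needed = ["storage.objects.get", "storage.objects.list", "storage.objects.create"]
granted = bucket.test_iam_permissions(needed)
print("Granted:", granted)
print("Missing:", sorted(set(needed) - set(granted)))
```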

4. Slow Training (Starvation)

  • Symptom: Training takes 1 week instead of 1 day. GPU usage is low.
  • Cause: Input pipeline bottleneck. The CPU cannot prepare batches as fast as the GPU consumes them, so the accelerator sits idle waiting for data.
  • Fix: Use tf.data.Dataset.prefetch() and cache() (see the sketch below).
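
A typical tf.data pipeline that keeps the accelerator fed looks like the sketch below; the TFRecord path, feature spec, and image size are placeholders:

```python
import tensorflow as tf

# Input-pipeline sketch: parallel parsing, cache after the expensive decode,
# and prefetch so the next batch is prepared while the current one trains.
def parse_example(serialized):
    features = tf.io.parse_single_example(
        serialized,
        {"image": tf.io.FixedLenFeature([], tf.string),
         "label": tf.io.FixedLenFeature([], tf.int64)},
    )
    image = tf.io.decode_jpeg(features["image"], channels=3)
    return tf.image.resize(image, [224, 224]), features["label"]

dataset = (
    tf.data.TFRecordDataset(tf.io.gfile.glob("gs://my-bucket/train-*.tfrecord"))
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()                     # keep parsed examples after the first epoch
    .shuffle(10_000)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)  # overlap preprocessing with model execution
)
```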

Knowledge Check


Your training job fails immediately with `PermissionDenied`. You verify that YOU (your user account) have Owner access to the Project. Why does it still fail?
