
Troubleshooting Training: Common Failures
Why did my job fail? Debugging OOM errors, NaN losses, and 'Permission Denied'.
The "Red Logs"
The exam loves troubleshooting questions: you are shown an error message and asked to identify the root cause.
1. Out of Memory (OOM)
- Symptom: `ResourceExhaustedError: OOM when allocating tensor...`
- Cause: The batch size is too large for the GPU's VRAM.
- Fix:
  - Decrease the batch size.
  - Use gradient accumulation to simulate a larger effective batch (see the sketch after this list).
  - Upgrade the GPU (e.g., T4 -> A100).
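A rough sketch of gradient accumulation in a custom TensorFlow training loop, assuming a toy model and synthetic data (the model, dataset, and `ACCUM_STEPS` are illustrative placeholders, not part of the original post):

```python
import tensorflow as tf

# Sketch of gradient accumulation: run several small "micro-batches",
# sum their gradients, and apply the optimizer once, simulating a larger
# effective batch size without the VRAM cost.
ACCUM_STEPS = 4  # effective batch size = micro-batch size * ACCUM_STEPS

model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(20,))])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Tiny synthetic dataset of micro-batches that fit in memory (placeholder data).
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([64, 20]),
     tf.random.uniform([64], maxval=10, dtype=tf.int32))
).batch(8)

accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]

for step, (x, y) in enumerate(dataset):
    with tf.GradientTape() as tape:
        # Scale the loss so the summed gradients average over ACCUM_STEPS.
        loss = loss_fn(y, model(x, training=True)) / ACCUM_STEPS
    grads = tape.gradient(loss, model.trainable_variables)
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
        accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]
```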
2. Loss is NaN (Not a Number)
- Symptom: `Loss: NaN` after step 100.
- Cause: Gradient explosion. The values became too large to represent in float32.
- Fix:
  - Clip gradients (`clipnorm=1.0`; see the sketch after this list).
  - Decrease the learning rate.
  - Check for dirty data (e.g., division by zero).
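A minimal sketch of gradient clipping via a Keras optimizer; the model and learning rate below are illustrative assumptions:

```python
import tensorflow as tf

# clipnorm caps each gradient's L2 norm at 1.0, so a single exploding
# gradient cannot blow the weights (and the loss) up to NaN.
# The model and learning rate are placeholders, not from the original post.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, clipnorm=1.0)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
model.compile(optimizer=optimizer, loss="mse")
```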
3. Permission Denied (403)
- Symptom: `google.api_core.exceptions.PermissionDenied: 403 Access Not Configured`
- Cause: The service account running the training job does not have permission to read the GCS bucket or write logs.
- Fix: Grant `Storage Object Admin` on the bucket to the Compute Engine default service account (or the custom service account the job runs as); see the sketch below.
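The fix has two sides: grant the role to the service account in IAM, and make sure the job actually runs as that account. A hedged sketch with the Vertex AI Python SDK follows; the project, bucket, service-account email, script, and container image are all placeholders:

```python
from google.cloud import aiplatform

# Sketch only: project, bucket, service-account email, script and container
# image are placeholders. The key point is that the job runs as
# `service_account`, so THAT identity (not your user account) needs
# Storage access on the bucket.
aiplatform.init(project="my-project", staging_bucket="gs://my-staging-bucket")

job = aiplatform.CustomTrainingJob(
    display_name="demo-training-job",
    script_path="train.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-12:latest",
)

job.run(
    service_account="trainer-sa@my-project.iam.gserviceaccount.com",
    replica_count=1,
)
```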
4. Slow Training (Starvation)
- Symptom: Training takes a week instead of a day, and GPU utilization stays low.
- Cause: Input pipeline bottleneck; the GPU sits idle waiting for data.
- Fix: Use `tf.data.Dataset.prefetch()` and `cache()` (see the sketch after this list).
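A minimal sketch of a `tf.data` pipeline that keeps the GPU fed, assuming TFRecord input; the file pattern and feature spec are placeholders:

```python
import tensorflow as tf

def parse_fn(record):
    # Illustrative feature spec; a real job would match its TFRecord schema.
    features = {
        "x": tf.io.FixedLenFeature([10], tf.float32),
        "y": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(record, features)
    return parsed["x"], parsed["y"]

# Placeholder file pattern; the point is the pipeline shape, not the paths.
files = tf.data.Dataset.list_files("gs://my-bucket/train-*.tfrecord")

dataset = (
    tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
    .map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE)  # parallel decode
    .cache()                      # reuse decoded examples after epoch 1
    .shuffle(10_000)
    .batch(128)
    .prefetch(tf.data.AUTOTUNE)   # overlap input prep with GPU compute
)
```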
Knowledge Check
Your training job fails immediately with `PermissionDenied`. You verify that YOU (your user account) have Owner access to the Project. Why does it still fail?