
Troubleshooting Common Errors in Training and Serving
How to troubleshoot common errors in training and serving. A guide to debugging your ML models.
When Things Go Wrong
No matter how carefully you design and build your ML system, things will inevitably go wrong. When they do, it's important to have a systematic approach to troubleshooting.
1. Common Training Errors
ResourceExhaustedError: This error occurs when your training job runs out of memory. To fix this, you can try:
- Decreasing your batch size.
- Using a smaller model.
- Using a machine with more memory.
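The first fix, decreasing the batch size, can be automated. The following is a minimal sketch, not a definitive implementation: `train_with_backoff` and `fake_train_step` are hypothetical names, and Python's built-in `MemoryError` stands in for the out-of-memory error so the example runs without TensorFlow installed. In a TensorFlow job you would pass `oom_errors=(tf.errors.ResourceExhaustedError,)` instead.

```python
def train_with_backoff(train_step, batch_size, min_batch_size=1,
                       oom_errors=(MemoryError,)):
    """Call train_step(batch_size), halving batch_size after each OOM error.

    Returns a (batch_size_that_worked, train_step_result) tuple.
    """
    while batch_size >= min_batch_size:
        try:
            return batch_size, train_step(batch_size)
        except oom_errors:
            # The batch did not fit in memory: halve it and try again.
            batch_size //= 2
    raise RuntimeError("Out of memory even at the minimum batch size")


def fake_train_step(batch_size):
    # Hypothetical stand-in for a real training step:
    # pretend that any batch larger than 16 does not fit in memory.
    if batch_size > 16:
        raise MemoryError(f"batch of {batch_size} does not fit")
    return "ok"


used_batch_size, result = train_with_backoff(fake_train_step, batch_size=128)
print(used_batch_size, result)  # prints "16 ok"
```

Halving keeps the number of retries small (128, 64, 32, 16 here). If the job still fails at the minimum batch size, move on to a smaller model or a machine with more memory.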
InvalidArgumentError: This error occurs when you provide an invalid argument to a TensorFlow operation. To fix this, read the error message to find the operation that failed, then check the shapes, data types, and value ranges of its inputs against the documentation for that operation.
PermissionDeniedError: This error occurs when your training job does not have permission to access a resource, such as a file in a Cloud Storage bucket. To fix this, you should check the IAM permissions for the service account that is running your training job.
2. Common Serving Errors
503 Service Unavailable: This error occurs when your model is not able to handle the volume of requests that it is receiving. To fix this, you can try:
- Increasing the number of nodes in your endpoint.
- Using a larger machine type.
- Optimizing your model for performance.
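The fixes above all add capacity on the server side. As a complement, a client can ride out short bursts of 503 responses by retrying with exponential backoff. This is a sketch under stated assumptions: `ServiceUnavailable`, `call_with_backoff`, and `flaky_request` are hypothetical names for illustration, not part of any client library, and your real client will signal a 503 in its own way.

```python
import random
import time


class ServiceUnavailable(Exception):
    """Hypothetical exception standing in for an HTTP 503 response."""


def call_with_backoff(send_request, max_attempts=5, base_delay=0.5,
                      sleep=time.sleep):
    """Call send_request(), retrying on ServiceUnavailable with backoff."""
    for attempt in range(max_attempts):
        try:
            return send_request()
        except ServiceUnavailable:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            # Double the wait on each attempt and add random jitter so
            # that many clients do not all retry at the same moment.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))


calls = {"n": 0}


def flaky_request():
    # Hypothetical request that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ServiceUnavailable("503 Service Unavailable")
    return "prediction"


delays = []  # Record the waits instead of actually sleeping in this demo.
result = call_with_backoff(flaky_request, sleep=delays.append)
print(result, calls["n"])  # prints "prediction 3"
```

Retries only hide transient overload. If 503 errors persist, the endpoint needs more nodes, a larger machine type, or a faster model.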
400 Bad Request: This error occurs when you send an invalid request to your model. To fix this, you should carefully check the documentation for your model's API.
403 Forbidden: This error occurs when you do not have permission to access your model. To fix this, you should check the IAM permissions for your user account or service account.
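For the 400 Bad Request case, it often helps to validate the request body on the client before sending it. The sketch below assumes the endpoint expects a JSON object with a non-empty `instances` list, which is the shape Vertex AI online prediction uses. The function name `validate_request` and the field names `age` and `income` are hypothetical, so substitute the inputs your own model expects.

```python
import json


def validate_request(body, required_fields):
    """Return a list of problems found in a request body (empty if none)."""
    instances = body.get("instances") if isinstance(body, dict) else None
    if not isinstance(instances, list) or not instances:
        return ['body must be an object with a non-empty "instances" list']

    problems = []
    for i, instance in enumerate(instances):
        if not isinstance(instance, dict):
            problems.append(f"instance {i} is not an object")
            continue
        missing = sorted(set(required_fields) - set(instance))
        if missing:
            problems.append(f"instance {i} is missing fields: {missing}")

    try:
        json.dumps(body)  # The body must be serializable to JSON.
    except (TypeError, ValueError) as exc:
        problems.append(f"body is not JSON-serializable: {exc}")
    return problems


# Hypothetical feature names, for illustration only.
fields = ["age", "income"]
good = {"instances": [{"age": 42, "income": 55000.0}]}
bad = {"instances": [{"age": 42}]}

print(validate_request(good, fields))  # prints []
print(validate_request(bad, fields))
```

A check like this turns an opaque 400 response into a specific message about which instance and which field is wrong.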
3. Debugging Tools
- Cloud Logging: You can use Cloud Logging to view the logs for your training jobs and your model endpoints. This can be a valuable source of information for debugging errors.
- Cloud Profiler: You can use Cloud Profiler to profile the performance of your training jobs and your model endpoints. This can help you identify any performance bottlenecks.
- The What-If Tool: You can use the What-If Tool to explore the behavior of your model and identify any fairness or bias issues.
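When using Cloud Logging, a filter that narrows the logs to recent errors is usually the fastest starting point. The sketch below only builds a filter string using comparisons from the Logging query language (`resource.type`, `severity`, `timestamp`). The helper name `build_error_filter` is hypothetical, and the resource type `ml_job` is an assumption about how custom training jobs are labeled, so confirm the value shown for your own job in the Logs Explorer.

```python
from datetime import datetime, timezone


def build_error_filter(resource_type, since, min_severity="ERROR"):
    """Build a Cloud Logging filter for recent high-severity log entries."""
    # Timestamps in a filter are written in RFC 3339 format, in UTC here.
    timestamp = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return " AND ".join([
        f'resource.type="{resource_type}"',
        f"severity>={min_severity}",
        f'timestamp>="{timestamp}"',
    ])


# "ml_job" is an assumed resource type; verify it for your project.
since = datetime(2024, 1, 1, tzinfo=timezone.utc)
log_filter = build_error_filter("ml_job", since)
print(log_filter)
```

The resulting string can be pasted into the Logs Explorer query box, where you can tighten it further with labels specific to your job or endpoint.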
Knowledge Check
You are training a model on Vertex AI. Your training job fails with a `ResourceExhaustedError`. What is the most likely cause of this error?