
Training Data Management: Strategies
How to feed the beast. GCS Bucket structure, Managed Datasets, and improving I/O performance.
Where does the data live?
When training custom models on Vertex AI, the data-loading step is often the bottleneck. Your GPU is fast (and expensive, on the order of $3/hr). Your data loader is slow. You are burning money waiting on I/O.
1. Storage Options
| Source | Speed | Use Case |
|---|---|---|
| BigQuery | Medium | Tabular data. Use the BigQuery Storage Read API (Arrow format); see the sketch below the table. |
| Cloud Storage (GCS) | High | Images/Videos. |
| NFS (Filestore) | Extreme | HPC-style workloads that need a POSIX-compliant shared filesystem. |
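For the BigQuery row, here is a minimal sketch of pulling a training table through the Storage Read API with the Python client. The project, dataset, and table names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID
sql = "SELECT * FROM `my-project.my_dataset.training_table`"  # hypothetical table

# With the google-cloud-bigquery-storage package installed, to_arrow()
# streams results through the BigQuery Storage Read API as Arrow record
# batches instead of paging through the slower REST API.
arrow_table = client.query(sql).to_arrow()
df = arrow_table.to_pandas()
print(df.shape)
```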
2. Vertex AI Managed Datasets
You can just point your code at a CSV in GCS, but wrapping it in a Vertex AI Managed Dataset gives you (see the sketch after this list):
- Splits: Automatically handle Train/Test/Validation splitting.
- Labeling: Integration with Data Labeling Service.
- Stats: Automatic histograms of data distribution.
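A minimal sketch of registering an existing CSV as a Managed Dataset with the Vertex AI Python SDK; the project, region, display name, and GCS path are all placeholders:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # hypothetical values

# Wrap the raw CSV in a Managed Dataset so Vertex AI can track splits,
# labeling tasks, and dataset statistics for you.
dataset = aiplatform.TabularDataset.create(
    display_name="sales-training-data",            # hypothetical name
    gcs_source=["gs://my-bucket/data/train.csv"],  # hypothetical path
)
print(dataset.resource_name)
```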
3. Best Practice: GCS Optimization
If you have 1 million small images (~10 KB each), reading them one object at a time from GCS is slow: every image is a separate HTTP request, and per-request latency dominates. Result: GPU utilization drops to 0% while the input pipeline catches up.
Solution 1: TFRecord (Pre-fetching)
Combine thousands of small files into a few large binary files (e.g. data.tfrecord). The reader can then stream large sequential chunks instead of issuing one request per image.
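A minimal sketch, assuming JPEGs with labels encoded in the filenames (both assumptions are illustrative): pack the images into a TFRecord file, then read it back sequentially with a prefetched tf.data pipeline.

```python
import tensorflow as tf
from pathlib import Path

# --- Write: pack many small JPEGs into one large TFRecord file ---
def pack_images(image_dir: str, output_path: str) -> None:
    with tf.io.TFRecordWriter(output_path) as writer:
        for path in Path(image_dir).glob("*.jpg"):
            example = tf.train.Example(features=tf.train.Features(feature={
                "image": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[path.read_bytes()])),
                "label": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[path.stem.encode()])),
            }))
            writer.write(example.SerializeToString())

# --- Read: sequential, parallel, prefetched reads straight from GCS ---
def make_dataset(tfrecord_pattern: str) -> tf.data.Dataset:
    files = tf.data.Dataset.list_files(tfrecord_pattern)
    return (tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
            .prefetch(tf.data.AUTOTUNE))

# ds = make_dataset("gs://my-bucket/tfrecords/train-*.tfrecord")  # hypothetical path
```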
Solution 2: GCS FUSE
Mount the bucket as a local folder, e.g. /gcs/my-bucket (a short read sketch follows this list).
- Pros: Easy (looks like a local filesystem).
- Cons: Slower than the native client libraries if not tuned.
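Vertex AI custom training containers expose your buckets under /gcs/&lt;bucket&gt; via Cloud Storage FUSE, so plain file I/O works. The bucket name and prefix below are placeholders:

```python
import glob
import os

DATA_DIR = "/gcs/my-bucket/images"  # hypothetical bucket and prefix on the FUSE mount

paths = glob.glob(os.path.join(DATA_DIR, "*.jpg"))
print(f"Found {len(paths)} images via the FUSE mount")

# Read a handful of samples to sanity-check the mount before training starts.
for path in paths[:5]:
    with open(path, "rb") as f:
        data = f.read()
    print(path, len(data), "bytes")
```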
Knowledge Check
You are training a ResNet model on 10 million small JPEG images stored in Cloud Storage. You notice your training speed is slow, and GPU utilization oscillates between 0% and 100%. What is the most effective fix?