The Hardest Part: Data Collection and Preparation

Garbage In, Garbage Out. Master the '80% rule' of machine learning and learn how to clean, label, and prepare your data for success.

The foundation of the machine

When people think of AI, they think of the cool parts: training a model or calling an API. In reality, though, roughly 80% of an AI project's time is spent on data.

If you give a model "dirty" data (data with missing values, errors, or bias), it will produce "dirty" results. This is known as the GIGO principle (Garbage In, Garbage Out). In this first lesson of Module 13, we walk through the "unsexy" but essential step of data preparation.


1. The Stages of Data Prep

A. Data Collection (Ingestion)

Where does the data live? Is it in an S3 bucket? An on-premises database?

  • You must move the data to a central location (usually an Amazon S3 data lake) so the AI services can access it. A minimal ingestion sketch follows below.
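
A minimal sketch of that ingestion step using boto3 (the AWS SDK for Python); the bucket name and file paths here are placeholders:

import boto3

# Upload a local export into the central S3 data lake.
# "my-company-data-lake" and both paths are placeholder values.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="exports/customers.csv",       # local file to ingest
    Bucket="my-company-data-lake",          # central data lake bucket
    Key="raw/customers/customers.csv",      # keep a raw/ zone for untouched data
)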

B. Data Cleaning (Pre-processing)

Real-world data is messy. (A pandas sketch of all three fixes follows this list.)

  • You need to handle Missing Values (what if the age field is empty?).
  • You need to handle Outliers (what if a customer is listed as being 500 years old?).
  • You need to Normalize data (e.g., converting all currencies to USD so they can be compared).
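
A minimal pandas sketch of those three fixes, assuming a hypothetical customers.csv with age, amount, and currency columns:

import pandas as pd

df = pd.read_csv("customers.csv")  # placeholder file

# Missing values: fill empty ages with the median age.
df["age"] = df["age"].fillna(df["age"].median())

# Outliers: drop rows with impossible ages (like 500 years old).
df = df[df["age"].between(0, 120)]

# Normalization: convert all amounts to USD using a toy rate table.
rates_to_usd = {"USD": 1.0, "EUR": 1.08, "GBP": 1.27}  # illustrative rates
df["amount_usd"] = df["amount"] * df["currency"].map(rates_to_usd)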

C. Data Labeling (Ground Truth)

If you are doing Supervised Learning, you need the "Answers."

  • You need to tag the emails as "Spam" or "Not Spam."
  • Service Tip: Amazon SageMaker Ground Truth is the service used to manage human labelers for this step (a boto3 sketch of the job setup follows below).
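
Labeling jobs are usually set up in the console, but boto3's create_labeling_job shows the moving parts. The sketch below is heavily abbreviated: every name, ARN, and S3 URI is a placeholder, and the pre-annotation and consolidation Lambdas vary by region and task type.

import boto3

sm = boto3.client("sagemaker")

# Abbreviated sketch; all names, ARNs, and S3 URIs are placeholders.
sm.create_labeling_job(
    LabelingJobName="cyclist-boxes",
    LabelAttributeName="cyclist",
    RoleArn="arn:aws:iam::123456789012:role/GroundTruthRole",
    InputConfig={"DataSource": {"S3DataSource": {
        "ManifestS3Uri": "s3://my-bucket/manifests/frames.manifest"}}},
    OutputConfig={"S3OutputPath": "s3://my-bucket/labels/"},
    HumanTaskConfig={
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:123456789012:workteam/private-crowd/my-team",
        "UiConfig": {"UiTemplateS3Uri": "s3://my-bucket/templates/bbox.liquid.html"},
        # AWS-managed Lambdas for the bounding-box task type (region-specific).
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-1:432418664414:function:PRE-BoundingBox",
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn":
                "arn:aws:lambda:us-east-1:432418664414:function:ACS-BoundingBox"},
        "TaskTitle": "Draw a box around each cyclist",
        "TaskDescription": "Draw tight bounding boxes around cyclists only",
        "NumberOfHumanWorkersPerDataObject": 3,  # consensus across 3 labelers
        "TaskTimeLimitInSeconds": 300,
    },
)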

2. Structured vs. Unstructured Data

For the exam, you must know what "form" your data is in (a short sketch contrasting the two follows the list):

  • Structured Data: Fits neatly in a spreadsheet or a database. (e.g., Sales numbers, Customer names).
    • Perfect for: Amazon Forecast, SageMaker Canvas.
  • Unstructured Data: Messy, real-world formats. (e.g., Photos, Audio files, PDFs, Tweets).
    • Perfect for: Amazon Rekognition, Comprehend, Textract.
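
A minimal sketch of the contrast, assuming a hypothetical sales.csv and a photo already sitting in S3 (detect_labels is Rekognition's label-detection call; the bucket and file names are placeholders):

import boto3
import pandas as pd

# Structured: rows and columns load straight into a DataFrame.
sales = pd.read_csv("sales.csv")  # e.g., date, region, units, revenue
print(sales.describe())

# Unstructured: a photo has no columns; a vision service extracts meaning.
rekognition = boto3.client("rekognition")
response = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "my-company-data-lake",
                        "Name": "raw/photos/street.jpg"}},
    MaxLabels=10,
)
print([label["Name"] for label in response["Labels"]])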

3. How to Do This on AWS

  • SageMaker Data Wrangler: The "No-Code" way to clean data (you can see the charts of your data errors visually).
  • AWS Glue: The "Plumbing" way to clean data, using Extract, Transform, Load (ETL) jobs (a minimal job script is sketched after this list).
  • SageMaker Studio Notebooks: The "Hard-Code" way (writing Python/Pandas scripts to clean the data).
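
One way the Glue route can look, as a minimal sketch: it assumes a hypothetical Data Catalog database ("data_lake") and table ("raw_customers") have already been crawled, and it runs inside a Glue job environment, not a local Python shell.

from awsglue.context import GlueContext
from awsglue.transforms import DropNullFields
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Extract: read the raw table from the Glue Data Catalog (placeholder names).
raw = glue_context.create_dynamic_frame.from_catalog(
    database="data_lake", table_name="raw_customers")

# Transform: drop fields that are entirely null.
cleaned = DropNullFields.apply(frame=raw)

# Load: write the cleaned data back to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://my-company-data-lake/clean/customers/"},
    format="parquet",
)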

4. Visualizing the Data Pipeline

graph LR
    A[Raw Data Sources: Logs/S3/DB] --> B[Ingestion: AWS DataSync/Glue]
    B --> C[CLEANING: Data Wrangler]
    C --> D[LABELING: Ground Truth]
    D --> E[FINAL DATASET: Training-Ready]
    
    subgraph The_Practitioner_Focus
    C
    D
    end

5. Summary: Quality is Your Job

Data preparation is the single most important factor for the success of an AI project.

  1. Quantity is good.
  2. Quality is better.
  3. Diversity is best (to avoid bias).

Exercise: Identify the Preparation Tool

A transportation company has 1 million video clips of traffic intersections. They want to train a model to detect "Cyclists." They need to hire a team of people to watch the clips and draw boxes around the cyclists. Which AWS service should they use to manage this workflow?

  • A. Amazon Rekognition.
  • B. AWS Lambda.
  • C. Amazon SageMaker Ground Truth.
  • D. Amazon Forecast.

The Answer is C! Ground Truth is the service for managing human labeling workflows to create "Ground Truth" data for machine learning.


Knowledge Check

In the context of the AI project lifecycle, what is 'Data Labeling'?

What's Next?

Data is ready. Now we need the "Brains." In the next lesson, we'll see how to pick a model and talk to it. Find out in Lesson 2: Model Selection and Prompt Tuning.
