The Scorecard: Evaluating Fine-tuned Models

Is it actually better? Learn how to use Amazon Bedrock Model Evaluation to objectively measure the accuracy, safety, and performance of your custom AI models.

Judging the Student

You have spent thousands of dollars fine-tuning an Amazon Titan model on your company’s legal data. The training status says "Complete." But is the model actually better than the base version? Does it follow your brand voice, or has it become "hallucination-prone" after tuning?

In the AWS Certified Generative AI Developer – Professional exam, you must demonstrate competence in Model Evaluation. You cannot rely on "vibe-based" testing; you need cold, hard metrics.


1. Amazon Bedrock Model Evaluation

Bedrock provides a managed service to evaluate models (both base models and your custom fine-tuned versions). You can choose between two primary modes:

A. Automatic Evaluation

  • How it works: Uses built-in algorithmic metrics, or an LLM-as-a-judge (a strong model such as Claude acting as the grader), to score a set of responses.
  • Metrics:
    • ROUGE/BLEU: Measure the n-gram overlap between the generated text and a "Ground Truth" reference answer (see the ROUGE sketch after this list).
    • Coherence: How well the sentences flow.
    • Toxicity: Does the custom model stay within safety bounds?
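
Below is a minimal sketch of the ROUGE idea using the open-source rouge-score package (pip install rouge-score). It illustrates the metric itself, not Bedrock's internal implementation, and the example strings are invented.

# Example: scoring a generated answer against a "Ground Truth" reference
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

ground_truth = "The contract terminates automatically after 12 months."
generated = "The agreement ends automatically after twelve months."

scores = scorer.score(ground_truth, generated)
print(scores['rougeL'].fmeasure)  # overlap with the reference, between 0.0 and 1.0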

B. Human Evaluation

  • How it works: You select a workforce (internal team or AWS experts) to manually rank responses.
  • Why: Humans are better at judging the nuance, humor, and subtle brand tone that a mathematical formula might miss.

2. The Evaluation Metrics for Developers

| Metric | Target Use Case | Explanation |
| --- | --- | --- |
| Accuracy | Classification | Did the model pick the correct category? |
| Groundedness | RAG / Summary | Did the model stay true to the provided facts without hallucinating? |
| Latency | Production Chat | How many milliseconds until the first token is generated? (See the latency sketch below the table.) |
| Toxicity | Safety | Does the model produce harmful or offensive output despite your safety guardrails? |
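
Below is a minimal sketch of measuring time-to-first-token with streaming inference. The model ID and request body are placeholders for an Amazon Titan Text model; other model families expect a different body schema.

import json
import time
import boto3

runtime = boto3.client('bedrock-runtime')

def time_to_first_token(prompt: str) -> float:
    body = json.dumps({'inputText': prompt, 'textGenerationConfig': {'maxTokenCount': 256}})
    start = time.perf_counter()
    response = runtime.invoke_model_with_response_stream(
        modelId='amazon.titan-text-express-v1',  # placeholder model ID
        body=body
    )
    for event in response['body']:
        if 'chunk' in event:
            # First streamed chunk received: report latency in milliseconds
            return (time.perf_counter() - start) * 1000
    return float('nan')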

3. Evaluation Process: The Model "Fight Club"

A common professional strategy is the Model-to-Model Comparison (A/B testing).

  1. Feed the same 100 prompts to both the Base Model and your Fine-tuned Model.
  2. Ask a human (or a "High-Elo" judge model) to pick which answer is better.
  3. The Result: You get a "% Win Rate" for your custom model. If the win rate is < 60%, your fine-tuning might not be worth the cost (see the win-rate sketch below the diagram).

graph TD
    P[Prompt Dataset] --> M1[Base Model]
    P --> M2[Fine-tuned Model]
    M1 --> J{Evaluation Engine: Bedrock/Human}
    M2 --> J
    J --> R[Scorecard: Which is better?]
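
A minimal sketch of the scoring step, assuming you have already collected one verdict per prompt from the judge (the verdict values and the 60% bar are illustrative):

# Each entry records which answer the judge preferred for one prompt
verdicts = ['fine_tuned', 'base', 'fine_tuned', 'tie', 'fine_tuned']

win_rate = verdicts.count('fine_tuned') / len(verdicts) * 100
print(f"Fine-tuned win rate: {win_rate:.1f}%")

if win_rate < 60:
    print("Below the 60% bar -- the fine-tune may not justify its cost.")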

4. Using SageMaker Model Monitor for Drift

Evaluation doesn't stop at launch. As we learned in Module 11, models can "drift."

  • Once your fine-tuned model is live behind a SageMaker endpoint, use Model Monitor to continuously compare live predictions against a baseline captured from your training data (a minimal drift check is sketched below).
  • If the model's accuracy drops below your "Baseline Score," it’s time for another tuning run.

5. Identifying Overfitting through Evaluation

In the exam, you will encounter a scenario where a fine-tuned model works perfectly on the training data but fails on real-world queries. This is Overfitting.

  • The Solution: During evaluation, always use a "Hold-out Set"—data the model has NEVER seen during training. If the model fails the hold-out set, your training process was too narrow.
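
A minimal sketch of carving out a hold-out set before training even starts (the file names and the 90/10 split are arbitrary examples):

import json
import random

random.seed(42)

with open('legal_dataset.jsonl') as f:
    records = [json.loads(line) for line in f]

random.shuffle(records)
split = int(len(records) * 0.9)  # 90% for training, 10% held out for evaluation
train, holdout = records[:split], records[split:]

for filename, subset in [('train.jsonl', train), ('holdout.jsonl', holdout)]:
    with open(filename, 'w') as f:
        for record in subset:
            f.write(json.dumps(record) + '\n')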

6. Code Example: Triggering an Evaluation Job

import boto3

client = boto3.client('bedrock')

def start_evaluation():
    response = client.create_evaluation_job(
        jobName='MyLegalModelEval-001',
        # Placeholder IAM role with access to the dataset, the output bucket, and the model
        roleArn='arn:aws:iam::123456789012:role/BedrockEvaluationRole',
        evaluationConfig={
            'automated': {
                'datasetMetricConfigs': [{
                    'taskType': 'QuestionAndAnswer',
                    'dataset': {
                        'name': 'LegalEvalSet',
                        'datasetLocation': {'s3Uri': 's3://my-test-data/eval_set.jsonl'}
                    },
                    'metricNames': ['Builtin.Accuracy', 'Builtin.Robustness']
                }]
            }
        },
        inferenceConfig={
            # The fine-tuned model to test
            'models': [{'bedrockModel': {'modelIdentifier': 'arn:aws:bedrock:us-east-1:123:custom-model/legal-v2'}}]
        },
        outputDataConfig={'s3Uri': 's3://my-test-data/eval-results/'}
    )
    return response['jobArn']
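
Once the job is started, you can poll it until the scorecard lands in your S3 output location. A minimal sketch reusing the client from above (the status values checked here are assumptions based on the typical Bedrock job lifecycle):

import time

def wait_for_evaluation(job_arn: str) -> str:
    while True:
        job = client.get_evaluation_job(jobIdentifier=job_arn)
        if job['status'] in ('Completed', 'Failed', 'Stopped'):
            return job['status']
        time.sleep(60)  # evaluation jobs typically run for minutes to hours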

Knowledge Check: Test Your Evaluation Knowledge

A developer has fine-tuned a model for customer sentiment analysis. They want to objectively measure how closely the model's summaries match a set of high-quality 'Gold Standard' summaries written by human experts. Which automated metric should they use?


Summary

Evaluation is the "Reality Check" of AI development. It proves value and identifies risks. This concludes Module 13. In the final module of Domain 4, we move to Cost and Performance Optimization—making your AI faster and cheaper.


Next Module: The Lean AI: Optimizing Token Usage and Costs
