The Scorecard: Evaluating Fine-tuned Models

Is it actually better? Learn how to use Amazon Bedrock Model Evaluation to objectively measure the accuracy, safety, and performance of your custom AI models.

Judging the Student

You have spent thousands of dollars fine-tuning an Amazon Titan model on your company’s legal data. The training status says "Complete." But is the model actually better than the base version? Does it follow your brand voice, or has it become "hallucination-prone" after tuning?

In the AWS Certified Generative AI Developer – Professional exam, you must demonstrate competence in Model Evaluation. You cannot rely on "vibe-based" testing; you need cold, hard metrics.


1. Amazon Bedrock Model Evaluation

Bedrock provides a managed service to evaluate models (both base models and your custom fine-tuned versions). You can choose between two primary modes:

A. Automatic Evaluation

  • How it works: Uses built-in algorithmic metrics, or an LLM-as-a-judge (a strong model such as Claude acting as the grader), to score a set of responses.
  • Metrics:
    • ROUGE/BLEU: Measure the n-gram overlap between the generated text and a "Ground Truth" reference answer (see the ROUGE sketch after this list).
    • Coherence: How well the sentences flow.
    • Toxicity: Does the custom model stay within safety bounds?
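
Below is a minimal sketch of the ROUGE idea using the open-source rouge-score package (pip install rouge-score). It illustrates the metric itself, not Bedrock's internal implementation, and the example strings are invented.

# Example: scoring a generated answer against a "Ground Truth" reference
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

ground_truth = "The contract terminates automatically after 12 months."
generated = "The agreement ends automatically after twelve months."

scores = scorer.score(ground_truth, generated)
print(scores['rougeL'].fmeasure)  # overlap with the reference, between 0.0 and 1.0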

B. Human Evaluation

  • How it works: You select a workforce (internal team or AWS experts) to manually rank responses.
  • Why: Humans are better at judging the nuance, humor, and subtle brand tone that a mathematical formula might miss.

2. The Evaluation Metrics for Developers

| Metric | Target Use Case | Explanation |
| --- | --- | --- |
| Accuracy | Classification | Did the model pick the correct category? |
| Groundedness | RAG / Summary | Did the model stay true to the provided facts without hallucinating? |
| Latency | Production Chat | How many milliseconds until the first token is generated? (See the latency sketch below the table.) |
| Toxicity | Safety | Does the model produce harmful or offensive output despite your safety guardrails? |
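
Below is a minimal sketch of measuring time-to-first-token with streaming inference. The model ID and request body are placeholders for an Amazon Titan Text model; other model families expect a different body schema.

import json
import time
import boto3

runtime = boto3.client('bedrock-runtime')

def time_to_first_token(prompt: str) -> float:
    body = json.dumps({'inputText': prompt, 'textGenerationConfig': {'maxTokenCount': 256}})
    start = time.perf_counter()
    response = runtime.invoke_model_with_response_stream(
        modelId='amazon.titan-text-express-v1',  # placeholder model ID
        body=body
    )
    for event in response['body']:
        if 'chunk' in event:
            # First streamed chunk received: report latency in milliseconds
            return (time.perf_counter() - start) * 1000
    return float('nan')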

3. Evaluation Process: The Model "Fight Club"

A common professional strategy is the Model-to-Model Comparison (A/B testing).

  1. Feed the same 100 prompts to both the Base Model and your Fine-tuned Model.
  2. Ask a human (or a "High-Elo" judge model) to pick which answer is better.
  3. The Result: You get a "% Win Rate" for your custom model. If the win rate is < 60%, your fine-tuning might not be worth the cost (see the win-rate sketch below the diagram).

graph TD
    P[Prompt Dataset] --> M1[Base Model]
    P --> M2[Fine-tuned Model]
    M1 --> J{Evaluation Engine: Bedrock/Human}
    M2 --> J
    J --> R[Scorecard: Which is better?]
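
A minimal sketch of the scoring step, assuming you have already collected one verdict per prompt from the judge (the verdict values and the 60% bar are illustrative):

# Each entry records which answer the judge preferred for one prompt
verdicts = ['fine_tuned', 'base', 'fine_tuned', 'tie', 'fine_tuned']

win_rate = verdicts.count('fine_tuned') / len(verdicts) * 100
print(f"Fine-tuned win rate: {win_rate:.1f}%")

if win_rate < 60:
    print("Below the 60% bar -- the fine-tune may not justify its cost.")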

4. Using SageMaker Model Monitor for Drift

Evaluation doesn't stop at launch. As we learned in Module 11, models can "drift."

  • Once your fine-tuned model is live behind a SageMaker endpoint, use Model Monitor to continuously compare live predictions against a baseline captured from your training data (a minimal drift check is sketched below).
  • If the model's accuracy drops below your "Baseline Score," it’s time for another tuning run.

5. Identifying Overfitting through Evaluation

In the exam, you will encounter a scenario where a fine-tuned model works perfectly on the training data but fails on real-world queries. This is Overfitting.

  • The Solution: During evaluation, always use a "Hold-out Set"—data the model has NEVER seen during training. If the model fails the hold-out set, your training process was too narrow.
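
A minimal sketch of carving out a hold-out set before training even starts (the file names and the 90/10 split are arbitrary examples):

import json
import random

random.seed(42)

with open('legal_dataset.jsonl') as f:
    records = [json.loads(line) for line in f]

random.shuffle(records)
split = int(len(records) * 0.9)  # 90% for training, 10% held out for evaluation
train, holdout = records[:split], records[split:]

for filename, subset in [('train.jsonl', train), ('holdout.jsonl', holdout)]:
    with open(filename, 'w') as f:
        for record in subset:
            f.write(json.dumps(record) + '\n')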

6. Code Example: Triggering an Evaluation Job

import boto3

client = boto3.client('bedrock')

def start_evaluation():
    response = client.create_evaluation_job(
        jobName='MyLegalModelEval-001',
        # Placeholder IAM role with access to the dataset, the output bucket, and the model
        roleArn='arn:aws:iam::123456789012:role/BedrockEvaluationRole',
        evaluationConfig={
            'automated': {
                'datasetMetricConfigs': [{
                    'taskType': 'QuestionAndAnswer',
                    'dataset': {
                        'name': 'LegalEvalSet',
                        'datasetLocation': {'s3Uri': 's3://my-test-data/eval_set.jsonl'}
                    },
                    'metricNames': ['Builtin.Accuracy', 'Builtin.Robustness']
                }]
            }
        },
        inferenceConfig={
            # The fine-tuned model to test
            'models': [{'bedrockModel': {'modelIdentifier': 'arn:aws:bedrock:us-east-1:123:custom-model/legal-v2'}}]
        },
        outputDataConfig={'s3Uri': 's3://my-test-data/eval-results/'}
    )
    return response['jobArn']
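
Once the job is started, you can poll it until the scorecard lands in your S3 output location. A minimal sketch reusing the client from above (the status values checked here are assumptions based on the typical Bedrock job lifecycle):

import time

def wait_for_evaluation(job_arn: str) -> str:
    while True:
        job = client.get_evaluation_job(jobIdentifier=job_arn)
        if job['status'] in ('Completed', 'Failed', 'Stopped'):
            return job['status']
        time.sleep(60)  # evaluation jobs typically run for minutes to hours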

Knowledge Check: Test Your Evaluation Knowledge

A developer has fine-tuned a model for customer sentiment analysis. They want to objectively measure how closely the model's summaries match a set of high-quality 'Gold Standard' summaries written by human experts. Which automated metric should they use?


Summary

Evaluation is the "Reality Check" of AI development. It proves value and identifies risks. This concludes Module 13. In the final module of Domain 4, we move to Cost and Performance Optimization—making your AI faster and cheaper.


Next Module: The Lean AI: Optimizing Token Usage and Costs
