
Evaluating Model ROI: The Intelligence/Price Audit
Learn how to quantify the value of your model selection. Master the metrics for 'Capability per Dollar' and build a performance-based leaderboard.
In Module 14.3, we learned how to route queries. But how do we know if that routing is actually working? What if we are routing "Logic" queries to a cheap model that fails 50% of the time?
If a cheap model fails, the user is unhappy, and you end up paying for a "Retry" on a more expensive model anyway. This is a Negative ROI scenario.
In this lesson, we learn how to audit your model selections. We will explore Model Comparison Frameworks, Cost-Weighted Accuracy, and the "Efficiency Threshold."
1. The Cost-Weighted Accuracy (CWA) Metric
Standard Accuracy is: Correct / Total.
Cost-Weighted Accuracy is: (Accuracy) / (Cost per 1,000 queries).
Example:
- Model A (GPT-4o): 98% Correct | $30.00 / 1M tokens.
- Model B (GPT-4o mini): 94% Correct | $0.15 / 1M tokens.
Model B is 200x cheaper, but only 4 percentage points less accurate. In most production scenarios (unless you are performing brain surgery or space navigation), Model B has a much higher ROI.
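To make the math concrete, here is a minimal sketch of the CWA calculation. One assumption not stated above: an average of roughly 1,000 tokens per query, so the per-1M-token price doubles as the cost per 1,000 queries.
Python Code: CWA Calculator
def cost_weighted_accuracy(accuracy, cost_per_1k_queries):
    # Accuracy points earned per dollar spent on 1,000 queries
    return accuracy / cost_per_1k_queries

# Assumes ~1,000 tokens per query, so $/1M tokens ~= $/1,000 queries
cwa_a = cost_weighted_accuracy(0.98, 30.00)  # GPT-4o
cwa_b = cost_weighted_accuracy(0.94, 0.15)   # GPT-4o mini

print(f"GPT-4o CWA:      {cwa_a:.2f}")  # ~0.03
print(f"GPT-4o mini CWA: {cwa_b:.2f}")  # ~6.27 -- roughly 190x the value per dollar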
2. The "Quality Floor" Audit
You must define a Quality Floor for every feature in your app.
- "Translations must be 90% accurate."
- "Code must compile 80% of the time."
If a cheap model falls below the "Floor," you cannot use it for that task, no matter how many tokens it saves. Efficiency at the expense of functionality is a failure.
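Enforcing the floor can be as simple as a lookup before routing. A minimal sketch: the floor values mirror the examples above, and the 0.87 score is a hypothetical measurement.
Python Code: Quality Floor Check
QUALITY_FLOORS = {
    "translation": 0.90,       # "Translations must be 90% accurate"
    "code_generation": 0.80,   # "Code must compile 80% of the time"
}

def passes_floor(feature, measured_accuracy):
    # A model is only eligible for a task if it clears the floor
    return measured_accuracy >= QUALITY_FLOORS[feature]

# A cheap model scoring 87% on translation is disqualified,
# no matter how many tokens it saves.
print(passes_floor("translation", 0.87))  # False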
3. Implementation: The A/B Evaluation (Python)
To measure ROI, run the same "Eval Dataset" through each candidate model and compare the results side by side.
Python Code: Performance Auditor
def run_eval_comparison(test_queries, ground_truth):
    # Assumes run_test_suite() and get_current_model_pricing()
    # are defined elsewhere in your eval harness.
    models = ["gpt-4o", "gpt-4o-mini", "gemini-1.5-flash"]
    results = {}
    for m in models:
        # Score the model against the ground-truth answers
        accuracy = run_test_suite(m, test_queries, ground_truth)
        # Current price per 1M tokens
        cost = get_current_model_pricing(m)
        # CALCULATE ROI (the epsilon guards against division
        # by zero for free-tier models)
        roi_score = accuracy / (cost + 0.00001)
        results[m] = {"acc": accuracy, "roi": roi_score}
    return results

# Analysis:
# If 'gpt-4o-mini' has a higher ROI than 'gpt-4o' AND clears the
# Quality Floor from Section 2, switch.
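A usage sketch, assuming run_test_suite returns accuracy as a float in [0, 1], get_current_model_pricing returns a price per 1M tokens, and test_queries / ground_truth are already loaded:
Python Code: Building the Leaderboard
results = run_eval_comparison(test_queries, ground_truth)

# Sort by value-per-dollar to get the performance-based leaderboard
leaderboard = sorted(results.items(), key=lambda kv: kv[1]["roi"], reverse=True)
for model, scores in leaderboard:
    print(f"{model}: acc={scores['acc']:.1%}  roi={scores['roi']:.2f}")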
4. The "Model Decay" Tracker
Models get updated. Prices change. A model that was "Too Expensive" last month might have received a 50% price cut today. Efficiency ROI is a moving target. You should perform a "Price/Performance Audit" every quarter to ensure you are still using the optimized fleet.
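One lightweight way to run that quarterly audit is to snapshot prices at each review and flag large moves. A sketch with hypothetical baseline prices:
Python Code: Price Drift Tracker
# Prices recorded at the last quarterly audit (hypothetical, $ per 1M tokens)
LAST_AUDIT_PRICES = {"gpt-4o": 30.00, "gpt-4o-mini": 0.15}

def flag_price_drift(current_prices, threshold=0.25):
    # Return models whose price moved more than `threshold` since the audit
    flagged = []
    for model, old_price in LAST_AUDIT_PRICES.items():
        new_price = current_prices.get(model, old_price)
        if abs(new_price - old_price) / old_price > threshold:
            flagged.append((model, old_price, new_price))
    return flagged

# A 50% price cut should trigger a re-audit of your routing table
print(flag_price_drift({"gpt-4o": 15.00, "gpt-4o-mini": 0.15}))
# [('gpt-4o', 30.0, 15.0)]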
5. Token Efficiency and "Human Review" Costs
If a cheap model produces 5% more errors, those errors carry a Human Cost. If an engineer spends 10 minutes fixing a hallucinated bug from a cheap model, those 10 minutes cost roughly $20.00 in salary (at about $120/hour).
- The Calculation: If the human time spent fixing errors exceeds the total token savings across 1,000 queries, then the "Cheap" model is actually more expensive for the company, as the sketch below shows.
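The break-even arithmetic, using the figures above (5% extra errors, 10 minutes per fix, an assumed ~$120/hour engineer, and the Section 1 price gap):
Python Code: Human Cost Check
token_savings_per_1k = 30.00 - 0.15   # expert vs. cheap model, per 1,000 queries
extra_error_rate = 0.05               # cheap model's additional error rate
fix_minutes_per_error = 10
engineer_cost_per_minute = 2.00       # assumed ~$120/hour

extra_errors = 1000 * extra_error_rate   # 50 errors per 1,000 queries
human_cost = extra_errors * fix_minutes_per_error * engineer_cost_per_minute

print(f"Token savings: ${token_savings_per_1k:.2f}")  # $29.85
print(f"Human cost:    ${human_cost:.2f}")            # $1,000.00
# If every error needs an engineer, the "cheap" model loses badly here;
# the trade only works when most errors are harmless or caught automatically.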
6. Summary and Key Takeaways
- ROI = Accuracy / Cost: Look for the "Value Peak," not just the lowest price.
- Quality Floor: Efficiency is only valid if the output meets the minimum viable standard.
- Price/Performance Drift: Audit your choices quarterly as market prices drop.
- Factor in the Human: The most expensive "Token" in your system is actually an hour of your developer's life.
In the next lesson, Future-Proofing for Declining Token Prices, we look at how to prepare for a world where tokens are "Too Cheap to Meter."
Exercise: The ROI Spreadsheet
- Take a task: "Identify names in a 1-page document."
- Estimate the cost of doing this 1 million times with your "Favorite" expert model.
- Estimate the cost with a "Flash" model.
- Determine the 'Human Subsidy' limit:
- How many errors can the Flash model make before it's cheaper to just use the Expert model and save on human debugging time?
- (Result: Usually, you can afford a LOT of errors if the cost difference is 100x).
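To check your spreadsheet, the break-even error count reduces to one line. The run costs below are hypothetical placeholders; plug in your own estimates.
Python Code: Human Subsidy Limit
expert_cost = 30_000.00     # hypothetical: 1M runs on the expert model
flash_cost = 150.00         # hypothetical: 1M runs on the Flash model
fix_cost_per_error = 20.00  # $20 of human time per error (from Section 5)

errors_allowed = (expert_cost - flash_cost) / fix_cost_per_error
print(f"{errors_allowed:,.0f} fixable errors before Flash stops paying off")
# ~1,492 errors across 1M runs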