Evaluation & Benchmarking
Quality measurement frameworks — ROUGE, BLEU, human evaluation, continuous benchmarking, model comparison, and A/B testing.
Phase: 3 (Essential for long-term success but not blocking launch)
Audience: Product managers, quality engineers, data scientists, platform teams
Purpose: Establish systematic approaches to measure harness quality, compare models, detect regressions, and optimize performance.
1. Defining Quality Metrics for Your Domain
Quality is not one-dimensional. Different harness applications require different success criteria. The first step is identifying which metrics matter.
Task-Specific Success Criteria
Define what “success” means for your harness:
| Application Domain | Primary Success Metric | Secondary Metrics |
|---|---|---|
| Code Generation | Code compiles and passes tests | Readability, style compliance, efficiency |
| Customer Support | User satisfaction, resolution rate | First-response time, escalations needed |
| Content Creation | Relevance and accuracy | Originality, engagement, brand consistency |
| Data Analysis | Correctness of insights | Clarity of explanation, actionability |
| Document Summarization | ROUGE-L/BLEU scores, human judgment | Brevity, key points retained |
| Classification | F1 score, per-class accuracy | Confidence calibration, false positive rate |
| Search/Ranking | NDCG, MRR, click-through rate | Diversity of results, relevance at rank 10 |
| Translation | BLEU/METEOR score, human judgment | Terminology consistency, cultural accuracy |
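Several metrics in this table can be computed without any external service. As one example, ROUGE-L (used for summarization above) is based on the longest common subsequence between candidate and reference. A minimal pure-Python sketch; production systems should prefer a maintained package such as `rouge-score`:

```python
def lcs_length(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    """ROUGE-L F1 over whitespace tokens (simplified: no stemming, no multi-reference)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Identical strings score 1.0; disjoint strings score 0.0.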
Domain Expertise Required
Some domains require specialized knowledge to evaluate:
- Code Quality: Need developers to review syntax, logic, performance, security
- Medical/Legal: Subject matter experts must verify factual accuracy
- Content: Native speakers evaluate naturalness in translation/localization
- Data Analysis: Statisticians validate methodology and conclusions
- Creative Writing: Human judgment on tone, voice, narrative flow
For specialized domains, build a reviewer pool and establish inter-rater agreement (Cohen’s kappa ≥ 0.70 for manual evaluation).
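Cohen's kappa corrects raw agreement for the agreement two raters would reach by chance. A minimal sketch for two raters labeling the same items:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa of 1.0 is perfect agreement; values at or above the 0.70 threshold above indicate the rubric is being applied consistently.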
Automated vs Manual Evaluation
Choose the right approach for each metric:
| Metric | Automated | Manual | Hybrid |
|---|---|---|---|
| Syntax correctness (code) | ✓ Compiler/linter | - | - |
| Code test passage | ✓ Test runner | - | - |
| Schema compliance | ✓ JSON schema validation | - | - |
| Factual accuracy | - | ✓ Expert review | ✓ Spot-check + automation for obvious errors |
| Tone/style match | - | ✓ Human judgment | ✓ Rubric + final approval |
| Readability | ✓ Flesch-Kincaid, word count | ✓ User testing | ✓ Score + final review |
| Creativity | - | ✓ Human judges | ✓ Baseline scoring rules + judges |
| Performance | ✓ Benchmark suite | - | - |
Rule of thumb: Automate anything objective. Use humans for subjective judgment, then apply rubrics to increase consistency.
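As an example of an objective metric worth automating, Flesch Reading Ease (referenced in the table above) can be computed directly. The syllable counter below is a crude vowel-group heuristic, so treat scores as approximate; a tested library is preferable in production:

```python
import re

def flesch_reading_ease(text):
    """Flesch Reading Ease. Higher = easier to read.
    Syllables are approximated by counting vowel groups, which is imprecise."""
    sentences = max(1, len(re.findall(r'[.!?]+', text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    syllables = sum(max(1, len(re.findall(r'[aeiouy]+', w.lower()))) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))
```

Short, monosyllabic sentences score above 100; dense technical prose typically lands below 50.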
Establishing a Quality Rubric
Create a standardized rubric for manual evaluation:
Task: Code Generation Quality
Criteria:
- Correctness (40%):
  - Code compiles without errors: Yes/No
  - All tests pass: Yes/No
  - Score: (tests_passed / total_tests) × 40
- Style (30%):
  - Follows naming conventions: Yes/No (10 points)
  - Proper indentation/formatting: Yes/No (10 points)
  - Docstrings present and clear: Yes/No (10 points)
- Efficiency (20%):
  - Time complexity acceptable: Yes/No (10 points)
  - Space complexity acceptable: Yes/No (10 points)
- Readability (10%):
  - Variable names self-documenting: Yes/No (5 points)
  - Code easy to follow: Yes/No (5 points)
Final Score: sum of weighted scores (out of 100)
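A rubric like this can be applied mechanically once a reviewer records their Yes/No answers. A sketch, with hypothetical field names and the two readability checks assumed to be worth 5 points each:

```python
def score_code_generation(checks):
    """Compute a 0-100 rubric score from a reviewer's raw answers.
    Weights mirror the rubric: 40% correctness, 30% style,
    20% efficiency, 10% readability."""
    correctness = 0
    if checks["compiles"]:
        correctness = (checks["tests_passed"] / checks["total_tests"]) * 40
    style = 10 * sum([checks["naming_ok"], checks["formatting_ok"], checks["docstrings_ok"]])
    efficiency = 10 * sum([checks["time_ok"], checks["space_ok"]])
    readability = 5 * sum([checks["names_clear"], checks["easy_to_follow"]])
    return correctness + style + efficiency + readability
```

Gating correctness on compilation means code that does not compile scores at most 60, which matches the intent of weighting correctness heaviest.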
2. Quality Assessment Frameworks
A complete quality assessment uses multiple dimensions. Track them in parallel.
Task Completion Rate
The percentage of requests that produce valid, usable output.
Completion Rate = (Requests with valid output) / (Total requests) × 100
Valid Output = meets schema requirements + no timeout/error
Baseline: most harnesses should achieve a completion rate of ≥95% at launch; below 90% indicates structural issues.
Tracking:
def calculate_completion_rate(results):
    valid = sum(1 for r in results if r.get('status') == 'success')
    return (valid / len(results)) * 100
Error Rate by Category
Not all errors are equal. Categorize to identify patterns:
- Schema/Format Errors: Output doesn’t match expected format
- Timeout Errors: Execution exceeded time limit
- Hallucination: Model claims facts not in input
- Incomplete Output: Partial or cut-off response
- Logic Errors: Wrong reasoning or conclusion
- Resource Errors: OOM, rate limit hit
- Tool Failures: Called tools errored or unavailable
def categorize_errors(results):
    errors = {}
    for r in results:
        if r.get('status') != 'success':
            category = r.get('error_category', 'unknown')
            errors[category] = errors.get(category, 0) + 1
    return {k: (v / len(results)) * 100 for k, v in errors.items()}
Dashboard view:
Schema Errors: 2.3%
Timeouts: 0.1%
Hallucinations: 1.8%
Incomplete Output: 0.6%
Logic Errors: 1.2%
Total Error Rate: 6.0%
Latency (Speed)
Measure end-to-end time and components:
- Time to First Token: Perceived responsiveness (target: <2 sec)
- Total Latency: Complete request (target: <30 sec for most tasks)
- Token Generation Rate: Tokens/second (typically 30-80)
- Percentiles: p50, p95, p99 (outliers matter for user experience)
def analyze_latency(results):
    latencies = [r['latency_ms'] for r in results if r.get('latency_ms')]
    return {
        'mean': np.mean(latencies),
        'median': np.median(latencies),
        'p95': np.percentile(latencies, 95),
        'p99': np.percentile(latencies, 99),
        'max': max(latencies)
    }
Action thresholds:
- p95 > 30s: Investigate timeout handling
- p99 > 60s: Consider circuit breakers or timeouts
- Mean > Median + 5s: Check for outlier cases
Cost Per Success
Track API costs alongside quality:
Cost per Success = (Total API cost) / (Successful outputs)
Cost per Quality Point = (Cost per success) / (Average quality score)
Example:
- 1000 requests, 950 successful
- API cost: $15
- Cost per success: $0.0158
- If quality score averages 85/100: cost per quality point ≈ $0.000186
Use this to compare models:
Claude 3 Opus: $0.0158 per success, 92% quality
Claude 3 Haiku: $0.0045 per success, 78% quality
GPT-4: $0.0312 per success, 94% quality
Trade-offs become visible: is the 14-point quality difference worth 3.5× the cost?
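The formulas above translate directly into a small helper (the numbers below are from the example):

```python
def cost_metrics(total_cost_usd, total_requests, successful, avg_quality):
    """Cost per success and per quality point for one model."""
    cost_per_success = total_cost_usd / successful
    return {
        'success_rate': successful / total_requests,
        'cost_per_success': cost_per_success,
        'cost_per_quality_point': cost_per_success / avg_quality,
    }
```

Running it once per candidate model yields directly comparable rows for the table above.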
Hallucination Detection
Hallucinations, false claims presented as facts, are a critical safety risk.
Detection approaches:
1. Reference-based: Compare output against ground truth

   def detect_hallucinations(output, context):
       claims = extract_claims(output)
       for claim in claims:
           if not verify_in_context(claim, context):
               return True
       return False

2. Confidence-based: Flag low-confidence generations

   # During generation, track per-token probabilities
   hallucination_score = 1 - mean_token_probability
   if hallucination_score > 0.3:  # threshold
       flag_for_review()

3. Self-consistency: Run multiple times, compare outputs

   def self_consistency_check(prompt, runs=3):
       outputs = [generate(prompt) for _ in range(runs)]
       # If outputs diverge significantly, likely hallucinating
       similarity = compare_outputs(outputs)
       if similarity < 0.7:
           return "HALLUCINATION_RISK"

4. Human review: Sample outputs for manual verification
   - Review a random 5% of outputs
   - Track hallucination rate by category
   - Flag prompts that consistently hallucinate
Tracking:
Hallucination Detection (sample of 100 outputs):
- Fact hallucinations: 3 (3%)
- Code hallucinations: 1 (1%)
- None detected: 96 (96%)
Overall Hallucination Rate: 4%
Consistency (Same Input → Same Output Type)
Determinism is not always possible or desirable, but consistency in type and quality matters.
def measure_consistency(prompt, runs=10):
    outputs = [generate(prompt) for _ in range(runs)]
    # Check type consistency
    types = [get_type(o) for o in outputs]
    type_consistency = types.count(types[0]) / len(types)
    # Check quality consistency
    scores = [score_output(o) for o in outputs]
    score_stddev = np.std(scores)
    score_mean = np.mean(scores)
    return {
        'type_consistency': type_consistency,
        'score_mean': score_mean,
        'score_stddev': score_stddev,
        'acceptable': type_consistency > 0.9 and score_stddev < 10
    }
Action: If consistency < 90%, investigate:
- Is the prompt ambiguous?
- Are outputs actually different quality?
- Should the prompt be more prescriptive?
User Satisfaction (If Applicable)
If real users interact with harness outputs, measure satisfaction:
- Rating scale: “Was this response helpful?” (1-5 stars)
- Net Promoter Score: “Would you recommend this?” (0-10)
- Time-to-resolution: How quickly did user get answer?
- Escalation rate: % requiring human follow-up
- Re-query rate: Did user ask follow-up/reformulated question?
def calculate_satisfaction(feedback_events):
    ratings = [e['rating'] for e in feedback_events if e.get('rating')]
    nps_scores = [e['nps'] for e in feedback_events if e.get('nps')]
    return {
        'avg_rating': np.mean(ratings) if ratings else None,
        'nps': np.mean(nps_scores) if nps_scores else None,
        'median_resolution_time': np.median([e['time'] for e in feedback_events]),
        'escalation_rate': sum(1 for e in feedback_events if e.get('escalated')) / len(feedback_events)
    }
3. Continuous Benchmarking
Continuous measurement reveals trends and catches regressions early.
Benchmark Suite Creation
A good benchmark suite is:
- Representative: Covers the distribution of real usage
- Challenging: Includes edge cases and difficult examples
- Diverse: Spans difficulty levels and task variations
- Reproducible: Deterministic inputs and evaluation
- Maintainable: Easy to add new cases
Creating a benchmark:
class BenchmarkSuite:
    def __init__(self, name, task_type):
        self.name = name
        self.task_type = task_type
        self.cases = []
        self.metadata = {}

    def add_case(self, input_text, expected_output, difficulty='medium',
                 category=None, notes=''):
        self.cases.append({
            'input': input_text,
            'expected': expected_output,
            'difficulty': difficulty,
            'category': category,
            'notes': notes,
            'id': f"{self.name}_{len(self.cases)}"
        })

    def to_json(self):
        return {
            'name': self.name,
            'task_type': self.task_type,
            'count': len(self.cases),
            'difficulty_distribution': self._count_difficulty(),
            'cases': self.cases
        }

    def _count_difficulty(self):
        counts = {}
        for case in self.cases:
            d = case['difficulty']
            counts[d] = counts.get(d, 0) + 1
        return counts
Example benchmark for code generation:
{
  "name": "python_basics",
  "task_type": "code_generation",
  "count": 40,
  "difficulty_distribution": {
    "easy": 15,
    "medium": 15,
    "hard": 10
  },
  "cases": [
    {
      "id": "python_basics_0",
      "difficulty": "easy",
      "category": "loops",
      "input": "Write a function that prints numbers 1 to 10",
      "expected": "def print_numbers():\n    for i in range(1, 11):\n        print(i)",
      "notes": "Basic loop, should compile and run without error"
    },
    {
      "id": "python_basics_1",
      "difficulty": "medium",
      "category": "algorithms",
      "input": "Implement binary search on a sorted list",
      "expected": "def binary_search(arr, target):\n    left, right = 0, len(arr) - 1\n    ...",
      "notes": "Should handle edge cases: empty list, not found, etc."
    }
  ]
}
Baseline Establishment
Run your benchmark suite on the baseline model and record results:
def establish_baseline(harness, benchmark_suite, model_name):
    results = []
    for case in benchmark_suite.cases:
        start = time.time()
        try:
            output = harness.generate(case['input'])
            latency = time.time() - start
            score = evaluate_output(output, case['expected'])
            results.append({
                'case_id': case['id'],
                'output': output,
                'score': score,
                'latency_ms': latency * 1000,
                'success': True
            })
        except Exception as e:
            results.append({
                'case_id': case['id'],
                'error': str(e),
                'success': False
            })
    successful = [r for r in results if r['success']]
    baseline = {
        'model': model_name,
        'timestamp': datetime.now().isoformat(),
        'suite': benchmark_suite.name,
        'total': len(results),
        'successful': len(successful),
        'avg_score': np.mean([r['score'] for r in successful]),
        'avg_latency_ms': np.mean([r['latency_ms'] for r in successful]),
        'p95_latency_ms': np.percentile([r['latency_ms'] for r in successful], 95),
        'results': results
    }
    save_baseline(baseline, f"baselines/{model_name}_{benchmark_suite.name}.json")
    return baseline
Example baseline output:
Model: Claude 3 Haiku
Suite: python_basics
Timestamp: 2026-04-18T14:32:00Z
Total cases: 40
Successful: 38 (95%)
Failed: 2 (5%)
Metrics:
- Average score: 87.5 / 100
- Median score: 90 / 100
- Score variance: 12.3
- Avg latency: 1,240 ms
- p95 latency: 2,100 ms
- p99 latency: 3,500 ms
Error breakdown:
- Timeout: 1
- Schema error: 1
By difficulty:
- Easy (15 cases): 93% success, 92 avg score
- Medium (15 cases): 93% success, 87 avg score
- Hard (10 cases): 100% success, 82 avg score
Regular Re-running (Daily/Weekly)
Schedule benchmarks to run automatically:
# schedule_benchmarks.py
import schedule
import time

def run_daily_benchmarks():
    """Run at 2 AM UTC daily"""
    suites = load_benchmark_suites()
    for suite in suites:
        baseline = load_latest_baseline(suite.name)
        current = establish_baseline(harness, suite, 'latest')
        # Compare and alert if regression
        check_regressions(baseline, current)
        # Store for trend analysis
        store_result(current)

def run_weekly_analysis():
    """Run at Monday 9 AM UTC weekly"""
    analyze_trends()
    generate_weekly_report()

schedule.every().day.at("02:00").do(run_daily_benchmarks)
schedule.every().monday.at("09:00").do(run_weekly_analysis)

while True:
    schedule.run_pending()
    time.sleep(60)
Or use CI to run on every commit:
# .github/workflows/benchmark.yml
name: Continuous Benchmarking
on:
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM UTC
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run benchmarks
        run: python scripts/benchmark.py --suite all --compare latest
      - name: Upload results
        uses: actions/upload-artifact@v3
        with:
          name: benchmark-results
          path: results/
Regression Detection
Compare current results to baseline. Flag significant drops:
def check_regressions(baseline, current, thresholds=None):
    thresholds = thresholds or {
        'score_drop': 5,         # Drop of 5 points = regression
        'latency_increase': 20,  # 20% slower = regression
        'success_rate_drop': 2   # Drop of 2 percentage points = regression
    }
    regressions = []
    # Score regression
    score_delta = current['avg_score'] - baseline['avg_score']
    if score_delta < -thresholds['score_drop']:
        regressions.append({
            'type': 'score',
            'baseline': baseline['avg_score'],
            'current': current['avg_score'],
            'delta': score_delta,
            'severity': 'critical' if score_delta < -10 else 'warning'
        })
    # Latency regression
    latency_delta_pct = (
        (current['avg_latency_ms'] - baseline['avg_latency_ms'])
        / baseline['avg_latency_ms'] * 100
    )
    if latency_delta_pct > thresholds['latency_increase']:
        regressions.append({
            'type': 'latency',
            'baseline_ms': baseline['avg_latency_ms'],
            'current_ms': current['avg_latency_ms'],
            'delta_pct': latency_delta_pct,
            'severity': 'warning' if latency_delta_pct < 50 else 'critical'
        })
    # Success rate regression
    success_delta = (
        (current['successful'] / current['total'] * 100) -
        (baseline['successful'] / baseline['total'] * 100)
    )
    if success_delta < -thresholds['success_rate_drop']:
        regressions.append({
            'type': 'success_rate',
            'baseline_pct': baseline['successful'] / baseline['total'] * 100,
            'current_pct': current['successful'] / current['total'] * 100,
            'delta_pct': success_delta,
            'severity': 'critical'
        })
    return regressions
Alerting When Regression Detected
def alert_regression(regressions, channel='slack'):
    if not regressions:
        return
    critical = [r for r in regressions if r['severity'] == 'critical']
    warnings = [r for r in regressions if r['severity'] == 'warning']
    if critical:
        send_alert(
            channel=channel,
            level='critical',
            title='Critical Quality Regression Detected',
            regressions=critical
        )
        create_incident_issue()
    if warnings:
        send_alert(
            channel=channel,
            level='warning',
            title='Quality Regression Detected',
            regressions=warnings
        )

def send_alert(channel, level, title, regressions):
    if channel == 'slack':
        slack_client.post_message(
            channel='#quality-alerts',
            attachments=[{
                'color': 'danger' if level == 'critical' else 'warning',
                'title': title,
                'fields': [
                    {
                        'title': f"{r['type'].title()} Regression",
                        'value': f"Baseline: {r.get('baseline', r.get('baseline_pct'))} → "
                                 f"Current: {r.get('current', r.get('current_pct'))}",
                        'short': False
                    } for r in regressions
                ]
            }]
        )
4. Comparing Models
When evaluating new models, use rigorous comparison.
A/B Testing Models on Same Tasks
Run both models on the same benchmark suite:
def compare_models(benchmark_suite, models):
    """Run multiple models on the same benchmark cases.
    Returns per-case results keyed by model; pass the dict to
    create_comparison_report() for summary statistics."""
    results = {}
    for model_name in models:
        harness = initialize_harness(model_name)
        model_results = []
        for case in benchmark_suite.cases:
            output = harness.generate(case['input'])
            score = evaluate_output(output, case['expected'])
            model_results.append({
                'case_id': case['id'],
                'score': score,
                'output': output
            })
        results[model_name] = model_results
    return results
def create_comparison_report(results, benchmark_suite):
    report = {
        'timestamp': datetime.now().isoformat(),
        'suite': benchmark_suite.name,
        'models': {}
    }
    for model_name, model_results in results.items():
        scores = [r['score'] for r in model_results]
        report['models'][model_name] = {
            'mean_score': np.mean(scores),
            'median_score': np.median(scores),
            'stddev': np.std(scores),
            'min_score': min(scores),
            'max_score': max(scores),
            'percentile_25': np.percentile(scores, 25),
            'percentile_75': np.percentile(scores, 75)
        }
    # Rank models
    ranked = sorted(
        report['models'].items(),
        key=lambda x: x[1]['mean_score'],
        reverse=True
    )
    report['ranking'] = [m[0] for m in ranked]
    return report
Example comparison:
Benchmark Suite: python_basics
Timestamp: 2026-04-18
Model Comparison (mean score):
1. Claude 3 Opus: 87.5 / 100 (stddev: 12.3)
2. GPT-4 Turbo: 85.6 / 100 (stddev: 13.2)
3. Claude 3 Sonnet: 84.2 / 100 (stddev: 14.1)
4. Claude 3 Haiku: 79.8 / 100 (stddev: 16.5)
Model Performance by Difficulty:
              Easy  Medium  Hard
Opus:          94     87     81
GPT-4 Turbo:   91     85     80
Sonnet:        92     84     76
Haiku:         88     80     70
Statistical Significance
Don’t rely on point estimates. Determine if differences are real:
def statistical_significance_test(model_a_scores, model_b_scores, alpha=0.05):
    """Run a paired t-test between two models scored on the same cases"""
    # Paired test requires the same test cases, in the same order
    assert len(model_a_scores) == len(model_b_scores)
    a = np.asarray(model_a_scores, dtype=float)
    b = np.asarray(model_b_scores, dtype=float)
    t_statistic, p_value = scipy.stats.ttest_rel(a, b)
    is_significant = p_value < alpha
    mean_diff = float(np.mean(a) - np.mean(b))
    confidence_interval = scipy.stats.t.interval(
        1 - alpha,
        len(a) - 1,
        loc=mean_diff,
        scale=scipy.stats.sem(a - b)
    )
    return {
        'significant': is_significant,
        'p_value': p_value,
        'mean_difference': mean_diff,
        'confidence_interval_95': confidence_interval,
        'recommendation': (
            "Model A is significantly better" if mean_diff > 0 and is_significant
            else "Model B is significantly better" if mean_diff < 0 and is_significant
            else "No significant difference detected"
        )
    }
# Usage
results = compare_models(suite, ['opus', 'haiku'])
opus_scores = [r['score'] for r in results['opus']]
haiku_scores = [r['score'] for r in results['haiku']]
sig_test = statistical_significance_test(opus_scores, haiku_scores)
print(f"P-value: {sig_test['p_value']:.4f}")
print(f"Mean difference: {sig_test['mean_difference']:.2f}")
print(sig_test['recommendation'])
Interpretation:
- p-value < 0.05: Difference is statistically significant (95% confidence)
- p-value ≥ 0.05: Difference could be due to random variation
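When score distributions are skewed or heavy-tailed, the t-test's normality assumption is questionable. A paired bootstrap is a useful complement: resample the per-case score differences and report how often model A's mean beats model B's. A stdlib-only sketch (the iteration count and interpretation thresholds are conventions, not requirements):

```python
import random

def paired_bootstrap(scores_a, scores_b, iterations=10_000, seed=0):
    """Fraction of bootstrap resamples in which A's mean exceeds B's.
    Values near 1.0 suggest A is reliably better; near 0.5, no evidence."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    wins = 0
    for _ in range(iterations):
        # Resample case-level differences with replacement
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n > 0:
            wins += 1
    return wins / iterations
```

Because the resampling is paired at the case level, per-case difficulty cancels out, just as in the paired t-test.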
Trade-offs: Faster vs More Accurate
Use a cost-quality scatter plot to visualize trade-offs:
def cost_quality_analysis(models, benchmark_suite):
    """Analyze cost vs quality trade-off"""
    analysis = []
    for model in models:
        results = benchmark_suite.run(model)
        analysis.append({
            'model': model.name,
            'quality_score': results['avg_score'],
            'cost_per_call': model.cost_usd,
            'latency_ms': results['avg_latency_ms'],
            'tokens_per_call': results['avg_tokens']
        })
    return analysis
# Plotting
def plot_cost_quality(analysis):
    models = [a['model'] for a in analysis]
    costs = [a['cost_per_call'] for a in analysis]
    quality = [a['quality_score'] for a in analysis]
    plt.scatter(costs, quality, s=200)
    for i, model in enumerate(models):
        plt.annotate(model, (costs[i], quality[i]))
    plt.xlabel('Cost per call ($)')
    plt.ylabel('Quality score (0-100)')
    plt.title('Cost vs Quality Trade-off')
    plt.grid(True, alpha=0.3)
    plt.show()
Example decision matrix:
| Model | Quality | Cost/Call | Latency | Recommendation |
|---|---|---|---|---|
| Opus | 92 | $0.015 | 1800 ms | Use for high-stakes tasks |
| Sonnet | 85 | $0.003 | 1200 ms | Use for standard tasks |
| Haiku | 78 | $0.0008 | 600 ms | Use for high-volume, cost-sensitive tasks |
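A decision matrix like this can back a simple routing function in the harness itself. The thresholds and model names below are illustrative, taken from the example table, not recommendations:

```python
def choose_model(task, budget_per_call_usd):
    """Route a request to a model tier based on stakes and budget.
    Thresholds mirror the example decision matrix above (hypothetical)."""
    if task.get('high_stakes') and budget_per_call_usd >= 0.015:
        return 'opus'
    if budget_per_call_usd >= 0.003:
        return 'sonnet'
    return 'haiku'
```

Encoding the matrix in code keeps routing decisions auditable and easy to update when a new comparison run changes the numbers.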
5. Leaderboard Tracking
Track model performance over time to identify progress and regressions.
Version Tracking
Maintain a leaderboard across model versions:
class PerformanceLeaderboard:
    def __init__(self, benchmark_name):
        self.benchmark_name = benchmark_name
        self.entries = []

    def add_result(self, model_version, score, latency_ms, timestamp, commit_hash=''):
        entry = {
            'model': model_version,
            'score': score,
            'latency_ms': latency_ms,
            'timestamp': timestamp,
            'commit': commit_hash,
            'date': datetime.fromisoformat(timestamp).date()
        }
        self.entries.append(entry)

    def get_leaderboard(self, limit=20):
        """Return top performers"""
        sorted_entries = sorted(self.entries, key=lambda x: x['score'], reverse=True)
        return sorted_entries[:limit]

    def get_trend(self, model_version, days=30):
        """Get performance trend for a specific model"""
        cutoff = datetime.now() - timedelta(days=days)
        relevant = [
            e for e in self.entries
            if e['model'] == model_version and datetime.fromisoformat(e['timestamp']) > cutoff
        ]
        return sorted(relevant, key=lambda x: x['timestamp'])
Example leaderboard:
Rank Model Score Latency Date Commit
─────────────────────────────────────────────────────────────────
1. claude-opus-20240229 92.5 1.2s 2026-04-18 a7f3e9
2. claude-opus-20240115 92.1 1.3s 2026-04-17 b2d8c1
3. claude-sonnet-20240229 85.3 0.8s 2026-04-18 a7f3e9
4. gpt-4-20240229 84.8 2.1s 2026-04-17 (external)
5. claude-sonnet-20240115 84.6 0.9s 2026-04-16 b2d8c1
Metric History
Store all metrics over time for trend analysis:
def store_metric_history(model, benchmark_name, metrics, timestamp):
    """Store metrics in time-series database"""
    db.save({
        'model': model,
        'benchmark': benchmark_name,
        'timestamp': timestamp,
        'metrics': {
            'score': metrics['avg_score'],
            'latency_p50': metrics['latency_p50'],
            'latency_p95': metrics['latency_p95'],
            'latency_p99': metrics['latency_p99'],
            'success_rate': metrics['success_rate'],
            'error_rate': metrics['error_rate'],
            'cost_per_success': metrics['cost_per_success']
        }
    })

def analyze_trend(model, benchmark_name, metric='score', days=90):
    """Analyze trend for a metric over time"""
    cutoff = datetime.now() - timedelta(days=days)
    history = db.query({
        'model': model,
        'benchmark': benchmark_name,
        'timestamp': {'$gte': cutoff}
    })
    values = [h['metrics'][metric] for h in history]
    # Linear regression to find trend
    x = np.arange(len(values))
    slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, values)
    return {
        'metric': metric,
        'current_value': values[-1],
        'start_value': values[0],
        'delta': values[-1] - values[0],
        # A positive slope is "improving" only for higher-is-better metrics
        # like score; invert the interpretation for latency, error rate, cost
        'trend': 'improving' if slope > 0 else 'declining',
        'slope': slope,
        'r_squared': r_value ** 2,
        'statistically_significant': p_value < 0.05
    }
Correlating Improvements with Changes
Link performance jumps to specific changes:
def correlate_improvements(benchmark_name, git_log, metrics_history):
    """Find which commits caused quality improvements"""
    improvements = []
    for i in range(1, len(metrics_history)):
        prev = metrics_history[i-1]
        current = metrics_history[i]
        score_delta = current['metrics']['score'] - prev['metrics']['score']
        if score_delta > 2:  # Threshold for "improvement"
            # Find commit between these measurements
            commit = find_commit_between(
                prev['timestamp'],
                current['timestamp'],
                git_log
            )
            improvements.append({
                'timestamp': current['timestamp'],
                'improvement': score_delta,
                'commit': commit,
                'message': commit.message if commit else None,
                'author': commit.author if commit else None
            })
    return improvements

# Usage
improvements = correlate_improvements('python_basics', git_log, metrics_history)
for imp in improvements:
    print(f"Score +{imp['improvement']:.1f} - {imp['message']}")
Identifying Best Configurations
When you have many variables (model, temperature, max_tokens, prompt variation), find the best:
def find_best_configuration(benchmark_suite, config_grid):
    """Grid search across configurations"""
    results = []
    for config in config_grid:
        harness = initialize_harness(**config)
        scores = []
        for case in benchmark_suite.cases:
            output = harness.generate(case['input'])
            score = evaluate_output(output, case['expected'])
            scores.append(score)
        results.append({
            'config': config,
            'mean_score': np.mean(scores),
            'std_dev': np.std(scores),
            'min_score': min(scores),
            'max_score': max(scores)
        })
    # Rank by score, then by variance (prefer stable configs)
    ranked = sorted(
        results,
        key=lambda x: (x['mean_score'], -x['std_dev']),
        reverse=True
    )
    return ranked

# Example config grid
config_grid = [
    {'model': 'opus', 'temperature': 0.7, 'max_tokens': 1000},
    {'model': 'opus', 'temperature': 0.5, 'max_tokens': 1000},
    {'model': 'opus', 'temperature': 0.7, 'max_tokens': 2000},
    {'model': 'sonnet', 'temperature': 0.7, 'max_tokens': 1000},
]
best = find_best_configuration(suite, config_grid)
print(f"Best config: {best[0]['config']}")
print(f"Score: {best[0]['mean_score']:.1f} ± {best[0]['std_dev']:.1f}")
6. Prompt Optimization Through Benchmarking
Use your benchmark suite to systematically improve prompts.
Establish Baseline with Current Prompt
def baseline_prompt_optimization(benchmark_suite, current_prompt):
    harness = initialize_harness(system_prompt=current_prompt)
    results = []
    for case in benchmark_suite.cases:
        output = harness.generate(case['input'])
        score = evaluate_output(output, case['expected'])
        results.append({
            'case_id': case['id'],
            'score': score,
            'output': output
        })
    baseline = {
        'prompt': current_prompt,
        'avg_score': np.mean([r['score'] for r in results]),
        'results': results,
        'timestamp': datetime.now().isoformat()
    }
    return baseline
Try Variations
Create prompt variants and test them:
def generate_prompt_variants(base_prompt, variations_config):
    """Generate prompt variations"""
    variants = []
    # Variation 1: Add explicit instructions
    variants.append({
        'name': 'explicit_instructions',
        'prompt': base_prompt + "\n\nBe explicit about your reasoning steps."
    })
    # Variation 2: Few-shot examples
    variants.append({
        'name': 'few_shot',
        'prompt': base_prompt + generate_few_shot_examples()
    })
    # Variation 3: Chain-of-thought
    variants.append({
        'name': 'chain_of_thought',
        'prompt': base_prompt + "\n\nThink step by step."
    })
    # Variation 4: Structured output
    variants.append({
        'name': 'structured_output',
        'prompt': base_prompt + "\n\nReturn results as JSON with keys: [...]"
    })
    # Variation 5: Role-based
    variants.append({
        'name': 'expert_role',
        'prompt': "You are an expert " + base_prompt
    })
    return variants
def test_prompt_variants(benchmark_suite, variants):
    """Test all variants; assumes `baseline` from the baseline step is in scope"""
    results = {}
    for variant in variants:
        harness = initialize_harness(system_prompt=variant['prompt'])
        variant_results = []
        for case in benchmark_suite.cases:
            output = harness.generate(case['input'])
            score = evaluate_output(output, case['expected'])
            variant_results.append(score)
        results[variant['name']] = {
            'avg_score': np.mean(variant_results),
            'std_dev': np.std(variant_results),
            'improvement': np.mean(variant_results) - baseline['avg_score'],
            'scores': variant_results
        }
    return results
Measure Impact
def analyze_prompt_impact(baseline_score, variant_results):
    """Determine which variants improved performance"""
    analysis = []
    for name, metrics in variant_results.items():
        analysis.append({
            'variant': name,
            'score': metrics['avg_score'],
            'improvement': metrics['improvement'],
            'improvement_pct': (metrics['improvement'] / baseline_score) * 100,
            'stability': 1 - (metrics['std_dev'] / metrics['avg_score']),  # Higher = more stable
            'recommendation': 'adopt' if metrics['improvement'] > 1 else 'reject'
        })
    return sorted(analysis, key=lambda x: x['improvement'], reverse=True)

# Example output
# variant            score  improvement  improvement%  stability  recommendation
# few_shot            84.3      +3.2        +3.9%         0.87       adopt
# chain_of_thought    83.1      +2.0        +2.4%         0.85       adopt
# explicit_inst       81.5      +0.2        +0.3%         0.86       reject
# structured_output   80.2      -1.0        -1.2%         0.82       reject
# expert_role         81.8      +0.5        +0.6%         0.84       reject
Iterate Toward Improvement
Combine successful variants:
def iterative_prompt_improvement(benchmark_suite, baseline_prompt, iterations=3):
    current_best = baseline_prompt
    best_score = evaluate_prompt(benchmark_suite, current_best)
    for iteration in range(iterations):
        print(f"\n--- Iteration {iteration + 1} ---")
        print(f"Current best score: {best_score:.1f}")
        # Generate variants based on current best
        variants = generate_prompt_variants(current_best, {
            'add_examples': True,
            'add_structure': True,
            'add_reasoning': True
        })
        results = test_prompt_variants(benchmark_suite, variants)
        # Find best variant
        best_variant = max(
            results.items(),
            key=lambda x: x[1]['improvement']
        )
        if best_variant[1]['improvement'] > 0.5:
            current_best = get_variant_prompt(best_variant[0])
            best_score = best_variant[1]['avg_score']
            print(f"Improvement found: {best_variant[0]} (+{best_variant[1]['improvement']:.1f})")
        else:
            print("No improvement found. Stopping.")
            break
    return current_best, best_score
7. Tool Evaluation
If your harness uses multiple tools, measure their utility and reliability.
Tool Usage Metrics
def analyze_tool_usage(execution_logs, harness_name):
    """Analyze which tools are used and how often"""
    tool_calls = {}
    tool_errors = {}
    tool_latencies = {}
    for log in execution_logs:
        for call in log.get('tool_calls', []):
            tool_name = call['tool']
            # Count usage
            tool_calls[tool_name] = tool_calls.get(tool_name, 0) + 1
            # Track errors
            if call.get('error'):
                tool_errors[tool_name] = tool_errors.get(tool_name, 0) + 1
            # Track latency
            latency = call.get('latency_ms', 0)
            if tool_name not in tool_latencies:
                tool_latencies[tool_name] = []
            tool_latencies[tool_name].append(latency)
    # Compute summary stats
    summary = {}
    for tool_name in tool_calls.keys():
        total = tool_calls[tool_name]
        errors = tool_errors.get(tool_name, 0)
        latencies = tool_latencies.get(tool_name, [])
        summary[tool_name] = {
            'calls': total,
            'error_rate': (errors / total) * 100,
            'avg_latency_ms': np.mean(latencies),
            'p95_latency_ms': np.percentile(latencies, 95),
            'usage_pct': (total / sum(tool_calls.values())) * 100
        }
    return summary

# Example output
# Tool        Calls  Error%  AvgLatency  P95Latency  Usage%
# search        450    2.1%       450ms       890ms     45%
# wikipedia     320    1.2%       380ms       750ms     32%
# calculator    150    0.0%       120ms       200ms     15%
# translate      80    5.0%       520ms      1200ms      8%
Tool Error Rates
def categorize_tool_errors(execution_logs):
    """Break down tool errors by category"""
    error_categories = {}
    for log in execution_logs:
        for call in log.get('tool_calls', []):
            if call.get('error'):
                tool = call['tool']
                error = call['error']
                # Categorize
                if 'timeout' in error.lower():
                    category = 'timeout'
                elif 'rate limit' in error.lower():
                    category = 'rate_limit'
                elif 'not found' in error.lower():
                    category = 'not_found'
                elif 'invalid' in error.lower():
                    category = 'invalid_input'
                else:
                    category = 'other'
                key = f"{tool}_{category}"
                error_categories[key] = error_categories.get(key, 0) + 1
    return error_categories
Tool Latency Impact
def tool_latency_impact_analysis(execution_logs):
    """Measure how tool latency affects overall response time"""
    impacts = {
        'zero_tools': [],
        'one_tool': [],
        'two_tools': [],
        'three_plus_tools': []
    }
    for log in execution_logs:
        num_tools = len(log.get('tool_calls', []))
        total_latency = log.get('latency_ms', 0)
        if num_tools == 0:
            impacts['zero_tools'].append(total_latency)
        elif num_tools == 1:
            impacts['one_tool'].append(total_latency)
        elif num_tools == 2:
            impacts['two_tools'].append(total_latency)
        else:
            impacts['three_plus_tools'].append(total_latency)
    return {
        k: {'mean': np.mean(v), 'p95': np.percentile(v, 95)}
        for k, v in impacts.items() if v
    }

# Shows latency increases with tool count
# zero_tools:        mean=1200ms  p95=1800ms
# one_tool:          mean=2100ms  p95=3500ms
# two_tools:         mean=3200ms  p95=5100ms
# three_plus_tools:  mean=4500ms  p95=7200ms
When to Deprecate/Replace Tools
Make decisions based on data:
def evaluate_tool_replacement(tool_name, current_metrics, replacement_metrics):
"""Decide whether to replace a tool"""
decision = {
'tool': tool_name,
'current': current_metrics,
'replacement': replacement_metrics,
'factors': {}
}
# 1. Error rate
if replacement_metrics['error_rate'] < current_metrics['error_rate'] * 0.8:
decision['factors']['error_reduction'] = 'favorable'
# 2. Latency
if replacement_metrics['avg_latency'] < current_metrics['avg_latency'] * 0.8:
decision['factors']['latency_improvement'] = 'favorable'
# 3. Accuracy (if available)
if replacement_metrics.get('accuracy', 100) > current_metrics.get('accuracy', 100):
decision['factors']['accuracy_improvement'] = 'favorable'
# 4. Cost
if replacement_metrics.get('cost') and replacement_metrics['cost'] < current_metrics.get('cost', float('inf')):
decision['factors']['cost_savings'] = 'favorable'
favorable = sum(1 for v in decision['factors'].values() if v == 'favorable')
if favorable >= 2:
decision['recommendation'] = 'REPLACE'
elif favorable == 1:
decision['recommendation'] = 'CONSIDER'
else:
decision['recommendation'] = 'KEEP'
return decision
8. Automated Quality Testing
Build tests that run on every commit to catch regressions immediately.
Regex/Schema Validation
def validate_schema(output, schema_definition):
"""Validate output matches expected schema"""
try:
data = json.loads(output)
jsonschema.validate(instance=data, schema=schema_definition)
return True, None
except json.JSONDecodeError as e:
return False, f"Invalid JSON: {e}"
except jsonschema.ValidationError as e:
return False, f"Schema mismatch: {e.message}"
# Example schema
CODE_GEN_SCHEMA = {
"type": "object",
"properties": {
"code": {"type": "string"},
"explanation": {"type": "string"},
"test_cases": {
"type": "array",
"items": {"type": "string"}
}
},
"required": ["code", "explanation"],
"additionalProperties": False
}
@pytest.mark.quality
def test_schema_compliance(harness, benchmark_case):
output = harness.generate(benchmark_case['input'])
is_valid, error = validate_schema(output, CODE_GEN_SCHEMA)
assert is_valid, f"Schema validation failed: {error}"
Semantic Validation
def semantic_validation(output, context):
"""Check if output makes sense given context"""
checks = []
# 1. Non-empty output
if not output or len(output.strip()) < 10:
checks.append(('empty_output', False))
else:
checks.append(('empty_output', True))
# 2. Output addresses input (simple word overlap)
input_words = set(context['input'].lower().split())
output_words = set(output.lower().split())
overlap = len(input_words & output_words) / max(len(input_words), 1)
checks.append(('input_address', overlap > 0.2))
# 3. No common error phrases
error_phrases = ['error', 'failed', 'unable', 'cannot', 'invalid']
has_errors = any(phrase in output.lower() for phrase in error_phrases)
checks.append(('no_errors', not has_errors))
# 4. Reasonable length (task-dependent)
if context.get('expected_length') == 'short':
length_ok = 50 < len(output) < 500
elif context.get('expected_length') == 'long':
length_ok = 500 < len(output) < 5000
else:
length_ok = 20 < len(output) < 10000
checks.append(('reasonable_length', length_ok))
all_pass = all(check[1] for check in checks)
return all_pass, checks
@pytest.mark.quality
def test_semantic_validity(harness, benchmark_case):
output = harness.generate(benchmark_case['input'])
is_valid, checks = semantic_validation(output, benchmark_case)
failed = [c[0] for c in checks if not c[1]]
assert is_valid, f"Semantic validation failed: {failed}"
Factual Correctness
def factual_correctness_test(output, ground_truth, threshold=0.8):
"""Compare output against ground truth using multiple methods"""
# Method 1: Exact match
if output.strip() == ground_truth.strip():
return True, 'exact_match', 1.0
# Method 2: Semantic similarity (using embeddings)
embedding_similarity = compute_embedding_similarity(output, ground_truth)
if embedding_similarity > threshold:
return True, 'semantic_similarity', embedding_similarity
# Method 3: Key concepts overlap
output_concepts = extract_key_concepts(output)
truth_concepts = extract_key_concepts(ground_truth)
concept_overlap = len(output_concepts & truth_concepts) / len(truth_concepts)
if concept_overlap > threshold:
return True, 'concept_overlap', concept_overlap
return False, 'no_match', max(embedding_similarity, concept_overlap)
@pytest.mark.quality
@pytest.mark.parametrize("case", BENCHMARK_CASES)
def test_factual_correctness(harness, case):
output = harness.generate(case['input'])
is_correct, method, score = factual_correctness_test(
output,
case['expected_output'],
threshold=0.75
)
assert is_correct, f"Factual test failed ({method}: {score:.2f})"
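`factual_correctness_test` assumes two helpers, `compute_embedding_similarity` and `extract_key_concepts`, that are not defined here. Minimal stand-ins are sketched below, using bag-of-words cosine similarity in place of real embeddings; treat both as placeholders to swap for your embedding model and keyphrase extractor.

```python
import math
import re
from collections import Counter

STOPWORDS = {'the', 'a', 'an', 'is', 'are', 'of', 'to', 'and', 'in'}

def _tokens(text):
    """Lowercase word tokens with stopwords removed."""
    return [t for t in re.findall(r"[a-z0-9']+", text.lower()) if t not in STOPWORDS]

def compute_embedding_similarity(a, b):
    """Stand-in: bag-of-words cosine similarity (swap in real embeddings)."""
    va, vb = Counter(_tokens(a)), Counter(_tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def extract_key_concepts(text):
    """Stand-in: content words as a set (swap in NER / keyphrase extraction)."""
    return set(_tokens(text))
```

Because both stand-ins are lexical, they will under-score paraphrases; the thresholds in the test above (0.75-0.8) assume proper embeddings.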
Style Compliance
def check_style_compliance(output, style_rules):
"""Verify output follows style guidelines"""
violations = []
# Rule 1: Tone check
if style_rules.get('tone') == 'formal':
informal_words = ['gonna', 'wanna', 'hey', 'cool', 'awesome']
if any(word in output.lower() for word in informal_words):
violations.append('informal_tone')
# Rule 2: Length constraints
if style_rules.get('max_length'):
if len(output) > style_rules['max_length']:
violations.append(f"exceeds_max_length ({len(output)} > {style_rules['max_length']})")
# Rule 3: Format requirements
if style_rules.get('format') == 'bullet_points':
if not output.strip().startswith('-') and not output.strip().startswith('•'):
violations.append('not_bullet_points')
# Rule 4: Forbidden words/phrases
if style_rules.get('forbidden_words'):
for word in style_rules['forbidden_words']:
if word.lower() in output.lower():
violations.append(f"contains_forbidden_word: {word}")
return len(violations) == 0, violations
@pytest.mark.quality
def test_style_compliance(harness, case):
output = harness.generate(case['input'])
is_compliant, violations = check_style_compliance(output, STYLE_RULES)
assert is_compliant, f"Style violations: {violations}"
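The test above references a `STYLE_RULES` object defined elsewhere. A hypothetical configuration, with keys matching exactly what `check_style_compliance` reads, might look like:

```python
# Hypothetical style configuration; keys match what check_style_compliance reads
STYLE_RULES = {
    'tone': 'formal',                  # flags informal words ('gonna', 'hey', ...)
    'max_length': 1200,                # characters, not tokens
    'format': 'bullet_points',         # output must start with '-' or '•'
    'forbidden_words': ['guarantee', 'world-class'],
}
```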
9. Example Benchmark Suites
1. Coding Tasks
{
"name": "coding_fundamentals",
"task_type": "code_generation",
"language": "python",
"count": 45,
"difficulty_distribution": {
"easy": 20,
"medium": 15,
"hard": 10
},
"categories": ["strings", "loops", "functions", "data_structures", "algorithms"],
"cases": [
{
"id": "coding_001",
"difficulty": "easy",
"category": "strings",
"input": "Write a function to reverse a string without using built-in reverse function",
"expected": "def reverse_string(s):\n return s[::-1]\n # or: ''.join(reversed(s))",
"evaluation": "syntax_valid + test_cases_pass",
"test_cases": [
{"input": "hello", "expected": "olleh"},
{"input": "", "expected": ""},
{"input": "a", "expected": "a"}
]
},
{
"id": "coding_025",
"difficulty": "hard",
"category": "algorithms",
"input": "Implement a function that finds the longest increasing subsequence in O(n log n) time",
"expected": "def lis_length(nums):\n if not nums: return 0\n ...",
"evaluation": "syntax_valid + test_cases_pass + efficient",
"test_cases": [
{"input": [3,10,2,1,20], "expected": 3},
{"input": [3,3,3,3], "expected": 1},
{"input": [], "expected": 0}
],
"time_complexity": "O(n log n)",
"space_complexity": "O(n)"
}
]
}
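The `test_cases_pass` evaluation for cases like `coding_001` can be scored by executing the generated code against the case's test cases. A minimal sketch follows; the function name is assumed known per case, and `exec()` on model output is unsafe outside a sandboxed subprocess.

```python
def run_code_test_cases(code_str, func_name, test_cases):
    """Return the fraction of test cases the generated function passes."""
    namespace = {}
    exec(code_str, namespace)  # UNSAFE on untrusted output; sandbox in production
    func = namespace[func_name]
    passed = 0
    for tc in test_cases:
        try:
            if func(tc['input']) == tc['expected']:
                passed += 1
        except Exception:
            pass  # a runtime error counts as a failed case
    return passed / len(test_cases)
```

For `coding_001`, `run_code_test_cases(output, 'reverse_string', case['test_cases'])` returns 1.0 only when all three cases pass.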
2. Question Answering
{
"name": "qa_knowledge",
"task_type": "question_answering",
"count": 40,
"domains": ["science", "history", "geography", "literature"],
"difficulty_distribution": {
"easy": 15,
"medium": 15,
"hard": 10
},
"cases": [
{
"id": "qa_001",
"difficulty": "easy",
"domain": "science",
"question": "What is the chemical formula for water?",
"expected_answer": "H2O",
"acceptable_variations": ["h2o", "H₂O", "dihydrogen monoxide"],
"source": "general_knowledge",
"evaluation": "factual_match"
},
{
"id": "qa_025",
"difficulty": "hard",
"domain": "literature",
"question": "In 'Moby Dick', what is the name of Captain Ahab's ship?",
"expected_answer": "the Pequod",
"acceptable_variations": ["pequod", "the pequod"],
"source": "moby_dick_chapter_32",
"evaluation": "factual_match + synonym_tolerance"
}
]
}
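The `factual_match` evaluation with `acceptable_variations` amounts to normalizing both sides before comparing. A minimal sketch, where the normalization rules (lowercasing, whitespace collapse, dropping a leading article) are an assumption to tune per domain:

```python
def normalize_answer(s):
    """Lowercase, collapse whitespace, and drop a leading article."""
    s = ' '.join(s.lower().split())
    for article in ('the ', 'a ', 'an '):
        if s.startswith(article):
            s = s[len(article):]
            break
    return s

def factual_match(answer, case):
    """True if the answer matches the expected answer or any listed variation."""
    accepted = [case['expected_answer'], *case.get('acceptable_variations', [])]
    return normalize_answer(answer) in {normalize_answer(a) for a in accepted}
```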
3. Summarization
{
"name": "summarization_quality",
"task_type": "summarization",
"count": 30,
"document_types": ["news", "research", "legal", "technical"],
"target_length_words": [50, 100, 200],
"cases": [
{
"id": "sum_001",
"difficulty": "easy",
"document_type": "news",
"source": "reuters_article_20260415",
"document": "Full article text here...",
"target_length": 100,
"reference_summary": "Company X announced quarterly earnings of $Y billion, up Z% year-over-year.",
"evaluation": "rouge_l + manual_quality_score",
"metrics": {
"rouge_l_threshold": 0.4,
"brevity_penalty": 0.9,
"key_points_required": 2
}
}
]
}
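The `rouge_l` metric referenced above is an LCS-based F-measure. A dependency-free sketch is below; production evaluation usually uses a library such as rouge-score, which also handles stemming.

```python
def _lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            if tok_a == tok_b:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    """ROUGE-L F1 over whitespace tokens (no stemming, unlike library impls)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = _lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A case like `sum_001` passes when `rouge_l(summary, case['reference_summary'])` clears the suite's `rouge_l_threshold` of 0.4.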
4. Classification
{
"name": "sentiment_classification",
"task_type": "classification",
"num_classes": 3,
"classes": ["positive", "neutral", "negative"],
"count": 50,
"class_distribution": {
"positive": 18,
"neutral": 16,
"negative": 16
},
"cases": [
{
"id": "sent_001",
"difficulty": "easy",
"text": "I love this product! It's amazing!",
"expected_class": "positive",
"confidence_threshold": 0.7
},
{
"id": "sent_025",
"difficulty": "hard",
"text": "The product is okay, but it could be better in some ways.",
"expected_class": "neutral",
"confidence_threshold": 0.6,
"notes": "Subjective - some might classify as mildly positive"
}
]
}
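All four suites share the same shape (header metadata plus a `cases` list), so one driver can run any of them. A sketch, where `generate` and `evaluate` are placeholders for your harness call and the per-suite scoring rule:

```python
def run_suite(suite, generate, evaluate, pass_threshold=0.5):
    """generate(case) -> output; evaluate(case, output) -> score in [0, 1]."""
    results = []
    for case in suite['cases']:
        output = generate(case)
        results.append({
            'id': case['id'],
            'difficulty': case.get('difficulty'),
            'score': evaluate(case, output),
        })
    passed = sum(1 for r in results if r['score'] >= pass_threshold)
    return {'suite': suite['name'], 'pass_rate': passed / len(results), 'results': results}
```

Keeping the driver suite-agnostic means a new task type only needs a new `evaluate` function, not a new runner.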
10. Reporting & Dashboards
Quality Trends Over Time
def generate_trend_report(benchmark_name, days=90):
"""Generate a comprehensive trend report"""
# Fetch metrics over time
history = fetch_metric_history(benchmark_name, days=days)
# Calculate trends
report = {
'benchmark': benchmark_name,
'period_days': days,
'measurement_count': len(history),
'metrics': {}
}
# Metrics to track
metric_keys = ['score', 'latency_p50', 'latency_p95', 'success_rate', 'cost_per_success']
for metric in metric_keys:
values = [h['metrics'][metric] for h in history]
# Linear regression
x = np.arange(len(values))
slope, intercept, r_value, p_value, _ = scipy.stats.linregress(x, values)
report['metrics'][metric] = {
'current': values[-1],
'baseline': values[0],
'change_pct': ((values[-1] - values[0]) / values[0]) * 100,
'trend': ('improving' if slope < 0 else 'degrading')
    if metric in ('latency_p50', 'latency_p95', 'cost_per_success')
    else ('improving' if slope > 0 else 'declining'),
'trend_strength': abs(slope) / np.std(values) if np.std(values) > 0 else 0,
'statistically_significant': p_value < 0.05
}
return report
# Example report:
# Benchmark: python_basics (90 days)
#
# Metric Current Change Trend Significant
# ──────────────────────────────────────────────────────
# Score 87.5 +4.2% improving ↗ Yes
# Latency (p50) 1240ms +8.3% degrading ↗ Yes
# Latency (p95) 2100ms +2.1% stable → No
# Success Rate 96.5% +1.2% improving ↗ No
# Cost/Success $0.015 -3.5% improving ↘ Yes
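The same metric history feeds regression alerting. A sketch of a baseline comparison using a 5% threshold, direction-aware because latency and cost improve downward:

```python
def detect_regression(baseline, current, threshold_pct=5.0):
    """Flag metrics that moved more than threshold_pct in the bad direction."""
    lower_is_better = {'latency_p50', 'latency_p95', 'cost_per_success'}
    alerts = []
    for metric, base in baseline.items():
        if metric not in current or base == 0:
            continue
        change_pct = (current[metric] - base) / base * 100
        regressed = (change_pct > threshold_pct if metric in lower_is_better
                     else change_pct < -threshold_pct)
        if regressed:
            alerts.append({'metric': metric, 'change_pct': round(change_pct, 1)})
    return alerts
```

Wire the returned alerts into whatever notification channel the team already watches; an empty list means no metric crossed the threshold.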
Model Comparison Charts
def plot_model_comparison_matrix(models, benchmark_suite):
"""Create side-by-side comparison charts"""
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle(f'Model Comparison: {benchmark_suite.name}')
# 1. Quality Score Distribution
ax = axes[0, 0]
for model in models:
results = run_benchmark(model, benchmark_suite)
scores = [r['score'] for r in results]
ax.hist(scores, alpha=0.6, label=model.name, bins=10)
ax.set_xlabel('Quality Score')
ax.set_ylabel('Frequency')
ax.set_title('Score Distribution')
ax.legend()
ax.grid(True, alpha=0.3)
# 2. Latency Percentiles
ax = axes[0, 1]
model_names = []
p50_latencies = []
p95_latencies = []
p99_latencies = []
for model in models:
results = run_benchmark(model, benchmark_suite)
latencies = [r['latency_ms'] for r in results]
model_names.append(model.name)
p50_latencies.append(np.percentile(latencies, 50))
p95_latencies.append(np.percentile(latencies, 95))
p99_latencies.append(np.percentile(latencies, 99))
x = np.arange(len(model_names))
width = 0.25
ax.bar(x - width, p50_latencies, width, label='p50')
ax.bar(x, p95_latencies, width, label='p95')
ax.bar(x + width, p99_latencies, width, label='p99')
ax.set_xlabel('Model')
ax.set_ylabel('Latency (ms)')
ax.set_title('Latency Percentiles')
ax.set_xticks(x)
ax.set_xticklabels(model_names, rotation=45)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
# 3. Cost vs Quality
ax = axes[1, 0]
for model in models:
results = run_benchmark(model, benchmark_suite)
avg_score = np.mean([r['score'] for r in results])
cost = model.cost_usd
ax.scatter(cost, avg_score, s=200, label=model.name)
ax.set_xlabel('Cost per Call ($)')
ax.set_ylabel('Quality Score')
ax.set_title('Cost vs Quality Trade-off')
ax.legend()
ax.grid(True, alpha=0.3)
# 4. Error Rates
ax = axes[1, 1]
error_rates = []
for model in models:
results = run_benchmark(model, benchmark_suite)
error_rate = sum(1 for r in results if not r.get('success')) / len(results) * 100
error_rates.append(error_rate)
ax.barh(model_names, error_rates, color='coral')
ax.set_xlabel('Error Rate (%)')
ax.set_title('Error Rates')
ax.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
return fig
Error Breakdown by Category
def generate_error_report(benchmark_results):
"""Detailed error analysis"""
errors_by_category = {}
errors_by_difficulty = {}
for result in benchmark_results:
if not result.get('success'):
category = result.get('error_category', 'unknown')
difficulty = result.get('difficulty', 'unknown')
# By category
errors_by_category[category] = errors_by_category.get(category, 0) + 1
# By difficulty
key = f"{difficulty}_{category}"
errors_by_difficulty[key] = errors_by_difficulty.get(key, 0) + 1
total_errors = sum(errors_by_category.values())
report = {
'total_errors': total_errors,
'error_rate_pct': (total_errors / len(benchmark_results)) * 100,
'by_category': {
k: {'count': v, 'pct': (v / total_errors) * 100}
for k, v in sorted(errors_by_category.items(), key=lambda x: x[1], reverse=True)
},
'by_difficulty': errors_by_difficulty
}
return report
# Example output:
# Total Errors: 42 (5.2%)
#
# By Category:
# Schema Error: 18 (42.9%)
# Timeout: 8 (19.0%)
# Hallucination: 7 (16.7%)
# Logic Error: 5 (11.9%)
# Incomplete Output: 4 (9.5%)
#
# By Difficulty:
# Easy errors: 10 (23.8%)
# Medium errors: 15 (35.7%)
# Hard errors: 17 (40.5%)
Cost vs Quality Scatter Plot
def plot_cost_quality_analysis(models_analysis):
"""Plot cost vs quality to show trade-offs"""
fig, ax = plt.subplots(figsize=(12, 8))
# Size = tokens per call, color = latency
for analysis in models_analysis:
ax.scatter(
analysis['cost_per_call'],
analysis['quality_score'],
s=analysis['tokens_per_call'] / 5, # Scale for visibility
c=analysis['latency_ms'],
cmap='viridis',
alpha=0.7,
edgecolors='black',
linewidth=1.5,
label=analysis['model']
)
# Annotate model name
ax.annotate(
analysis['model'],
(analysis['cost_per_call'], analysis['quality_score']),
xytext=(5, 5),
textcoords='offset points',
fontsize=9
)
ax.set_xlabel('Cost per Call ($)', fontsize=12)
ax.set_ylabel('Quality Score (0-100)', fontsize=12)
ax.set_title('Cost vs Quality Trade-off Analysis', fontsize=14)
ax.grid(True, alpha=0.3)
# Add colorbar for latency
cbar = plt.colorbar(ax.collections[0], ax=ax)
cbar.set_label('Latency (ms)', fontsize=11)
# Legend for bubble size
sizes = [100, 500, 1000]
for size in sizes:
    ax.scatter([], [], s=size / 5, c='gray', alpha=0.6,
               edgecolors='black', label=f'{size} tokens')
ax.legend(title='Tokens/call', loc='lower right')
plt.tight_layout()
return fig
Monthly/Quarterly Reports
def generate_quarterly_report(benchmark_suite, quarter):
"""Comprehensive quality report for business review"""
# Fetch data for the quarter
start_date = get_quarter_start(quarter)
end_date = get_quarter_end(quarter)
history = fetch_metric_history(benchmark_suite, start_date, end_date)
report = {
'title': f'{benchmark_suite} Quality Report - Q{quarter}',
'period': f'{start_date.strftime("%Y-%m-%d")} to {end_date.strftime("%Y-%m-%d")}',
'executive_summary': {},
'metrics': {},
'comparisons': {},
'improvements': [],
'regressions': [],
'recommendations': []
}
# Executive summary
first = history[0]
last = history[-1]
report['executive_summary'] = {
'quality_change': f"{last['metrics']['score'] - first['metrics']['score']:+.1f} points",
'quality_trend': 'improving' if last['metrics']['score'] > first['metrics']['score'] else 'declining',
'efficiency_change': f"{((last['metrics']['cost_per_success'] - first['metrics']['cost_per_success']) / first['metrics']['cost_per_success']) * 100:+.1f}%",
'reliability': f"{last['metrics']['success_rate']:.1f}%",
'key_achievements': identify_achievements(history),
'priority_issues': identify_issues(history)
}
# Detailed metrics
for metric in ['score', 'success_rate', 'cost_per_success', 'latency_p95']:
values = [h['metrics'][metric] for h in history]
report['metrics'][metric] = {
'start': first['metrics'][metric],
'end': last['metrics'][metric],
'change_pct': ((last['metrics'][metric] - first['metrics'][metric]) / first['metrics'][metric]) * 100,
'mean': np.mean(values),
'std': np.std(values),
'min': min(values),
'max': max(values)
}
# Model comparisons (if testing multiple models)
# ... (comparison metrics)
# Improvements made
report['improvements'] = analyze_improvements(history)
# Regressions detected
report['regressions'] = analyze_regressions(history)
# Recommendations (illustrative; derive these from the metrics above in practice)
report['recommendations'] = [
'Focus on latency optimization (p95 increased 15%)',
'Review error categories - schema errors up from 2% to 4%',
'Consider model upgrade - quality gains plateau',
'Implement caching for high-latency operations'
]
return report
Best Practices Summary
- Define Clear Metrics: Start with task-specific success criteria. Not one-size-fits-all.
- Automate Everything Possible: Regex validation, schema checks, syntax verification. Save human review for subjective calls.
- Establish Baselines: You can’t measure improvement without knowing where you started.
- Run Benchmarks Regularly: Daily or weekly, depending on release cadence. Catch regressions early.
- Use Representative Test Cases: 40-50 well-chosen cases beat 500 random ones. Include edge cases.
- Statistical Rigor: Not all improvements are real. Use significance testing before declaring victory.
- Correlate Changes with Improvements: Link commits to performance jumps to identify what works.
- Track Multiple Dimensions: Quality alone isn’t enough. Monitor cost, latency, and reliability too.
- Create Clear Dashboards: Trends matter more than point values. Show before/after, not just latest.
- Report to Stakeholders: Monthly summaries help product managers understand quality trajectory and ROI.
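The "Statistical Rigor" point can be as simple as a paired bootstrap on per-case scores. A dependency-free sketch; 10,000 resamples and the one-sided framing are conventional choices here, not requirements:

```python
import random
import statistics

def paired_bootstrap(scores_a, scores_b, iters=10_000, seed=0):
    """One-sided paired bootstrap: how often does B's advantage disappear?"""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = statistics.mean(diffs)
    not_better = sum(
        statistics.mean([diffs[rng.randrange(n)] for _ in range(n)]) <= 0
        for _ in range(iters)
    )
    return observed, not_better / iters  # (mean improvement, approx p-value)
```

If the returned p-value is above 0.05, treat the "improvement" as noise and keep collecting cases before switching models or prompts.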
Implementation Checklist
- Define domain-specific success criteria
- Create initial benchmark suite (30-50 representative cases)
- Build automated validation tests (schema, syntax, semantic)
- Establish baseline metrics on current model
- Set up daily benchmark runs (CI or scheduled task)
- Create regression detection alerts
- Build trend analysis dashboard
- Implement A/B testing framework for models
- Create prompt variation testing harness
- Set up tool evaluation metrics
- Generate first monthly quality report
- Document benchmark maintenance procedures
- Train team on reading/interpreting reports
Validation Checklist
How do you know you got this right?
Performance Checks
- Benchmark suite runs in <5 minutes (40-50 cases, automated evaluation)
- Metrics calculated in <30 seconds from benchmark data
- Dashboard refreshes daily with <1 minute latency
- Regression detection triggers within 5% deviation from baseline
Implementation Checks
- Domain-specific success metrics defined (code: tests pass, content: relevance, etc)
- Benchmark suite created: 40-50 representative cases including edge cases
- Automated evaluation for 70%+ of metrics (schema, syntax, correctness)
- Baseline established: before/after state clear for all metrics
- Daily benchmark runs configured and executing reliably
- A/B testing framework: can switch models and measure impact
- Trend analysis working: can spot improvement/regression over time
Integration Checks
- Benchmark connects to harness: can run against current version automatically
- Quality metrics flow to dashboard: real-time monitoring visible
- Regression alerts configured: team notified of >5% quality drop
- Cost metrics tracked alongside quality: cost per quality point visible
- Prompt evaluation: can measure impact of prompt changes (doc 15)
Common Failure Modes
- Benchmark not representative: Test cases don’t match real usage distribution
- Metrics not actionable: Dashboard shows numbers but unclear what to do
- Regression detection too sensitive: False alarms for 1% variance
- Manual evaluation bottleneck: Can’t scale to monitor all outputs
- Baseline stale: Original baseline never updated, making progress invisible
Sign-Off Criteria
- Benchmark suite running daily on CI/scheduler
- Baseline metrics established and documented
- Dashboard showing trends over 2+ weeks
- A/B test completed: measured impact of one model or prompt change
- First regression detected and handled (false alarm or real improvement)
- Team understands metrics and dashboard interpretation
See Also
- Doc 15 (Prompt Engineering): Measure prompt optimization impact with these metrics
- Doc 03 (Hugging Face): Model selection validated through benchmarking
- Doc 05 (AI Agents): Framework choice validated by completion rate metrics