Evaluation & Benchmarking
Quality measurement frameworks — ROUGE, BLEU, human evaluation, continuous benchmarking, model comparison, and A/B testing.
Phase: 3 (Essential for long-term success but not blocking launch)
Audience: Product managers, quality engineers, data scientists, platform teams
Purpose: Establish systematic approaches to measure harness quality, compare models, detect regressions, and optimize performance.
1. Defining Quality Metrics for Your Domain
Quality is not one-dimensional. Different harness applications require different success criteria. The first step is identifying which metrics matter.
Task-Specific Success Criteria
Define what “success” means for your harness:
| Application Domain | Primary Success Metric | Secondary Metrics |
|---|---|---|
| Code Generation | Code compiles and passes tests | Readability, style compliance, efficiency |
| Customer Support | User satisfaction, resolution rate | First-response time, escalations needed |
| Content Creation | Relevance and accuracy | Originality, engagement, brand consistency |
| Data Analysis | Correctness of insights | Clarity of explanation, actionability |
| Document Summarization | ROUGE-L/BLEU scores, human judgment | Brevity, key points retained |
| Classification | F1 score, per-class accuracy | Confidence calibration, false positive rate |
| Search/Ranking | NDCG, MRR, click-through rate | Diversity of results, relevance at rank 10 |
| Translation | BLEU/METEOR score, human judgment | Terminology consistency, cultural accuracy |
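Several metrics in this table can be computed without any external service. As one example, ROUGE-L (used for summarization above) is based on the longest common subsequence between candidate and reference. A minimal pure-Python sketch; production systems should prefer a maintained package such as `rouge-score`:

```python
def lcs_length(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    """ROUGE-L F1 over whitespace tokens (simplified: no stemming, no multi-reference)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Identical strings score 1.0; disjoint strings score 0.0.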
Domain Expertise Required
Some domains require specialized knowledge to evaluate:
- Code Quality: Need developers to review syntax, logic, performance, security
- Medical/Legal: Subject matter experts must verify factual accuracy
- Content: Native speakers evaluate naturalness in translation/localization
- Data Analysis: Statisticians validate methodology and conclusions
- Creative Writing: Human judgment on tone, voice, narrative flow
For specialized domains, build a reviewer pool and establish inter-rater agreement (Cohen’s kappa ≥ 0.70 for manual evaluation).
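Cohen's kappa corrects raw agreement for the agreement two raters would reach by chance. A minimal sketch for two raters labeling the same items:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa of 1.0 is perfect agreement; values at or above the 0.70 threshold above indicate the rubric is being applied consistently.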
Automated vs Manual Evaluation
Choose the right approach for each metric:
| Metric | Automated | Manual | Hybrid |
|---|---|---|---|
| Syntax correctness (code) | ✓ Compiler/linter | - | - |
| Code test passage | ✓ Test runner | - | - |
| Schema compliance | ✓ JSON schema validation | - | - |
| Factual accuracy | - | ✓ Expert review | ✓ Spot-check + automation for obvious errors |
| Tone/style match | - | ✓ Human judgment | ✓ Rubric + final approval |
| Readability | ✓ Flesch-Kincaid, word count | ✓ User testing | ✓ Score + final review |
| Creativity | - | ✓ Human judges | ✓ Baseline scoring rules + judges |
| Performance | ✓ Benchmark suite | - | - |
Rule of thumb: Automate anything objective. Use humans for subjective judgment, then apply rubrics to increase consistency.
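As an example of an objective metric worth automating, Flesch Reading Ease (referenced in the table above) can be computed directly. The syllable counter below is a crude vowel-group heuristic, so treat scores as approximate; a tested library is preferable in production:

```python
import re

def flesch_reading_ease(text):
    """Flesch Reading Ease. Higher = easier to read.
    Syllables are approximated by counting vowel groups, which is imprecise."""
    sentences = max(1, len(re.findall(r'[.!?]+', text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    syllables = sum(max(1, len(re.findall(r'[aeiouy]+', w.lower()))) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))
```

Short, monosyllabic sentences score above 100; dense technical prose typically lands below 50.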
Establishing a Quality Rubric
Create a standardized rubric for manual evaluation:
Task: Code Generation Quality
Criteria:
- Correctness (40%):
  - Code compiles without errors: Yes/No
  - All tests pass: Yes/No
  - Score: (tests_passed / total_tests) × 40
- Style (30%):
  - Follows naming conventions: Yes/No (10 points)
  - Proper indentation/formatting: Yes/No (10 points)
  - Docstrings present and clear: Yes/No (10 points)
- Efficiency (20%):
  - Time complexity acceptable: Yes/No (10 points)
  - Space complexity acceptable: Yes/No (10 points)
- Readability (10%):
  - Variable names self-documenting: Yes/No (5 points)
  - Code easy to follow: Yes/No (5 points)
Final Score: sum of weighted scores (out of 100)
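A rubric like this can be applied mechanically once a reviewer records their Yes/No answers. A sketch, with hypothetical field names and the two readability checks assumed to be worth 5 points each:

```python
def score_code_generation(checks):
    """Compute a 0-100 rubric score from a reviewer's raw answers.
    Weights mirror the rubric: 40% correctness, 30% style,
    20% efficiency, 10% readability."""
    correctness = 0
    if checks["compiles"]:
        correctness = (checks["tests_passed"] / checks["total_tests"]) * 40
    style = 10 * sum([checks["naming_ok"], checks["formatting_ok"], checks["docstrings_ok"]])
    efficiency = 10 * sum([checks["time_ok"], checks["space_ok"]])
    readability = 5 * sum([checks["names_clear"], checks["easy_to_follow"]])
    return correctness + style + efficiency + readability
```

Gating correctness on compilation means code that does not compile scores at most 60, which matches the intent of weighting correctness heaviest.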
2. Quality Assessment Frameworks
A complete quality assessment uses multiple dimensions. Track them in parallel.
Task Completion Rate
The percentage of requests that produce valid, usable output.
Completion Rate = (Requests with valid output) / (Total requests) × 100
Valid Output = meets schema requirements + no timeout/error
Baseline: most harnesses should achieve a completion rate of ≥95% at launch; below 90% indicates structural issues.
Tracking:
def calculate_completion_rate(results):
    valid = sum(1 for r in results if r.get('status') == 'success')
    return (valid / len(results)) * 100
Error Rate by Category
Not all errors are equal. Categorize to identify patterns:
- Schema/Format Errors: Output doesn’t match expected format
- Timeout Errors: Execution exceeded time limit
- Hallucination: Model claims facts not in input
- Incomplete Output: Partial or cut-off response
- Logic Errors: Wrong reasoning or conclusion
- Resource Errors: OOM, rate limit hit
- Tool Failures: Called tools errored or unavailable
def categorize_errors(results):
    errors = {}
    for r in results:
        if r.get('status') != 'success':
            category = r.get('error_category', 'unknown')
            errors[category] = errors.get(category, 0) + 1
    return {k: (v / len(results)) * 100 for k, v in errors.items()}
Dashboard view:
Schema Errors: 2.3%
Timeouts: 0.1%
Hallucinations: 1.8%
Incomplete Output: 0.6%
Logic Errors: 1.2%
Total Error Rate: 6.0%
Latency (Speed)
Measure end-to-end time and components:
- Time to First Token: Perceived responsiveness (target: <2 sec)
- Total Latency: Complete request (target: <30 sec for most tasks)
- Token Generation Rate: Tokens/second (typically 30-80)
- Percentiles: p50, p95, p99 (outliers matter for user experience)
def analyze_latency(results):
    latencies = [r['latency_ms'] for r in results if r.get('latency_ms')]
    return {
        'mean': np.mean(latencies),
        'median': np.median(latencies),
        'p95': np.percentile(latencies, 95),
        'p99': np.percentile(latencies, 99),
        'max': max(latencies)
    }
Action thresholds:
- p95 > 30s: Investigate timeout handling
- p99 > 60s: Consider circuit breakers or timeouts
- Mean > Median + 5s: Check for outlier cases
Cost Per Success
Track API costs alongside quality:
Cost per Success = (Total API cost) / (Successful outputs)
Cost per Quality Point = (Cost per success) / (Average quality score)
Example:
- 1000 requests, 950 successful
- API cost: $15
- Cost per success: $0.0158
- If quality score averages 85/100: cost per quality point ≈ $0.000186
Use this to compare models:
Claude 3 Opus: $0.0158 per success, 92% quality
Claude 3 Haiku: $0.0045 per success, 78% quality
GPT-4: $0.0312 per success, 94% quality
Trade-offs become visible: is the 14-point quality difference worth 3.5× the cost?
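The formulas above translate directly into a small helper (the numbers below are from the example):

```python
def cost_metrics(total_cost_usd, total_requests, successful, avg_quality):
    """Cost per success and per quality point for one model."""
    cost_per_success = total_cost_usd / successful
    return {
        'success_rate': successful / total_requests,
        'cost_per_success': cost_per_success,
        'cost_per_quality_point': cost_per_success / avg_quality,
    }
```

Running it once per candidate model yields directly comparable rows for the table above.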
Hallucination Detection
Hallucinations, false claims presented as facts, are a critical safety risk.
Detection approaches:
1. Reference-based: Compare output against ground truth

   def detect_hallucinations(output, context):
       claims = extract_claims(output)
       for claim in claims:
           if not verify_in_context(claim, context):
               return True
       return False

2. Confidence-based: Flag low-confidence generations

   # During generation, track per-token probabilities
   hallucination_score = 1 - mean_token_probability
   if hallucination_score > 0.3:  # threshold
       flag_for_review()

3. Self-consistency: Run multiple times, compare outputs

   def self_consistency_check(prompt, runs=3):
       outputs = [generate(prompt) for _ in range(runs)]
       # If outputs diverge significantly, likely hallucinating
       similarity = compare_outputs(outputs)
       if similarity < 0.7:
           return "HALLUCINATION_RISK"

4. Human review: Sample outputs for manual verification
   - Review a random 5% of outputs
   - Track hallucination rate by category
   - Flag prompts that consistently hallucinate
Tracking:
Hallucination Detection (sample of 100 outputs):
- Fact hallucinations: 3 (3%)
- Code hallucinations: 1 (1%)
- None detected: 96 (96%)
Overall Hallucination Rate: 4%
Consistency (Same Input → Same Output Type)
Determinism is not always possible or desirable, but consistency in type and quality matters.
def measure_consistency(prompt, runs=10):
    outputs = [generate(prompt) for _ in range(runs)]
    # Check type consistency
    types = [get_type(o) for o in outputs]
    type_consistency = types.count(types[0]) / len(types)
    # Check quality consistency
    scores = [score_output(o) for o in outputs]
    score_stddev = np.std(scores)
    score_mean = np.mean(scores)
    return {
        'type_consistency': type_consistency,
        'score_mean': score_mean,
        'score_stddev': score_stddev,
        'acceptable': type_consistency > 0.9 and score_stddev < 10
    }
Action: If consistency < 90%, investigate:
- Is the prompt ambiguous?
- Are outputs actually different quality?
- Should the prompt be more prescriptive?
User Satisfaction (If Applicable)
If real users interact with harness outputs, measure satisfaction:
- Rating scale: “Was this response helpful?” (1-5 stars)
- Net Promoter Score: “Would you recommend this?” (0-10)
- Time-to-resolution: How quickly did user get answer?
- Escalation rate: % requiring human follow-up
- Re-query rate: Did user ask follow-up/reformulated question?
def calculate_satisfaction(feedback_events):
    ratings = [e['rating'] for e in feedback_events if e.get('rating')]
    nps_scores = [e['nps'] for e in feedback_events if e.get('nps')]
    return {
        'avg_rating': np.mean(ratings) if ratings else None,
        'nps': np.mean(nps_scores) if nps_scores else None,
        'median_resolution_time': np.median([e['time'] for e in feedback_events]),
        'escalation_rate': sum(1 for e in feedback_events if e.get('escalated')) / len(feedback_events)
    }
3. Continuous Benchmarking
Continuous measurement reveals trends and catches regressions early.
Benchmark Suite Creation
A good benchmark suite is:
- Representative: Covers the distribution of real usage
- Challenging: Includes edge cases and difficult examples
- Diverse: Spans difficulty levels and task variations
- Reproducible: Deterministic inputs and evaluation
- Maintainable: Easy to add new cases
Creating a benchmark:
class BenchmarkSuite:
    def __init__(self, name, task_type):
        self.name = name
        self.task_type = task_type
        self.cases = []
        self.metadata = {}

    def add_case(self, input_text, expected_output, difficulty='medium',
                 category=None, notes=''):
        self.cases.append({
            'input': input_text,
            'expected': expected_output,
            'difficulty': difficulty,
            'category': category,
            'notes': notes,
            'id': f"{self.name}_{len(self.cases)}"
        })

    def to_json(self):
        return {
            'name': self.name,
            'task_type': self.task_type,
            'count': len(self.cases),
            'difficulty_distribution': self._count_difficulty(),
            'cases': self.cases
        }

    def _count_difficulty(self):
        counts = {}
        for case in self.cases:
            d = case['difficulty']
            counts[d] = counts.get(d, 0) + 1
        return counts
Example benchmark for code generation:
{
  "name": "python_basics",
  "task_type": "code_generation",
  "count": 40,
  "difficulty_distribution": {
    "easy": 15,
    "medium": 15,
    "hard": 10
  },
  "cases": [
    {
      "id": "python_basics_0",
      "difficulty": "easy",
      "category": "loops",
      "input": "Write a function that prints numbers 1 to 10",
      "expected": "def print_numbers():\n    for i in range(1, 11):\n        print(i)",
      "notes": "Basic loop, should compile and run without error"
    },
    {
      "id": "python_basics_1",
      "difficulty": "medium",
      "category": "algorithms",
      "input": "Implement binary search on a sorted list",
      "expected": "def binary_search(arr, target):\n    left, right = 0, len(arr) - 1\n    ...",
      "notes": "Should handle edge cases: empty list, not found, etc."
    }
  ]
}
Baseline Establishment
Run your benchmark suite on the baseline model and record results:
def establish_baseline(harness, benchmark_suite, model_name):
    results = []
    for case in benchmark_suite.cases:
        start = time.time()
        try:
            output = harness.generate(case['input'])
            latency = time.time() - start
            score = evaluate_output(output, case['expected'])
            results.append({
                'case_id': case['id'],
                'output': output,
                'score': score,
                'latency_ms': latency * 1000,
                'success': True
            })
        except Exception as e:
            results.append({
                'case_id': case['id'],
                'error': str(e),
                'success': False
            })
    successful = [r for r in results if r['success']]
    baseline = {
        'model': model_name,
        'timestamp': datetime.now().isoformat(),
        'suite': benchmark_suite.name,
        'total': len(results),
        'successful': len(successful),
        'avg_score': np.mean([r['score'] for r in successful]),
        'avg_latency_ms': np.mean([r['latency_ms'] for r in successful]),
        'p95_latency_ms': np.percentile([r['latency_ms'] for r in successful], 95),
        'results': results
    }
    save_baseline(baseline, f"baselines/{model_name}_{benchmark_suite.name}.json")
    return baseline
Example baseline output:
Model: Claude 3 Haiku
Suite: python_basics
Timestamp: 2026-04-18T14:32:00Z
Total cases: 40
Successful: 38 (95%)
Failed: 2 (5%)
Metrics:
- Average score: 87.5 / 100
- Median score: 90 / 100
- Score variance: 12.3
- Avg latency: 1,240 ms
- p95 latency: 2,100 ms
- p99 latency: 3,500 ms
Error breakdown:
- Timeout: 1
- Schema error: 1
By difficulty:
- Easy (15 cases): 93% success, 92 avg score
- Medium (15 cases): 93% success, 87 avg score
- Hard (10 cases): 100% success, 82 avg score
Regular Re-running (Daily/Weekly)
Schedule benchmarks to run automatically:
# schedule_benchmarks.py
import schedule
import time

def run_daily_benchmarks():
    """Run at 2 AM UTC daily"""
    suites = load_benchmark_suites()
    for suite in suites:
        baseline = load_latest_baseline(suite.name)
        current = establish_baseline(harness, suite, 'latest')
        # Compare and alert if regression
        check_regressions(baseline, current)
        # Store for trend analysis
        store_result(current)

def run_weekly_analysis():
    """Run at Monday 9 AM UTC weekly"""
    analyze_trends()
    generate_weekly_report()

schedule.every().day.at("02:00").do(run_daily_benchmarks)
schedule.every().monday.at("09:00").do(run_weekly_analysis)

while True:
    schedule.run_pending()
    time.sleep(60)
Or use CI to run on every commit:
# .github/workflows/benchmark.yml
name: Continuous Benchmarking
on:
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM UTC
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run benchmarks
        run: python scripts/benchmark.py --suite all --compare latest
      - name: Upload results
        uses: actions/upload-artifact@v3
        with:
          name: benchmark-results
          path: results/
Regression Detection
Compare current results to baseline. Flag significant drops:
def check_regressions(baseline, current, thresholds=None):
    thresholds = thresholds or {
        'score_drop': 5,         # Drop of 5 points = regression
        'latency_increase': 20,  # 20% slower = regression
        'success_rate_drop': 2   # Drop of 2 percentage points = regression
    }
    regressions = []
    # Score regression
    score_delta = current['avg_score'] - baseline['avg_score']
    if score_delta < -thresholds['score_drop']:
        regressions.append({
            'type': 'score',
            'baseline': baseline['avg_score'],
            'current': current['avg_score'],
            'delta': score_delta,
            'severity': 'critical' if score_delta < -10 else 'warning'
        })
    # Latency regression
    latency_delta_pct = (
        (current['avg_latency_ms'] - baseline['avg_latency_ms'])
        / baseline['avg_latency_ms'] * 100
    )
    if latency_delta_pct > thresholds['latency_increase']:
        regressions.append({
            'type': 'latency',
            'baseline_ms': baseline['avg_latency_ms'],
            'current_ms': current['avg_latency_ms'],
            'delta_pct': latency_delta_pct,
            'severity': 'warning' if latency_delta_pct < 50 else 'critical'
        })
    # Success rate regression
    success_delta = (
        (current['successful'] / current['total'] * 100) -
        (baseline['successful'] / baseline['total'] * 100)
    )
    if success_delta < -thresholds['success_rate_drop']:
        regressions.append({
            'type': 'success_rate',
            'baseline_pct': baseline['successful'] / baseline['total'] * 100,
            'current_pct': current['successful'] / current['total'] * 100,
            'delta_pct': success_delta,
            'severity': 'critical'
        })
    return regressions
Alerting When Regression Detected
def alert_regression(regressions, channel='slack'):
    if not regressions:
        return
    critical = [r for r in regressions if r['severity'] == 'critical']
    warnings = [r for r in regressions if r['severity'] == 'warning']
    if critical:
        send_alert(
            channel=channel,
            level='critical',
            title='Critical Quality Regression Detected',
            regressions=critical
        )
        create_incident_issue()
    if warnings:
        send_alert(
            channel=channel,
            level='warning',
            title='Quality Regression Detected',
            regressions=warnings
        )

def send_alert(channel, level, title, regressions):
    if channel == 'slack':
        slack_client.post_message(
            channel='#quality-alerts',
            attachments=[{
                'color': 'danger' if level == 'critical' else 'warning',
                'title': title,
                'fields': [
                    {
                        'title': f"{r['type'].title()} Regression",
                        'value': f"Baseline: {r.get('baseline', r.get('baseline_pct'))} → "
                                 f"Current: {r.get('current', r.get('current_pct'))}",
                        'short': False
                    } for r in regressions
                ]
            }]
        )
4. Comparing Models
When evaluating new models, use rigorous comparison.
A/B Testing Models on Same Tasks
Run both models on the same benchmark suite:
def compare_models(benchmark_suite, models):
    """Run multiple models on the same benchmark cases.
    Returns per-case results keyed by model; pass the dict to
    create_comparison_report() for summary statistics."""
    results = {}
    for model_name in models:
        harness = initialize_harness(model_name)
        model_results = []
        for case in benchmark_suite.cases:
            output = harness.generate(case['input'])
            score = evaluate_output(output, case['expected'])
            model_results.append({
                'case_id': case['id'],
                'score': score,
                'output': output
            })
        results[model_name] = model_results
    return results
def create_comparison_report(results, benchmark_suite):
    report = {
        'timestamp': datetime.now().isoformat(),
        'suite': benchmark_suite.name,
        'models': {}
    }
    for model_name, model_results in results.items():
        scores = [r['score'] for r in model_results]
        report['models'][model_name] = {
            'mean_score': np.mean(scores),
            'median_score': np.median(scores),
            'stddev': np.std(scores),
            'min_score': min(scores),
            'max_score': max(scores),
            'percentile_25': np.percentile(scores, 25),
            'percentile_75': np.percentile(scores, 75)
        }
    # Rank models
    ranked = sorted(
        report['models'].items(),
        key=lambda x: x[1]['mean_score'],
        reverse=True
    )
    report['ranking'] = [m[0] for m in ranked]
    return report
Example comparison:
Benchmark Suite: python_basics
Timestamp: 2026-04-18
Model Comparison (mean score):
1. Claude 3 Opus: 87.5 / 100 (stddev: 12.3)
2. GPT-4 Turbo: 85.6 / 100 (stddev: 13.2)
3. Claude 3 Sonnet: 84.2 / 100 (stddev: 14.1)
4. Claude 3 Haiku: 79.8 / 100 (stddev: 16.5)
Model Performance by Difficulty:
              Easy  Medium  Hard
Opus:          94     87     81
GPT-4 Turbo:   91     85     80
Sonnet:        92     84     76
Haiku:         88     80     70
Statistical Significance
Don’t rely on point estimates. Determine if differences are real:
def statistical_significance_test(model_a_scores, model_b_scores, alpha=0.05):
    """Run a paired t-test between two models scored on the same cases"""
    # Paired test requires the same test cases, in the same order
    assert len(model_a_scores) == len(model_b_scores)
    a = np.asarray(model_a_scores, dtype=float)
    b = np.asarray(model_b_scores, dtype=float)
    t_statistic, p_value = scipy.stats.ttest_rel(a, b)
    is_significant = p_value < alpha
    mean_diff = float(np.mean(a) - np.mean(b))
    confidence_interval = scipy.stats.t.interval(
        1 - alpha,
        len(a) - 1,
        loc=mean_diff,
        scale=scipy.stats.sem(a - b)
    )
    return {
        'significant': is_significant,
        'p_value': p_value,
        'mean_difference': mean_diff,
        'confidence_interval_95': confidence_interval,
        'recommendation': (
            "Model A is significantly better" if mean_diff > 0 and is_significant
            else "Model B is significantly better" if mean_diff < 0 and is_significant
            else "No significant difference detected"
        )
    }
# Usage
results = compare_models(suite, ['opus', 'haiku'])
opus_scores = [r['score'] for r in results['opus']]
haiku_scores = [r['score'] for r in results['haiku']]
sig_test = statistical_significance_test(opus_scores, haiku_scores)
print(f"P-value: {sig_test['p_value']:.4f}")
print(f"Mean difference: {sig_test['mean_difference']:.2f}")
print(sig_test['recommendation'])
Interpretation:
- p-value < 0.05: Difference is statistically significant (95% confidence)
- p-value ≥ 0.05: Difference could be due to random variation
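When score distributions are skewed or heavy-tailed, the t-test's normality assumption is questionable. A paired bootstrap is a useful complement: resample the per-case score differences and report how often model A's mean beats model B's. A stdlib-only sketch (the iteration count and interpretation thresholds are conventions, not requirements):

```python
import random

def paired_bootstrap(scores_a, scores_b, iterations=10_000, seed=0):
    """Fraction of bootstrap resamples in which A's mean exceeds B's.
    Values near 1.0 suggest A is reliably better; near 0.5, no evidence."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    wins = 0
    for _ in range(iterations):
        # Resample case-level differences with replacement
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n > 0:
            wins += 1
    return wins / iterations
```

Because the resampling is paired at the case level, per-case difficulty cancels out, just as in the paired t-test.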
Trade-offs: Faster vs More Accurate
Use a cost-quality scatter plot to visualize trade-offs:
def cost_quality_analysis(models, benchmark_suite):
    """Analyze cost vs quality trade-off"""
    analysis = []
    for model in models:
        results = benchmark_suite.run(model)
        analysis.append({
            'model': model.name,
            'quality_score': results['avg_score'],
            'cost_per_call': model.cost_usd,
            'latency_ms': results['avg_latency_ms'],
            'tokens_per_call': results['avg_tokens']
        })
    return analysis
# Plotting
def plot_cost_quality(analysis):
    models = [a['model'] for a in analysis]
    costs = [a['cost_per_call'] for a in analysis]
    quality = [a['quality_score'] for a in analysis]
    plt.scatter(costs, quality, s=200)
    for i, model in enumerate(models):
        plt.annotate(model, (costs[i], quality[i]))
    plt.xlabel('Cost per call ($)')
    plt.ylabel('Quality score (0-100)')
    plt.title('Cost vs Quality Trade-off')
    plt.grid(True, alpha=0.3)
    plt.show()
Example decision matrix:
| Model | Quality | Cost/Call | Latency | Recommendation |
|---|---|---|---|---|
| Opus | 92 | $0.015 | 1800 ms | Use for high-stakes tasks |
| Sonnet | 85 | $0.003 | 1200 ms | Use for standard tasks |
| Haiku | 78 | $0.0008 | 600 ms | Use for high-volume, cost-sensitive tasks |
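A decision matrix like this can back a simple routing function in the harness itself. The thresholds and model names below are illustrative, taken from the example table, not recommendations:

```python
def choose_model(task, budget_per_call_usd):
    """Route a request to a model tier based on stakes and budget.
    Thresholds mirror the example decision matrix above (hypothetical)."""
    if task.get('high_stakes') and budget_per_call_usd >= 0.015:
        return 'opus'
    if budget_per_call_usd >= 0.003:
        return 'sonnet'
    return 'haiku'
```

Encoding the matrix in code keeps routing decisions auditable and easy to update when a new comparison run changes the numbers.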
5. Leaderboard Tracking
Track model performance over time to identify progress and regressions.
Version Tracking
Maintain a leaderboard across model versions:
class PerformanceLeaderboard:
    def __init__(self, benchmark_name):
        self.benchmark_name = benchmark_name
        self.entries = []

    def add_result(self, model_version, score, latency_ms, timestamp, commit_hash=''):
        entry = {
            'model': model_version,
            'score': score,
            'latency_ms': latency_ms,
            'timestamp': timestamp,
            'commit': commit_hash,
            'date': datetime.fromisoformat(timestamp).date()
        }
        self.entries.append(entry)

    def get_leaderboard(self, limit=20):
        """Return top performers"""
        sorted_entries = sorted(self.entries, key=lambda x: x['score'], reverse=True)
        return sorted_entries[:limit]

    def get_trend(self, model_version, days=30):
        """Get performance trend for a specific model"""
        cutoff = datetime.now() - timedelta(days=days)
        relevant = [
            e for e in self.entries
            if e['model'] == model_version and datetime.fromisoformat(e['timestamp']) > cutoff
        ]
        return sorted(relevant, key=lambda x: x['timestamp'])
Example leaderboard:
Rank Model Score Latency Date Commit
─────────────────────────────────────────────────────────────────
1. claude-opus-20240229 92.5 1.2s 2026-04-18 a7f3e9
2. claude-opus-20240115 92.1 1.3s 2026-04-17 b2d8c1
3. claude-sonnet-20240229 85.3 0.8s 2026-04-18 a7f3e9
4. gpt-4-20240229 84.8 2.1s 2026-04-17 (external)
5. claude-sonnet-20240115 84.6 0.9s 2026-04-16 b2d8c1
Metric History
Store all metrics over time for trend analysis:
def store_metric_history(model, benchmark_name, metrics, timestamp):
    """Store metrics in time-series database"""
    db.save({
        'model': model,
        'benchmark': benchmark_name,
        'timestamp': timestamp,
        'metrics': {
            'score': metrics['avg_score'],
            'latency_p50': metrics['latency_p50'],
            'latency_p95': metrics['latency_p95'],
            'latency_p99': metrics['latency_p99'],
            'success_rate': metrics['success_rate'],
            'error_rate': metrics['error_rate'],
            'cost_per_success': metrics['cost_per_success']
        }
    })

def analyze_trend(model, benchmark_name, metric='score', days=90):
    """Analyze trend for a metric over time"""
    cutoff = datetime.now() - timedelta(days=days)
    history = db.query({
        'model': model,
        'benchmark': benchmark_name,
        'timestamp': {'$gte': cutoff}
    })
    values = [h['metrics'][metric] for h in history]
    # Linear regression to find trend
    x = np.arange(len(values))
    slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, values)
    return {
        'metric': metric,
        'current_value': values[-1],
        'start_value': values[0],
        'delta': values[-1] - values[0],
        # A positive slope is "improving" only for higher-is-better metrics
        # like score; invert the interpretation for latency, error rate, cost
        'trend': 'improving' if slope > 0 else 'declining',
        'slope': slope,
        'r_squared': r_value ** 2,
        'statistically_significant': p_value < 0.05
    }
Correlating Improvements with Changes
Link performance jumps to specific changes:
def correlate_improvements(benchmark_name, git_log, metrics_history):
    """Find which commits caused quality improvements"""
    improvements = []
    for i in range(1, len(metrics_history)):
        prev = metrics_history[i-1]
        current = metrics_history[i]
        score_delta = current['metrics']['score'] - prev['metrics']['score']
        if score_delta > 2:  # Threshold for "improvement"
            # Find commit between these measurements
            commit = find_commit_between(
                prev['timestamp'],
                current['timestamp'],
                git_log
            )
            improvements.append({
                'timestamp': current['timestamp'],
                'improvement': score_delta,
                'commit': commit,
                'message': commit.message if commit else None,
                'author': commit.author if commit else None
            })
    return improvements

# Usage
improvements = correlate_improvements('python_basics', git_log, metrics_history)
for imp in improvements:
    print(f"Score +{imp['improvement']:.1f} - {imp['message']}")
Identifying Best Configurations
When you have many variables (model, temperature, max_tokens, prompt variation), find the best:
def find_best_configuration(benchmark_suite, config_grid):
    """Grid search across configurations"""
    results = []
    for config in config_grid:
        harness = initialize_harness(**config)
        scores = []
        for case in benchmark_suite.cases:
            output = harness.generate(case['input'])
            score = evaluate_output(output, case['expected'])
            scores.append(score)
        results.append({
            'config': config,
            'mean_score': np.mean(scores),
            'std_dev': np.std(scores),
            'min_score': min(scores),
            'max_score': max(scores)
        })
    # Rank by score, then by variance (prefer stable configs)
    ranked = sorted(
        results,
        key=lambda x: (x['mean_score'], -x['std_dev']),
        reverse=True
    )
    return ranked

# Example config grid
config_grid = [
    {'model': 'opus', 'temperature': 0.7, 'max_tokens': 1000},
    {'model': 'opus', 'temperature': 0.5, 'max_tokens': 1000},
    {'model': 'opus', 'temperature': 0.7, 'max_tokens': 2000},
    {'model': 'sonnet', 'temperature': 0.7, 'max_tokens': 1000},
]
best = find_best_configuration(suite, config_grid)
print(f"Best config: {best[0]['config']}")
print(f"Score: {best[0]['mean_score']:.1f} ± {best[0]['std_dev']:.1f}")
6. Prompt Optimization Through Benchmarking
Use your benchmark suite to systematically improve prompts.
Establish Baseline with Current Prompt
def baseline_prompt_optimization(benchmark_suite, current_prompt):
    harness = initialize_harness(system_prompt=current_prompt)
    results = []
    for case in benchmark_suite.cases:
        output = harness.generate(case['input'])
        score = evaluate_output(output, case['expected'])
        results.append({
            'case_id': case['id'],
            'score': score,
            'output': output
        })
    baseline = {
        'prompt': current_prompt,
        'avg_score': np.mean([r['score'] for r in results]),
        'results': results,
        'timestamp': datetime.now().isoformat()
    }
    return baseline
Try Variations
Create prompt variants and test them:
def generate_prompt_variants(base_prompt, variations_config):
    """Generate prompt variations"""
    variants = []
    # Variation 1: Add explicit instructions
    variants.append({
        'name': 'explicit_instructions',
        'prompt': base_prompt + "\n\nBe explicit about your reasoning steps."
    })
    # Variation 2: Few-shot examples
    variants.append({
        'name': 'few_shot',
        'prompt': base_prompt + generate_few_shot_examples()
    })
    # Variation 3: Chain-of-thought
    variants.append({
        'name': 'chain_of_thought',
        'prompt': base_prompt + "\n\nThink step by step."
    })
    # Variation 4: Structured output
    variants.append({
        'name': 'structured_output',
        'prompt': base_prompt + "\n\nReturn results as JSON with keys: [...]"
    })
    # Variation 5: Role-based
    variants.append({
        'name': 'expert_role',
        'prompt': "You are an expert " + base_prompt
    })
    return variants
def test_prompt_variants(benchmark_suite, variants):
    """Test all variants; assumes `baseline` from the baseline step is in scope"""
    results = {}
    for variant in variants:
        harness = initialize_harness(system_prompt=variant['prompt'])
        variant_results = []
        for case in benchmark_suite.cases:
            output = harness.generate(case['input'])
            score = evaluate_output(output, case['expected'])
            variant_results.append(score)
        results[variant['name']] = {
            'avg_score': np.mean(variant_results),
            'std_dev': np.std(variant_results),
            'improvement': np.mean(variant_results) - baseline['avg_score'],
            'scores': variant_results
        }
    return results
Measure Impact
def analyze_prompt_impact(baseline_score, variant_results):
    """Determine which variants improved performance"""
    analysis = []
    for name, metrics in variant_results.items():
        analysis.append({
            'variant': name,
            'score': metrics['avg_score'],
            'improvement': metrics['improvement'],
            'improvement_pct': (metrics['improvement'] / baseline_score) * 100,
            'stability': 1 - (metrics['std_dev'] / metrics['avg_score']),  # Higher = more stable
            'recommendation': 'adopt' if metrics['improvement'] > 1 else 'reject'
        })
    return sorted(analysis, key=lambda x: x['improvement'], reverse=True)

# Example output
# variant            score  improvement  improvement%  stability  recommendation
# few_shot            84.3      +3.2        +3.9%         0.87       adopt
# chain_of_thought    83.1      +2.0        +2.4%         0.85       adopt
# explicit_inst       81.5      +0.2        +0.3%         0.86       reject
# structured_output   80.2      -1.0        -1.2%         0.82       reject
# expert_role         81.8      +0.5        +0.6%         0.84       reject
Iterate Toward Improvement
Combine successful variants:
def iterative_prompt_improvement(benchmark_suite, baseline_prompt, iterations=3):
    current_best = baseline_prompt
    best_score = evaluate_prompt(benchmark_suite, current_best)
    for iteration in range(iterations):
        print(f"\n--- Iteration {iteration + 1} ---")
        print(f"Current best score: {best_score:.1f}")
        # Generate variants based on current best
        variants = generate_prompt_variants(current_best, {
            'add_examples': True,
            'add_structure': True,
            'add_reasoning': True
        })
        results = test_prompt_variants(benchmark_suite, variants)
        # Find best variant
        best_variant = max(
            results.items(),
            key=lambda x: x[1]['improvement']
        )
        if best_variant[1]['improvement'] > 0.5:
            current_best = get_variant_prompt(best_variant[0])
            best_score = best_variant[1]['avg_score']
            print(f"Improvement found: {best_variant[0]} (+{best_variant[1]['improvement']:.1f})")
        else:
            print("No improvement found. Stopping.")
            break
    return current_best, best_score
7. Tool Evaluation
If your harness uses multiple tools, measure their utility and reliability.
Tool Usage Metrics
def analyze_tool_usage(execution_logs, harness_name):
    """Analyze which tools are used and how often"""
    tool_calls = {}
    tool_errors = {}
    tool_latencies = {}
    for log in execution_logs:
        for call in log.get('tool_calls', []):
            tool_name = call['tool']
            # Count usage
            tool_calls[tool_name] = tool_calls.get(tool_name, 0) + 1
            # Track errors
            if call.get('error'):
                tool_errors[tool_name] = tool_errors.get(tool_name, 0) + 1
            # Track latency
            latency = call.get('latency_ms', 0)
            if tool_name not in tool_latencies:
                tool_latencies[tool_name] = []
            tool_latencies[tool_name].append(latency)
    # Compute summary stats
    summary = {}
    for tool_name in tool_calls.keys():
        total = tool_calls[tool_name]
        errors = tool_errors.get(tool_name, 0)
        latencies = tool_latencies.get(tool_name, [])
        summary[tool_name] = {
            'calls': total,
            'error_rate': (errors / total) * 100,
            'avg_latency_ms': np.mean(latencies),
            'p95_latency_ms': np.percentile(latencies, 95),
            'usage_pct': (total / sum(tool_calls.values())) * 100
        }
    return summary

# Example output
# Tool        Calls  Error%  AvgLatency  P95Latency  Usage%
# search        450    2.1%       450ms       890ms     45%
# wikipedia     320    1.2%       380ms       750ms     32%
# calculator    150    0.0%       120ms       200ms     15%
# translate      80    5.0%       520ms      1200ms      8%
Tool Error Rates
def categorize_tool_errors(execution_logs):
    """Break down tool errors by category"""
    error_categories = {}
    for log in execution_logs:
        for call in log.get('tool_calls', []):
            if call.get('error'):
                tool = call['tool']
                error = call['error']
                # Categorize
                if 'timeout' in error.lower():
                    category = 'timeout'
                elif 'rate limit' in error.lower():
                    category = 'rate_limit'
                elif 'not found' in error.lower():
                    category = 'not_found'
                elif 'invalid' in error.lower():
                    category = 'invalid_input'
                else:
                    category = 'other'
                key = f"{tool}_{category}"
                error_categories[key] = error_categories.get(key, 0) + 1
    return error_categories
Tool Latency Impact
def tool_latency_impact_analysis(execution_logs):
    """Measure how tool latency affects overall response time"""
    impacts = {
        'zero_tools': [],
        'one_tool': [],
        'two_tools': [],
        'three_plus_tools': []
    }
    for log in execution_logs:
        num_tools = len(log.get('tool_calls', []))
        total_latency = log.get('latency_ms', 0)
        if num_tools == 0:
            impacts['zero_tools'].append(total_latency)
        elif num_tools == 1:
            impacts['one_tool'].append(total_latency)
        elif num_tools == 2:
            impacts['two_tools'].append(total_latency)
        else:
            impacts['three_plus_tools'].append(total_latency)
    return {
        k: {'mean': np.mean(v), 'p95': np.percentile(v, 95)}
        for k, v in impacts.items() if v
    }

# Shows latency increases with tool count
# zero_tools:        mean=1200ms  p95=1800ms
# one_tool:          mean=2100ms  p95=3500ms
# two_tools:         mean=3200ms  p95=5100ms
# three_plus_tools:  mean=4500ms  p95=7200ms
When to Deprecate/Replace Tools
Make decisions based on data:
def evaluate_tool_replacement(tool_name, current_metrics, replacement_metrics):
"""Decide whether to replace a tool"""
decision = {
'tool': tool_name,
'current': current_metrics,
'replacement': replacement_metrics,
'factors': {}
}
# 1. Error rate
if replacement_metrics['error_rate'] < current_metrics['error_rate'] * 0.8:
decision['factors']['error_reduction'] = 'favorable'
# 2. Latency
if replacement_metrics['avg_latency'] < current_metrics['avg_latency'] * 0.8:
decision['factors']['latency_improvement'] = 'favorable'
# 3. Accuracy (if available)
if replacement_metrics.get('accuracy', 100) > current_metrics.get('accuracy', 100):
decision['factors']['accuracy_improvement'] = 'favorable'
# 4. Cost
if replacement_metrics.get('cost') and replacement_metrics['cost'] < current_metrics.get('cost', float('inf')):
decision['factors']['cost_savings'] = 'favorable'
favorable = sum(1 for v in decision['factors'].values() if v == 'favorable')
if favorable >= 2:
decision['recommendation'] = 'REPLACE'
elif favorable == 1:
decision['recommendation'] = 'CONSIDER'
else:
decision['recommendation'] = 'KEEP'
return decision
8. Automated Quality Testing
Build tests that run on every commit to catch regressions immediately.
Regex/Schema Validation
def validate_schema(output, schema_definition):
"""Validate output matches expected schema"""
try:
data = json.loads(output)
jsonschema.validate(instance=data, schema=schema_definition)
return True, None
except json.JSONDecodeError as e:
return False, f"Invalid JSON: {e}"
except jsonschema.ValidationError as e:
return False, f"Schema mismatch: {e.message}"
# Example schema
CODE_GEN_SCHEMA = {
"type": "object",
"properties": {
"code": {"type": "string"},
"explanation": {"type": "string"},
"test_cases": {
"type": "array",
"items": {"type": "string"}
}
},
"required": ["code", "explanation"],
"additionalProperties": False
}
@pytest.mark.quality
def test_schema_compliance(harness, benchmark_case):
output = harness.generate(benchmark_case['input'])
is_valid, error = validate_schema(output, CODE_GEN_SCHEMA)
assert is_valid, f"Schema validation failed: {error}"
Semantic Validation
def semantic_validation(output, context):
"""Check if output makes sense given context"""
checks = []
# 1. Non-empty output
if not output or len(output.strip()) < 10:
checks.append(('empty_output', False))
else:
checks.append(('empty_output', True))
# 2. Output addresses input (simple word overlap)
input_words = set(context['input'].lower().split())
output_words = set(output.lower().split())
overlap = len(input_words & output_words) / max(len(input_words), 1)
checks.append(('input_address', overlap > 0.2))
# 3. No common error phrases
error_phrases = ['error', 'failed', 'unable', 'cannot', 'invalid']
has_errors = any(phrase in output.lower() for phrase in error_phrases)
checks.append(('no_errors', not has_errors))
# 4. Reasonable length (task-dependent)
if context.get('expected_length') == 'short':
length_ok = 50 < len(output) < 500
elif context.get('expected_length') == 'long':
length_ok = 500 < len(output) < 5000
else:
length_ok = 20 < len(output) < 10000
checks.append(('reasonable_length', length_ok))
all_pass = all(check[1] for check in checks)
return all_pass, checks
@pytest.mark.quality
def test_semantic_validity(harness, benchmark_case):
output = harness.generate(benchmark_case['input'])
is_valid, checks = semantic_validation(output, benchmark_case)
failed = [c[0] for c in checks if not c[1]]
assert is_valid, f"Semantic validation failed: {failed}"
Factual Correctness
def factual_correctness_test(output, ground_truth, threshold=0.8):
"""Compare output against ground truth using multiple methods"""
# Method 1: Exact match
if output.strip() == ground_truth.strip():
return True, 'exact_match', 1.0
# Method 2: Semantic similarity (using embeddings)
embedding_similarity = compute_embedding_similarity(output, ground_truth)
if embedding_similarity > threshold:
return True, 'semantic_similarity', embedding_similarity
# Method 3: Key concepts overlap
output_concepts = extract_key_concepts(output)
truth_concepts = extract_key_concepts(ground_truth)
concept_overlap = len(output_concepts & truth_concepts) / len(truth_concepts)
if concept_overlap > threshold:
return True, 'concept_overlap', concept_overlap
return False, 'no_match', max(embedding_similarity, concept_overlap)
@pytest.mark.quality
@pytest.mark.parametrize("case", BENCHMARK_CASES)
def test_factual_correctness(harness, case):
output = harness.generate(case['input'])
is_correct, method, score = factual_correctness_test(
output,
case['expected_output'],
threshold=0.75
)
assert is_correct, f"Factual test failed ({method}: {score:.2f})"
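`factual_correctness_test` assumes two helpers, `compute_embedding_similarity` and `extract_key_concepts`, that are not defined here. Minimal stand-ins are sketched below, using bag-of-words cosine similarity in place of real embeddings; treat both as placeholders to swap for your embedding model and keyphrase extractor.

```python
import math
import re
from collections import Counter

STOPWORDS = {'the', 'a', 'an', 'is', 'are', 'of', 'to', 'and', 'in'}

def _tokens(text):
    """Lowercase word tokens with stopwords removed."""
    return [t for t in re.findall(r"[a-z0-9']+", text.lower()) if t not in STOPWORDS]

def compute_embedding_similarity(a, b):
    """Stand-in: bag-of-words cosine similarity (swap in real embeddings)."""
    va, vb = Counter(_tokens(a)), Counter(_tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def extract_key_concepts(text):
    """Stand-in: content words as a set (swap in NER / keyphrase extraction)."""
    return set(_tokens(text))
```

Because both stand-ins are lexical, they will under-score paraphrases; the thresholds in the test above (0.75-0.8) assume proper embeddings.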
Style Compliance
def check_style_compliance(output, style_rules):
"""Verify output follows style guidelines"""
violations = []
# Rule 1: Tone check
if style_rules.get('tone') == 'formal':
informal_words = ['gonna', 'wanna', 'hey', 'cool', 'awesome']
if any(word in output.lower() for word in informal_words):
violations.append('informal_tone')
# Rule 2: Length constraints
if style_rules.get('max_length'):
if len(output) > style_rules['max_length']:
violations.append(f"exceeds_max_length ({len(output)} > {style_rules['max_length']})")
# Rule 3: Format requirements
if style_rules.get('format') == 'bullet_points':
if not output.strip().startswith('-') and not output.strip().startswith('•'):
violations.append('not_bullet_points')
# Rule 4: Forbidden words/phrases
if style_rules.get('forbidden_words'):
for word in style_rules['forbidden_words']:
if word.lower() in output.lower():
violations.append(f"contains_forbidden_word: {word}")
return len(violations) == 0, violations
@pytest.mark.quality
def test_style_compliance(harness, case):
output = harness.generate(case['input'])
is_compliant, violations = check_style_compliance(output, STYLE_RULES)
assert is_compliant, f"Style violations: {violations}"
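The test above references a `STYLE_RULES` object defined elsewhere. A hypothetical configuration, with keys matching exactly what `check_style_compliance` reads, might look like:

```python
# Hypothetical style configuration; keys match what check_style_compliance reads
STYLE_RULES = {
    'tone': 'formal',                  # flags informal words ('gonna', 'hey', ...)
    'max_length': 1200,                # characters, not tokens
    'format': 'bullet_points',         # output must start with '-' or '•'
    'forbidden_words': ['guarantee', 'world-class'],
}
```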
9. Example Benchmark Suites
1. Coding Tasks
{
"name": "coding_fundamentals",
"task_type": "code_generation",
"language": "python",
"count": 45,
"difficulty_distribution": {
"easy": 20,
"medium": 15,
"hard": 10
},
"categories": ["strings", "loops", "functions", "data_structures", "algorithms"],
"cases": [
{
"id": "coding_001",
"difficulty": "easy",
"category": "strings",
"input": "Write a function to reverse a string without using built-in reverse function",
"expected": "def reverse_string(s):\n return s[::-1]\n # or: ''.join(reversed(s))",
"evaluation": "syntax_valid + test_cases_pass",
"test_cases": [
{"input": "hello", "expected": "olleh"},
{"input": "", "expected": ""},
{"input": "a", "expected": "a"}
]
},
{
"id": "coding_025",
"difficulty": "hard",
"category": "algorithms",
"input": "Implement a function that finds the longest increasing subsequence in O(n log n) time",
"expected": "def lis_length(nums):\n if not nums: return 0\n ...",
"evaluation": "syntax_valid + test_cases_pass + efficient",
"test_cases": [
{"input": [3,10,2,1,20], "expected": 3},
{"input": [3,3,3,3], "expected": 1},
{"input": [], "expected": 0}
],
"time_complexity": "O(n log n)",
"space_complexity": "O(n)"
}
]
}
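The `test_cases_pass` evaluation for cases like `coding_001` can be scored by executing the generated code against the case's test cases. A minimal sketch follows; the function name is assumed known per case, and `exec()` on model output is unsafe outside a sandboxed subprocess.

```python
def run_code_test_cases(code_str, func_name, test_cases):
    """Return the fraction of test cases the generated function passes."""
    namespace = {}
    exec(code_str, namespace)  # UNSAFE on untrusted output; sandbox in production
    func = namespace[func_name]
    passed = 0
    for tc in test_cases:
        try:
            if func(tc['input']) == tc['expected']:
                passed += 1
        except Exception:
            pass  # a runtime error counts as a failed case
    return passed / len(test_cases)
```

For `coding_001`, `run_code_test_cases(output, 'reverse_string', case['test_cases'])` returns 1.0 only when all three cases pass.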
2. Question Answering
{
"name": "qa_knowledge",
"task_type": "question_answering",
"count": 40,
"domains": ["science", "history", "geography", "literature"],
"difficulty_distribution": {
"easy": 15,
"medium": 15,
"hard": 10
},
"cases": [
{
"id": "qa_001",
"difficulty": "easy",
"domain": "science",
"question": "What is the chemical formula for water?",
"expected_answer": "H2O",
"acceptable_variations": ["h2o", "H₂O", "dihydrogen monoxide"],
"source": "general_knowledge",
"evaluation": "factual_match"
},
{
"id": "qa_025",
"difficulty": "hard",
"domain": "literature",
"question": "In 'Moby Dick', what is the name of Captain Ahab's ship?",
"expected_answer": "the Pequod",
"acceptable_variations": ["pequod", "the pequod"],
"source": "moby_dick_chapter_32",
"evaluation": "factual_match + synonym_tolerance"
}
]
}
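The `factual_match` evaluation with `acceptable_variations` amounts to normalizing both sides before comparing. A minimal sketch, where the normalization rules (lowercasing, whitespace collapse, dropping a leading article) are an assumption to tune per domain:

```python
def normalize_answer(s):
    """Lowercase, collapse whitespace, and drop a leading article."""
    s = ' '.join(s.lower().split())
    for article in ('the ', 'a ', 'an '):
        if s.startswith(article):
            s = s[len(article):]
            break
    return s

def factual_match(answer, case):
    """True if the answer matches the expected answer or any listed variation."""
    accepted = [case['expected_answer'], *case.get('acceptable_variations', [])]
    return normalize_answer(answer) in {normalize_answer(a) for a in accepted}
```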
3. Summarization
{
"name": "summarization_quality",
"task_type": "summarization",
"count": 30,
"document_types": ["news", "research", "legal", "technical"],
"target_length_words": [50, 100, 200],
"cases": [
{
"id": "sum_001",
"difficulty": "easy",
"document_type": "news",
"source": "reuters_article_20260415",
"document": "Full article text here...",
"target_length": 100,
"reference_summary": "Company X announced quarterly earnings of $Y billion, up Z% year-over-year.",
"evaluation": "rouge_l + manual_quality_score",
"metrics": {
"rouge_l_threshold": 0.4,
"brevity_penalty": 0.9,
"key_points_required": 2
}
}
]
}
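The `rouge_l` metric referenced above is an LCS-based F-measure. A dependency-free sketch is below; production evaluation usually uses a library such as rouge-score, which also handles stemming.

```python
def _lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            if tok_a == tok_b:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    """ROUGE-L F1 over whitespace tokens (no stemming, unlike library impls)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = _lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A case like `sum_001` passes when `rouge_l(summary, case['reference_summary'])` clears the suite's `rouge_l_threshold` of 0.4.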
4. Classification
{
"name": "sentiment_classification",
"task_type": "classification",
"num_classes": 3,
"classes": ["positive", "neutral", "negative"],
"count": 50,
"class_distribution": {
"positive": 18,
"neutral": 16,
"negative": 16
},
"cases": [
{
"id": "sent_001",
"difficulty": "easy",
"text": "I love this product! It's amazing!",
"expected_class": "positive",
"confidence_threshold": 0.7
},
{
"id": "sent_025",
"difficulty": "hard",
"text": "The product is okay, but it could be better in some ways.",
"expected_class": "neutral",
"confidence_threshold": 0.6,
"notes": "Subjective - some might classify as mildly positive"
}
]
}
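All four suites share the same shape (header metadata plus a `cases` list), so one driver can run any of them. A sketch, where `generate` and `evaluate` are placeholders for your harness call and the per-suite scoring rule:

```python
def run_suite(suite, generate, evaluate, pass_threshold=0.5):
    """generate(case) -> output; evaluate(case, output) -> score in [0, 1]."""
    results = []
    for case in suite['cases']:
        output = generate(case)
        results.append({
            'id': case['id'],
            'difficulty': case.get('difficulty'),
            'score': evaluate(case, output),
        })
    passed = sum(1 for r in results if r['score'] >= pass_threshold)
    return {'suite': suite['name'], 'pass_rate': passed / len(results), 'results': results}
```

Keeping the driver suite-agnostic means a new task type only needs a new `evaluate` function, not a new runner.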
10. Reporting & Dashboards
Quality Trends Over Time
def generate_trend_report(benchmark_name, days=90):
"""Generate a comprehensive trend report"""
# Fetch metrics over time
history = fetch_metric_history(benchmark_name, days=days)
# Calculate trends
report = {
'benchmark': benchmark_name,
'period_days': days,
'measurement_count': len(history),
'metrics': {}
}
# Metrics to track
metric_keys = ['score', 'latency_p50', 'latency_p95', 'success_rate', 'cost_per_success']
for metric in metric_keys:
values = [h['metrics'][metric] for h in history]
# Linear regression
x = np.arange(len(values))
slope, intercept, r_value, p_value, _ = scipy.stats.linregress(x, values)
report['metrics'][metric] = {
'current': values[-1],
'baseline': values[0],
'change_pct': ((values[-1] - values[0]) / values[0]) * 100,
'trend': ('improving' if slope < 0 else 'degrading')
    if metric in ('latency_p50', 'latency_p95', 'cost_per_success')
    else ('improving' if slope > 0 else 'declining'),
'trend_strength': abs(slope) / np.std(values) if np.std(values) > 0 else 0,
'statistically_significant': p_value < 0.05
}
return report
# Example report:
# Benchmark: python_basics (90 days)
#
# Metric Current Change Trend Significant
# ──────────────────────────────────────────────────────
# Score 87.5 +4.2% improving ↗ Yes
# Latency (p50) 1240ms +8.3% degrading ↗ Yes
# Latency (p95) 2100ms +2.1% stable → No
# Success Rate 96.5% +1.2% improving ↗ No
# Cost/Success $0.015 -3.5% improving ↘ Yes
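The same metric history feeds regression alerting. A sketch of a baseline comparison using a 5% threshold, direction-aware because latency and cost improve downward:

```python
def detect_regression(baseline, current, threshold_pct=5.0):
    """Flag metrics that moved more than threshold_pct in the bad direction."""
    lower_is_better = {'latency_p50', 'latency_p95', 'cost_per_success'}
    alerts = []
    for metric, base in baseline.items():
        if metric not in current or base == 0:
            continue
        change_pct = (current[metric] - base) / base * 100
        regressed = (change_pct > threshold_pct if metric in lower_is_better
                     else change_pct < -threshold_pct)
        if regressed:
            alerts.append({'metric': metric, 'change_pct': round(change_pct, 1)})
    return alerts
```

Wire the returned alerts into whatever notification channel the team already watches; an empty list means no metric crossed the threshold.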
Model Comparison Charts
def plot_model_comparison_matrix(models, benchmark_suite):
"""Create side-by-side comparison charts"""
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle(f'Model Comparison: {benchmark_suite.name}')
# 1. Quality Score Distribution
ax = axes[0, 0]
for model in models:
results = run_benchmark(model, benchmark_suite)
scores = [r['score'] for r in results]
ax.hist(scores, alpha=0.6, label=model.name, bins=10)
ax.set_xlabel('Quality Score')
ax.set_ylabel('Frequency')
ax.set_title('Score Distribution')
ax.legend()
ax.grid(True, alpha=0.3)
# 2. Latency Percentiles
ax = axes[0, 1]
model_names = []
p50_latencies = []
p95_latencies = []
p99_latencies = []
for model in models:
results = run_benchmark(model, benchmark_suite)
latencies = [r['latency_ms'] for r in results]
model_names.append(model.name)
p50_latencies.append(np.percentile(latencies, 50))
p95_latencies.append(np.percentile(latencies, 95))
p99_latencies.append(np.percentile(latencies, 99))
x = np.arange(len(model_names))
width = 0.25
ax.bar(x - width, p50_latencies, width, label='p50')
ax.bar(x, p95_latencies, width, label='p95')
ax.bar(x + width, p99_latencies, width, label='p99')
ax.set_xlabel('Model')
ax.set_ylabel('Latency (ms)')
ax.set_title('Latency Percentiles')
ax.set_xticks(x)
ax.set_xticklabels(model_names, rotation=45)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
# 3. Cost vs Quality
ax = axes[1, 0]
for model in models:
results = run_benchmark(model, benchmark_suite)
avg_score = np.mean([r['score'] for r in results])
cost = model.cost_usd
ax.scatter(cost, avg_score, s=200, label=model.name)
ax.set_xlabel('Cost per Call ($)')
ax.set_ylabel('Quality Score')
ax.set_title('Cost vs Quality Trade-off')
ax.legend()
ax.grid(True, alpha=0.3)
# 4. Error Rates
ax = axes[1, 1]
error_rates = []
for model in models:
results = run_benchmark(model, benchmark_suite)
error_rate = sum(1 for r in results if not r.get('success')) / len(results) * 100
error_rates.append(error_rate)
ax.barh(model_names, error_rates, color='coral')
ax.set_xlabel('Error Rate (%)')
ax.set_title('Error Rates')
ax.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
return fig
Error Breakdown by Category
def generate_error_report(benchmark_results):
"""Detailed error analysis"""
errors_by_category = {}
errors_by_difficulty = {}
for result in benchmark_results:
if not result.get('success'):
category = result.get('error_category', 'unknown')
difficulty = result.get('difficulty', 'unknown')
# By category
errors_by_category[category] = errors_by_category.get(category, 0) + 1
# By difficulty
key = f"{difficulty}_{category}"
errors_by_difficulty[key] = errors_by_difficulty.get(key, 0) + 1
total_errors = sum(errors_by_category.values())
report = {
'total_errors': total_errors,
'error_rate_pct': (total_errors / len(benchmark_results)) * 100,
'by_category': {
k: {'count': v, 'pct': (v / total_errors) * 100}
for k, v in sorted(errors_by_category.items(), key=lambda x: x[1], reverse=True)
},
'by_difficulty': errors_by_difficulty
}
return report
# Example output:
# Total Errors: 42 (5.2%)
#
# By Category:
# Schema Error: 18 (42.9%)
# Timeout: 8 (19.0%)
# Hallucination: 7 (16.7%)
# Logic Error: 5 (11.9%)
# Incomplete Output: 4 (9.5%)
#
# By Difficulty:
# Easy errors: 10 (23.8%)
# Medium errors: 15 (35.7%)
# Hard errors: 17 (40.5%)
Cost vs Quality Scatter Plot
def plot_cost_quality_analysis(models_analysis):
"""Plot cost vs quality to show trade-offs"""
fig, ax = plt.subplots(figsize=(12, 8))
# Size = tokens per call, color = latency
for analysis in models_analysis:
ax.scatter(
analysis['cost_per_call'],
analysis['quality_score'],
s=analysis['tokens_per_call'] / 5, # Scale for visibility
c=analysis['latency_ms'],
cmap='viridis',
alpha=0.7,
edgecolors='black',
linewidth=1.5,
label=analysis['model']
)
# Annotate model name
ax.annotate(
analysis['model'],
(analysis['cost_per_call'], analysis['quality_score']),
xytext=(5, 5),
textcoords='offset points',
fontsize=9
)
ax.set_xlabel('Cost per Call ($)', fontsize=12)
ax.set_ylabel('Quality Score (0-100)', fontsize=12)
ax.set_title('Cost vs Quality Trade-off Analysis', fontsize=14)
ax.grid(True, alpha=0.3)
# Add colorbar for latency
cbar = plt.colorbar(ax.collections[0], ax=ax)
cbar.set_label('Latency (ms)', fontsize=11)
# Legend for bubble size
sizes = [100, 500, 1000]
for size in sizes:
    ax.scatter([], [], s=size / 5, c='gray', alpha=0.6,
               edgecolors='black', label=f'{size} tokens')
ax.legend(title='Tokens/call', loc='lower right')
plt.tight_layout()
return fig
Monthly/Quarterly Reports
def generate_quarterly_report(benchmark_suite, quarter):
"""Comprehensive quality report for business review"""
# Fetch data for the quarter
start_date = get_quarter_start(quarter)
end_date = get_quarter_end(quarter)
history = fetch_metric_history(benchmark_suite, start_date, end_date)
report = {
'title': f'{benchmark_suite} Quality Report - Q{quarter}',
'period': f'{start_date.strftime("%Y-%m-%d")} to {end_date.strftime("%Y-%m-%d")}',
'executive_summary': {},
'metrics': {},
'comparisons': {},
'improvements': [],
'regressions': [],
'recommendations': []
}
# Executive summary
first = history[0]
last = history[-1]
report['executive_summary'] = {
'quality_change': f"{last['metrics']['score'] - first['metrics']['score']:+.1f} points",
'quality_trend': 'improving' if last['metrics']['score'] > first['metrics']['score'] else 'declining',
'efficiency_change': f"{((last['metrics']['cost_per_success'] - first['metrics']['cost_per_success']) / first['metrics']['cost_per_success']) * 100:+.1f}%",
'reliability': f"{last['metrics']['success_rate']:.1f}%",
'key_achievements': identify_achievements(history),
'priority_issues': identify_issues(history)
}
# Detailed metrics
for metric in ['score', 'success_rate', 'cost_per_success', 'latency_p95']:
values = [h['metrics'][metric] for h in history]
report['metrics'][metric] = {
'start': first['metrics'][metric],
'end': last['metrics'][metric],
'change_pct': ((last['metrics'][metric] - first['metrics'][metric]) / first['metrics'][metric]) * 100,
'mean': np.mean(values),
'std': np.std(values),
'min': min(values),
'max': max(values)
}
# Model comparisons (if testing multiple models)
# ... (comparison metrics)
# Improvements made
report['improvements'] = analyze_improvements(history)
# Regressions detected
report['regressions'] = analyze_regressions(history)
# Recommendations (illustrative; derive these from the metrics above in practice)
report['recommendations'] = [
'Focus on latency optimization (p95 increased 15%)',
'Review error categories - schema errors up from 2% to 4%',
'Consider model upgrade - quality gains plateau',
'Implement caching for high-latency operations'
]
return report
Best Practices Summary
- Define Clear Metrics: Start with task-specific success criteria. Not one-size-fits-all.
- Automate Everything Possible: Regex validation, schema checks, syntax verification. Save human review for subjective calls.
- Establish Baselines: You can’t measure improvement without knowing where you started.
- Run Benchmarks Regularly: Daily or weekly, depending on release cadence. Catch regressions early.
- Use Representative Test Cases: 40-50 well-chosen cases beat 500 random ones. Include edge cases.
- Statistical Rigor: Not all improvements are real. Use significance testing before declaring victory.
- Correlate Changes with Improvements: Link commits to performance jumps to identify what works.
- Track Multiple Dimensions: Quality alone isn’t enough. Monitor cost, latency, and reliability too.
- Create Clear Dashboards: Trends matter more than point values. Show before/after, not just latest.
- Report to Stakeholders: Monthly summaries help product managers understand quality trajectory and ROI.
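The "Statistical Rigor" point can be as simple as a paired bootstrap on per-case scores. A dependency-free sketch; 10,000 resamples and the one-sided framing are conventional choices here, not requirements:

```python
import random
import statistics

def paired_bootstrap(scores_a, scores_b, iters=10_000, seed=0):
    """One-sided paired bootstrap: how often does B's advantage disappear?"""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = statistics.mean(diffs)
    not_better = sum(
        statistics.mean([diffs[rng.randrange(n)] for _ in range(n)]) <= 0
        for _ in range(iters)
    )
    return observed, not_better / iters  # (mean improvement, approx p-value)
```

If the returned p-value is above 0.05, treat the "improvement" as noise and keep collecting cases before switching models or prompts.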
Implementation Checklist
- Define domain-specific success criteria
- Create initial benchmark suite (30-50 representative cases)
- Build automated validation tests (schema, syntax, semantic)
- Establish baseline metrics on current model
- Set up daily benchmark runs (CI or scheduled task)
- Create regression detection alerts
- Build trend analysis dashboard
- Implement A/B testing framework for models
- Create prompt variation testing harness
- Set up tool evaluation metrics
- Generate first monthly quality report
- Document benchmark maintenance procedures
- Train team on reading/interpreting reports
Validation Checklist
How do you know you got this right?
Performance Checks
- Benchmark suite runs in <5 minutes (40-50 cases, automated evaluation)
- Metrics calculated in <30 seconds from benchmark data
- Dashboard refreshes daily with <1 minute latency
- Regression detection triggers within 5% deviation from baseline
Implementation Checks
- Domain-specific success metrics defined (code: tests pass, content: relevance, etc)
- Benchmark suite created: 40-50 representative cases including edge cases
- Automated evaluation for 70%+ of metrics (schema, syntax, correctness)
- Baseline established: before/after state clear for all metrics
- Daily benchmark runs configured and executing reliably
- A/B testing framework: can switch models and measure impact
- Trend analysis working: can spot improvement/regression over time
Integration Checks
- Benchmark connects to harness: can run against current version automatically
- Quality metrics flow to dashboard: real-time monitoring visible
- Regression alerts configured: team notified of >5% quality drop
- Cost metrics tracked alongside quality: cost per quality point visible
- Prompt evaluation: can measure impact of prompt changes (doc 15)
Common Failure Modes
- Benchmark not representative: Test cases don’t match real usage distribution
- Metrics not actionable: Dashboard shows numbers but unclear what to do
- Regression detection too sensitive: False alarms for 1% variance
- Manual evaluation bottleneck: Can’t scale to monitor all outputs
- Baseline stale: Original baseline never updated, making progress invisible
Sign-Off Criteria
- Benchmark suite running daily on CI/scheduler
- Baseline metrics established and documented
- Dashboard showing trends over 2+ weeks
- A/B test completed: measured impact of one model or prompt change
- First regression detected and handled (false alarm or real improvement)
- Team understands metrics and dashboard interpretation
See Also
- Doc 15 (Prompt Engineering): Measure prompt optimization impact with these metrics
- Doc 03 (Hugging Face): Model selection validated through benchmarking
- Doc 05 (AI Agents): Framework choice validated by completion rate metrics