
Testing & Quality Assurance

Testing non-deterministic AI systems — success rate measurement, regression detection, acceptance criteria templates, and statistical significance.

Introduction: Why Testing Harnesses Is Different

Traditional software testing assumes determinism: run the same input, get the same output every time. AI harnesses violate this assumption.

The fundamental challenge: An LLM producing different outputs each run means your tests need to measure statistical success, not binary pass/fail.

Example:

  • Traditional test: “Given X, output is exactly Y” ✅ or ❌
  • Harness test: “Given X, output meets criteria C 94% of time” — and that 94% is the metric

This document covers testing strategies for non-deterministic systems, quality metrics that matter, regression detection patterns, and a complete pre-deployment checklist.


Part 1: Testing Non-Deterministic Systems

The Core Challenge: Stochasticity

LLMs use temperature/sampling to generate variety. This is desirable for creative tasks but complicates testing:

# Traditional software test (deterministic)
def test_add():
    assert add(2, 3) == 5  # Always true

# LLM test (non-deterministic) - WRONG approach
def test_agent_finds_bug():
    result = agent.run("Find the bug in this code")
    assert "bug found" in result  # Fails 10% of time (non-deterministic)

Solution: Collect statistics over multiple runs.

# Correct: Test the success rate, not individual runs
import statistics

def test_agent_success_rate():
    runs = 50
    successes = 0
    for _ in range(runs):
        result = agent.run("Find the bug in this code")
        if "bug found" in result:
            successes += 1
    
    success_rate = successes / runs
    assert success_rate >= 0.90, f"Expected ≥90%, got {success_rate:.1%}"
    # Result: agent succeeds 48/50 = 96% ✅ (meets the 90% threshold)

Multiple-Run Testing Strategy

Instead of one pass/fail per test, run N times and measure:

| Metric | Interpretation | Threshold |
|---|---|---|
| Success rate | Percentage of runs that achieve the goal | ≥90% for production |
| Mean latency | Average time per run | <10s for interactive agents |
| Latency p95 | 95th percentile (worst-case performance) | <30s for batch agents |
| Cost per success | $ spent divided by successful runs | Budget-dependent |
| Failure modes | Why did runs fail? (categories) | <5% unexplained |
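
These thresholds are checked against point estimates, but with 50 runs the observed success rate carries real sampling error. A minimal sketch of quantifying that uncertainty with a Wilson score interval (the helper name is ours, not part of the harness):

```python
import math

def wilson_interval(successes, runs, z=1.96):
    """95% Wilson score interval for an observed success rate.

    Tighter than the normal approximation at small run counts,
    so it is a safer basis for pass/fail thresholds."""
    if runs == 0:
        return (0.0, 1.0)
    p = successes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2))
    return (center - margin, center + margin)

lo, hi = wilson_interval(48, 50)
# interval ≈ (0.865, 0.989): an observed 96% is still consistent with a
# true rate below the 90% target, so a tight threshold needs more runs
```

In practice this argues for 100+ runs before trusting a threshold that sits close to the observed rate.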

Setting Up Multiple-Run Tests

import pytest
import time
from dataclasses import dataclass
from typing import Optional  # needed for RunResult.error_message

@dataclass
class RunResult:
    success: bool
    latency_seconds: float
    cost: float
    error_message: Optional[str] = None

class AgentTestSuite:
    def __init__(self, agent, num_runs=50):
        self.agent = agent
        self.num_runs = num_runs
    
    def run_test(self, task_description, validation_fn):
        """
        Run task N times, validate each with validation_fn
        
        Args:
            task_description: Prompt to give agent
            validation_fn: Function that returns True if output is correct
        
        Returns:
            List of RunResult objects
        """
        results = []
        for i in range(self.num_runs):
            start = time.time()
            try:
                output = self.agent.run(task_description)
                success = validation_fn(output)
                latency = time.time() - start
                cost = self.agent.last_token_cost()
                
                results.append(RunResult(
                    success=success,
                    latency_seconds=latency,
                    cost=cost,
                    error_message=None if success else "Validation failed"
                ))
            except Exception as e:
                latency = time.time() - start
                results.append(RunResult(
                    success=False,
                    latency_seconds=latency,
                    cost=self.agent.last_token_cost(),
                    error_message=str(e)
                ))
        
        return results
    
    def analyze_results(self, results, test_name):
        """Print summary statistics"""
        successes = sum(1 for r in results if r.success)
        success_rate = successes / len(results)
        
        latencies = [r.latency_seconds for r in results]
        costs = [r.cost for r in results]
        
        print(f"\n=== {test_name} ===")
        print(f"Success rate: {successes}/{len(results)} = {success_rate:.1%}")
        print(f"Mean latency: {sum(latencies) / len(latencies):.2f}s")
        print(f"P95 latency: {sorted(latencies)[int(len(latencies) * 0.95)]:.2f}s")
        print(f"Total cost: ${sum(costs):.2f}")
        print(f"Cost per success: ${sum(costs) / successes:.4f}")
        
        # Failure analysis
        failures_by_reason = {}
        for r in results:
            if not r.success:
                reason = r.error_message or "Unknown"
                failures_by_reason[reason] = failures_by_reason.get(reason, 0) + 1
        
        if failures_by_reason:
            print(f"Failures by reason:")
            for reason, count in failures_by_reason.items():
                print(f"  {reason}: {count}")
        
        return {
            'success_rate': success_rate,
            'mean_latency': sum(latencies) / len(latencies),
            'p95_latency': sorted(latencies)[int(len(latencies) * 0.95)],
            'total_cost': sum(costs),
            'cost_per_success': sum(costs) / successes if successes > 0 else float('inf')
        }

# Usage example
def test_code_review_agent():
    suite = AgentTestSuite(agent=my_code_reviewer, num_runs=50)
    
    # Test: Agent finds bugs in intentionally broken code
    test_code = """
    def calculate_discount(price, discount):
        return price - (price * discount)  # Should be / not -
    """
    
    results = suite.run_test(
        task_description=f"Find bugs in this code:\n{test_code}",
        validation_fn=lambda output: "discount" in output.lower() and "wrong" in output.lower()
    )
    
    metrics = suite.analyze_results(results, "Code Review - Bug Detection")
    
    # Assert success rate meets threshold
    assert metrics['success_rate'] >= 0.85, f"Success rate {metrics['success_rate']:.1%} below 85% threshold"
    assert metrics['mean_latency'] < 10.0, f"Mean latency {metrics['mean_latency']:.1f}s exceeds 10s"

Seed Management: When to Fix Randomness

By default, LLM sampling is random. Sometimes you want reproducibility:

| Scenario | Temperature Setting | Why |
|---|---|---|
| Production agent | 0.7 (default) | Good balance: some variety, mostly coherent |
| Debugging failure case | 0.0 (deterministic) | Reproduce exact failure reliably |
| Stress test | 0.3–0.5 | Controlled variety, less variance |
| Creative tasks | 0.9+ | Maximum diversity (brainstorming) |

# Reproduce a specific failure
def test_agent_specific_failure():
    """When an agent fails, set temp=0.0 to reproduce deterministically"""
    agent.set_temperature(0.0)
    result = agent.run("Reproduce the failure case")
    agent.set_temperature(0.7)  # Reset to normal
    
    # Now inspect what went wrong
    assert validate(result), f"Failure reproduced: {result}"

# Stress test with controlled randomness
def test_agent_stability():
    """Test with reduced variance to detect instability"""
    agent.set_temperature(0.4)
    
    results = []
    for _ in range(20):
        output = agent.run("task")
        results.append(output)
    
    # Check: Are outputs mostly consistent?
    unique_outputs = len(set(results))
    agent.set_temperature(0.7)  # Reset
    
    print(f"Unique outputs from 20 runs: {unique_outputs}")

Part 2: Unit Testing Agents & Tools

Testing Individual Tools in Isolation

Tools are the most testable component. Unit test them independently of the agent:

import pytest
from unittest.mock import patch, MagicMock

class TestWebSearchTool:
    """Test web search tool without involving the agent"""
    
    def test_search_success(self):
        """Normal case: search returns results"""
        from harness.tools import web_search
        
        results = web_search("Python list comprehension")
        assert len(results) > 0
        assert all('title' in r and 'url' in r for r in results)
    
    def test_search_empty_results(self):
        """Edge case: unusual query returns few results"""
        from harness.tools import web_search
        
        results = web_search("xyzabc1234randomquery")
        assert isinstance(results, list)
        # Don't assert len > 0; empty results are valid
    
    def test_search_timeout(self):
        """Failure case: network timeout"""
        from harness.tools import web_search
        
        with patch('requests.get', side_effect=TimeoutError("Connection timeout")):
            with pytest.raises(Exception) as exc_info:
                web_search("query")
            assert "timeout" in str(exc_info.value).lower()
    
    def test_search_input_validation(self):
        """Security: reject dangerous inputs"""
        from harness.tools import web_search
        
        # Should sanitize or reject HTML injection attempts
        dangerous_query = "<script>alert('xss')</script>"
        result = web_search(dangerous_query)
        # This tool sanitizes and returns a list; a rejecting tool would raise instead
        assert isinstance(result, list)

class TestCodeExecutionTool:
    """Test code execution safely in sandbox"""
    
    def test_execute_valid_python(self):
        """Normal case: execute safe Python"""
        from harness.tools import execute_code
        
        output = execute_code("print(2 + 2)")
        assert "4" in output
    
    def test_execute_timeout(self):
        """Failure case: infinite loop (timeout)"""
        from harness.tools import execute_code
        
        with pytest.raises(TimeoutError):
            execute_code("while True: pass", timeout=2)
    
    def test_execute_blocked_operations(self):
        """Security: prevent file deletion"""
        from harness.tools import execute_code
        
        with pytest.raises(PermissionError):
            execute_code("import os; os.remove('/etc/passwd')")
    
    def test_execute_with_syntax_error(self):
        """Error handling: Python syntax error"""
        from harness.tools import execute_code
        
        with pytest.raises(SyntaxError):
            execute_code("print(this is broken")

# Run all tool tests
if __name__ == "__main__":
    pytest.main([__file__, "-v"])

Testing Tool Integration

Now test how tools work together:

class TestToolIntegration:
    """Test tools in combination without the full agent"""
    
    def test_search_then_parse(self):
        """Pipeline: search → parse results → validate"""
        from harness.tools import web_search, parse_html
        
        # Search for Python docs
        results = web_search("Python documentation")
        assert len(results) > 0
        
        # Parse first result
        first_url = results[0]['url']
        content = parse_html(first_url)
        assert len(content) > 0
        assert "python" in content.lower()
    
    def test_read_file_then_execute_code(self):
        """Pipeline: read Python file → execute it"""
        from harness.tools import read_file, execute_code
        import tempfile
        
        # Create temp Python file
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
            f.write("result = 'success'\nprint(result)")
            temp_path = f.name
        
        try:
            # Read and execute
            code = read_file(temp_path)
            output = execute_code(code)
            assert "success" in output
        finally:
            import os
            os.unlink(temp_path)

Mocking the LLM for Tool Testing

When you want to test tool handling without paying for API calls:

from unittest.mock import patch, MagicMock

class TestToolHandling:
    """Test how agent handles tool calls, without real LLM"""
    
    def test_agent_calls_search_when_asked(self):
        """Agent should call web search for information retrieval"""
        # Mock the LLM to always request web search
        mock_llm = MagicMock()
        mock_llm.generate.return_value = {
            'tool_calls': [
                {
                    'tool': 'web_search',
                    'args': {'query': 'Python async await'},
                    'id': 'call_1'
                }
            ]
        }
        
        agent = MyAgent(llm=mock_llm)
        result = agent.run("How does async/await work?")
        
        # Verify agent called the tool
        assert mock_llm.generate.called
        tool_call = mock_llm.generate.return_value['tool_calls'][0]
        assert tool_call['tool'] == 'web_search'
        assert 'async' in tool_call['args']['query'].lower()
    
    def test_agent_retries_failed_tool_call(self):
        """Agent should retry when tool fails"""
        mock_tool = MagicMock(side_effect=[
            Exception("Network error"),  # First call fails
            "Success"  # Second call succeeds
        ])
        
        agent = MyAgent(tools={'search': mock_tool})
        result = agent.run_with_retry("test", max_retries=2)
        
        assert mock_tool.call_count == 2
        assert "Success" in result

Part 3: Integration Testing

End-to-End Agent Testing

Test the complete agent workflow:

class TestAgentEndToEnd:
    """Complete agent workflow: setup → run → verify"""
    
    def test_agent_solves_coding_task(self):
        """E2E: Agent writes code to solve a specific problem"""
        agent = MyCodeWritingAgent()
        agent.setup()  # Initialize memory, tools, etc.
        
        # Give agent a task
        result = agent.run(
            "Write a Python function that finds all prime numbers up to 100"
        )
        
        # Verify result is valid Python
        assert "def " in result
        assert "prime" in result.lower()
        
        # Try executing the code (in sandbox); assumes the agent
        # named its function find_primes
        from harness.tools import execute_code
        output = execute_code(result + "\nprint(find_primes(20))")
        
        # Parse the printed list and verify correctness (eval is acceptable
        # here only because the code ran in a sandbox and we control the print)
        primes = eval(output.strip())
        expected = [2, 3, 5, 7, 11, 13, 17, 19]
        assert primes == expected
    
    def test_agent_multi_tool_workflow(self):
        """E2E: Agent uses multiple tools in sequence"""
        agent = MyResearchAgent()
        
        # Task requiring search → parsing → code execution
        result = agent.run(
            "Research how to implement quicksort, write code, test it"
        )
        
        # Verify workflow:
        # 1. Agent searched (check memory/logs)
        # 2. Agent wrote code
        # 3. Agent ran tests
        
        assert len(agent.memory['searches']) > 0
        assert len(agent.memory['code_written']) > 0
        assert agent.memory['test_results']['success']

Memory Persistence Testing

Verify agent remembers across sessions:

class TestMemoryPersistence:
    """Test that agent state survives restart"""
    
    def test_session_continuity(self):
        """Agent remembers facts across sessions"""
        import tempfile
        import shutil
        
        temp_dir = tempfile.mkdtemp()
        try:
            # Session 1: Tell agent something
            agent1 = MyAgent(workspace=temp_dir)
            agent1.setup()
            agent1.run("Remember that my favorite color is blue")
            agent1.save()
            
            # Session 2: Agent should recall
            agent2 = MyAgent(workspace=temp_dir)
            agent2.load()
            response = agent2.run("What is my favorite color?")
            
            assert "blue" in response.lower()
        finally:
            shutil.rmtree(temp_dir)
    
    def test_memory_corruption_recovery(self):
        """Corrupted memory doesn't crash agent"""
        import tempfile
        import json
        
        temp_dir = tempfile.mkdtemp()
        memory_file = f"{temp_dir}/memory.json"
        
        # Write corrupt JSON
        with open(memory_file, 'w') as f:
            f.write("{ invalid json")
        
        agent = MyAgent(workspace=temp_dir)
        # Should handle gracefully
        agent.load()  # Should not crash
        agent.setup()  # Should initialize with defaults
        
        # Should still work
        result = agent.run("test")
        assert result is not None

Failure Scenario Testing

Test how agent handles problems:

class TestFailureScenarios:
    """Test agent resilience to failures"""
    
    def test_tool_timeout(self):
        """Tool times out; agent should handle gracefully"""
        from harness.tools import web_search
        
        with patch('requests.get', side_effect=TimeoutError("timeout")):
            agent = MyAgent()
            # Should handle timeout without crashing
            result = agent.run("Search for something")
            
            # Either falls back or retries
            assert result is not None or agent.error_log[-1]['type'] == 'timeout'
    
    def test_insufficient_context_window(self):
        """Context window filled; agent should summarize"""
        agent = MyAgent(context_limit=2000)  # Very small limit
        
        # Add lots of context
        large_text = "x" * 10000
        agent.add_to_context(large_text)
        
        # Agent should trim/summarize, not crash
        result = agent.run("What is the task?")
        assert result is not None
    
    def test_budget_exceeded(self):
        """Agent exceeds cost budget; should stop gracefully"""
        agent = MyAgent(token_budget=1000)
        
        # Mock expensive LLM calls
        expensive_task = "Solve this very complex problem " * 100
        
        with pytest.raises(BudgetExceededError):
            agent.run(expensive_task)
        
        # Agent should stop, not try to hide the error
        assert agent.total_cost > agent.token_budget
    
    def test_agent_stuck_in_loop(self):
        """Agent keeps doing same action; iteration limit kicks in"""
        # Create scenario where agent gets stuck
        mock_tool = MagicMock(return_value="unchanged")
        agent = MyAgent(tools={'test_tool': mock_tool}, max_iterations=5)
        
        result = agent.run("Do something", iteration_limit=5)
        
        # Should stop after 5 iterations even if not solved
        assert mock_tool.call_count <= 5
        assert "iteration limit" in agent.status.lower() or not result

Part 4: Regression Detection

Establishing Baselines

Before you can detect regressions, you need a baseline:

import json
from datetime import datetime

class BaselineManager:
    """Manage performance baselines for regression detection"""
    
    def __init__(self, baseline_file="baselines.json"):
        self.baseline_file = baseline_file
        self.baselines = self.load_baselines()
    
    def load_baselines(self):
        """Load existing baselines or create empty"""
        try:
            with open(self.baseline_file, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            return {}
    
    def save_baselines(self):
        """Persist baselines to disk"""
        with open(self.baseline_file, 'w') as f:
            json.dump(self.baselines, f, indent=2)
    
    def establish_baseline(self, test_name, results):
        """
        Establish a new baseline from multiple test runs.
        Call once when you set up a test.
        
        Args:
            test_name: Name of the test
            results: List of RunResult objects from run_test()
        """
        successes = sum(1 for r in results if r.success)
        latencies = [r.latency_seconds for r in results]
        
        baseline = {
            'test_name': test_name,
            'established': datetime.now().isoformat(),
            'num_runs': len(results),
            'success_rate': successes / len(results),
            'mean_latency': sum(latencies) / len(latencies),
            'p95_latency': sorted(latencies)[int(len(latencies) * 0.95)],
            'model_version': 'claude-3-sonnet',  # Or whatever model you use
            'notes': 'Initial baseline'
        }
        
        self.baselines[test_name] = baseline
        self.save_baselines()
        
        print(f"Baseline established for '{test_name}':")
        print(f"  Success rate: {baseline['success_rate']:.1%}")
        print(f"  Mean latency: {baseline['mean_latency']:.2f}s")

# Usage: Establish baselines when creating a new test
def setup_test_baselines():
    """Run once to establish baselines for all tests"""
    manager = BaselineManager()
    suite = AgentTestSuite(agent=my_agent, num_runs=100)
    
    # Test 1: Code review
    results = suite.run_test(
        "Review this code for bugs: ...",
        validation_fn=lambda out: "bug" in out.lower()
    )
    manager.establish_baseline("code_review", results)
    
    # Test 2: Document analysis
    results = suite.run_test(
        "Summarize this document: ...",
        validation_fn=lambda out: len(out) > 50
    )
    manager.establish_baseline("summarization", results)

Regression Detection: Continuous Testing

Run tests regularly and compare to baseline:

class RegressionDetector:
    """Detect quality degradation vs baseline"""
    
    def __init__(self, baseline_manager):
        self.baseline_manager = baseline_manager
        self.alert_threshold = 0.05  # Alert if success rate drops 5+ percentage points
    
    def test_for_regression(self, test_name, results):
        """
        Run test and compare to baseline.
        Alert if performance degrades.
        """
        baseline = self.baseline_manager.baselines.get(test_name)
        if not baseline:
            print(f"No baseline for {test_name}; skipping regression check")
            return
        
        # Calculate current metrics
        successes = sum(1 for r in results if r.success)
        current_success_rate = successes / len(results)
        
        latencies = [r.latency_seconds for r in results]
        current_mean_latency = sum(latencies) / len(latencies)
        current_p95 = sorted(latencies)[int(len(latencies) * 0.95)]
        
        # Compare to baseline
        baseline_success = baseline['success_rate']
        baseline_latency = baseline['mean_latency']
        
        success_delta = current_success_rate - baseline_success
        latency_delta = current_mean_latency - baseline_latency
        
        print(f"\n=== Regression Check: {test_name} ===")
        print(f"Success rate: {baseline_success:.1%}{current_success_rate:.1%}{success_delta:+.1%})")
        print(f"Mean latency: {baseline_latency:.2f}s → {current_mean_latency:.2f}s (Δ {latency_delta:+.2f}s)")
        
        # Alerts
        alerts = []
        
        if success_delta < -self.alert_threshold:
            alerts.append(f"ALERT: Success rate dropped {abs(success_delta):.1%}")
        
        if latency_delta > (baseline_latency * 0.2):  # 20% latency increase
            alerts.append(f"ALERT: Latency increased {abs(latency_delta):.2f}s")
        
        if alerts:
            for alert in alerts:
                print(alert)
            return False  # Regression detected
        
        print("✓ No regression detected")
        return True  # Passed

# Usage: Run regression tests after model update
def test_after_model_update():
    """
    After updating the model, verify quality didn't degrade.
    Run frequently (daily, per-commit, weekly).
    """
    detector = RegressionDetector(baseline_manager)
    suite = AgentTestSuite(agent=my_agent, num_runs=50)  # Fewer runs for frequent testing
    
    # Run each registered test
    for test_name in ['code_review', 'summarization', 'bug_finding']:
        results = suite.run_test(
            task_description=test_prompts[test_name],
            validation_fn=validators[test_name]
        )
        
        passed = detector.test_for_regression(test_name, results)
        
        if not passed:
            # Optional: Rollback, alert, etc.
            print(f"Regression in {test_name}; consider rolling back")

Tracking What Triggers Regression Tests

Document when regressions happen:

class RegressionLog:
    """Track regression incidents and causes"""
    
    def __init__(self, log_file="regression_log.json"):
        self.log_file = log_file
        self.log = self.load_log()
    
    def log_regression(self, test_name, cause, severity):
        """
        Record a regression incident.
        
        Args:
            test_name: Which test regressed
            cause: What caused it (e.g., "model_update", "config_change")
            severity: "minor" (5%), "moderate" (10%), "severe" (20%+)
        """
        entry = {
            'timestamp': datetime.now().isoformat(),
            'test_name': test_name,
            'cause': cause,
            'severity': severity,
            'git_commit': get_current_git_commit(),
            'model_version': get_model_version()
        }
        
        self.log.append(entry)
        self.save_log()
        print(f"Logged regression: {cause} in {test_name}")
    
    def analyze_causes(self):
        """What changes most often cause regressions?"""
        from collections import Counter
        
        causes = [entry['cause'] for entry in self.log]
        cause_counts = Counter(causes)
        
        print("\nMost common regression causes:")
        for cause, count in cause_counts.most_common():
            print(f"  {cause}: {count} incidents")
    
    def load_log(self):
        try:
            with open(self.log_file, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            return []
    
    def save_log(self):
        with open(self.log_file, 'w') as f:
            json.dump(self.log, f, indent=2)

# Usage: Log regression when detected
if not regression_detector.test_for_regression(test_name, results):
    logger = RegressionLog()
    logger.log_regression(
        test_name=test_name,
        cause="model_update",  # or "config_change", "tool_modification"
        severity="moderate"
    )
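
The severity labels in the docstring map to success-rate drops; a small helper keeps that mapping consistent instead of leaving it to judgment at log time (thresholds mirror the docstring and are illustrative):

```python
def classify_severity(baseline_rate, current_rate):
    """Map a success-rate drop to the regression log's severity labels.

    Thresholds follow the log_regression docstring:
    ~5% minor, ~10% moderate, 20%+ severe."""
    drop = baseline_rate - current_rate
    if drop >= 0.20:
        return "severe"
    if drop >= 0.10:
        return "moderate"
    if drop >= 0.05:
        return "minor"
    return "none"

classify_severity(0.95, 0.82)  # "moderate": a 13-point drop
```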

Part 5: Performance Testing

Load Testing Agents

How many concurrent requests can your harness handle?

import concurrent.futures
import time
from statistics import mean, stdev

class LoadTestAgent:
    """Simulate multiple concurrent agent requests"""
    
    def __init__(self, agent, num_workers=5):
        self.agent = agent
        self.num_workers = num_workers
    
    def run_load_test(self, task, num_requests=100):
        """
        Simulate load: num_requests tasks, num_workers concurrent.
        Measure throughput and latency.
        """
        results = []
        errors = []
        
        def run_task():
            start = time.time()
            try:
                output = self.agent.run(task)
                latency = time.time() - start
                cost = self.agent.last_token_cost()
                return {
                    'success': True,
                    'latency': latency,
                    'cost': cost,
                    'output_length': len(output)
                }
            except Exception as e:
                latency = time.time() - start
                return {
                    'success': False,
                    'latency': latency,
                    'error': str(e)
                }
        
        # Run requests concurrently, timing the whole batch for throughput
        batch_start = time.time()
        with concurrent.futures.ThreadPoolExecutor(max_workers=self.num_workers) as executor:
            futures = [executor.submit(run_task) for _ in range(num_requests)]
            results = [f.result() for f in concurrent.futures.as_completed(futures)]
        elapsed = time.time() - batch_start
        
        # Analyze
        successes = sum(1 for r in results if r['success'])
        latencies = [r['latency'] for r in results if r['success']]
        
        print(f"\n=== Load Test Results ===")
        print(f"Requests: {num_requests}, Workers: {self.num_workers}")
        print(f"Success rate: {successes}/{num_requests} = {successes/len(results):.1%}")
        # Throughput uses wall-clock time; summing per-request latencies
        # would ignore the overlap between concurrent requests
        print(f"Throughput: {successes / elapsed:.1f} req/sec")
        print(f"Mean latency: {mean(latencies):.2f}s")
        if len(latencies) > 1:
            print(f"Latency stdev: {stdev(latencies):.2f}s")
        print(f"P95 latency: {sorted(latencies)[int(len(latencies) * 0.95)]:.2f}s")
        print(f"P99 latency: {sorted(latencies)[int(len(latencies) * 0.99)]:.2f}s")
        
        return {
            'success_rate': successes / len(results),
            'throughput': successes / elapsed,
            'mean_latency': mean(latencies),
            'p95_latency': sorted(latencies)[int(len(latencies) * 0.95)]
        }

# Usage
def test_harness_capacity():
    """Can harness handle realistic load?"""
    load_tester = LoadTestAgent(my_agent, num_workers=5)
    
    results = load_tester.run_load_test(
        task="Summarize this article",
        num_requests=50
    )
    
    # Assertions
    assert results['success_rate'] >= 0.95, "Sub-95% success under load"
    assert results['p95_latency'] < 15.0, "P95 latency exceeds 15s"

Latency Benchmarking

Measure how fast your agent responds:

class LatencyBenchmark:
    """Detailed latency analysis"""
    
    def __init__(self, agent):
        self.agent = agent
    
    def benchmark_by_complexity(self, test_cases):
        """
        Test latency across different task complexities.
        
        test_cases: List of (complexity_name, task_prompt)
        """
        results = {}
        
        for complexity_name, task_prompt in test_cases:
            latencies = []
            for _ in range(20):
                start = time.time()
                self.agent.run(task_prompt)
                latencies.append(time.time() - start)
            
            results[complexity_name] = {
                'mean': mean(latencies),
                'min': min(latencies),
                'max': max(latencies),
                'stdev': stdev(latencies) if len(latencies) > 1 else 0
            }
        
        print("\n=== Latency by Task Complexity ===")
        for complexity, stats in results.items():
            print(f"\n{complexity}:")
            print(f"  Mean: {stats['mean']:.2f}s")
            print(f"  Range: {stats['min']:.2f}s – {stats['max']:.2f}s")
            print(f"  Stdev: {stats['stdev']:.2f}s")
        
        return results

# Usage
benchmark = LatencyBenchmark(my_agent)
results = benchmark.benchmark_by_complexity([
    ("simple", "Is the sky blue?"),
    ("moderate", "Explain the theory of relativity"),
    ("complex", "Design a distributed database system")
])

Cost Per Task Tracking

Monitor spending per inference:

class CostAnalyzer:
    """Track and analyze cost per task"""
    
    def __init__(self):
        self.cost_history = []
    
    def analyze_task_cost(self, task_description, results):
        """
        Analyze costs for a completed task.
        
        Args:
            task_description: What the task was
            results: List of RunResult objects
        """
        costs = [r.cost for r in results if r.success]
        
        if not costs:
            print("No successful runs; can't analyze cost")
            return
        
        total_cost = sum(costs)
        mean_cost = total_cost / len(costs)
        min_cost = min(costs)
        max_cost = max(costs)
        
        entry = {
            'task': task_description,
            'num_runs': len(costs),
            'total_cost': total_cost,
            'mean_cost': mean_cost,
            'range': (min_cost, max_cost),
            'timestamp': datetime.now().isoformat()
        }
        
        self.cost_history.append(entry)
        
        print(f"\n=== Cost Analysis ===")
        print(f"Task: {task_description}")
        print(f"Successful runs: {len(costs)}")
        print(f"Mean cost per run: ${mean_cost:.4f}")
        print(f"Range: ${min_cost:.4f} – ${max_cost:.4f}")
        print(f"Total cost: ${total_cost:.2f}")
        
        return entry
    
    def monthly_cost_forecast(self, requests_per_day=100):
        """Project monthly cost based on recent history"""
        if not self.cost_history:
            print("No cost history; can't forecast")
            return
        
        recent_costs = [e['mean_cost'] for e in self.cost_history[-10:]]
        avg_cost = mean(recent_costs)
        
        monthly_cost = avg_cost * requests_per_day * 30
        
        print(f"\n=== Monthly Cost Forecast ===")
        print(f"Based on recent average: ${avg_cost:.4f} per request")
        print(f"At {requests_per_day} requests/day:")
        print(f"  Daily: ${avg_cost * requests_per_day:.2f}")
        print(f"  Monthly: ${monthly_cost:.2f}")
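CostAnalyzer averages the cost of successful runs only. A related metric, cost per successful task, also charges failed attempts against the successes they were spent pursuing (total dollars spent across all runs divided by the number of successes). A minimal sketch of that definition (`cost_per_success` is a hypothetical helper, not part of the class above):

```python
def cost_per_success(costs: list, successes: list) -> float:
    """Total spend across ALL runs divided by the number of successes.

    Unlike averaging only successful runs, this charges failed
    attempts against the successes they were spent pursuing.
    """
    n_success = sum(successes)
    if n_success == 0:
        return float("inf")  # No successes: every dollar was wasted
    return sum(costs) / n_success

# 10 runs at $0.02 each, 8 succeeded: $0.20 / 8 = $0.025 per success
print(f"${cost_per_success([0.02] * 10, [True] * 8 + [False] * 2):.4f}")
```

The gap between the two views widens as the failure rate grows, which is exactly when cost estimates matter most.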

Part 6: Quality Metrics Framework

Core Metrics

Every harness should track these:

| Metric | Definition | Target | Tool |
|---|---|---|---|
| Task completion rate | % of requests that achieve goal | ≥90% | Multiple-run tests |
| Median latency (p50) | Median response time | <10s for interactive | Benchmarks |
| Latency p95 | 95th percentile (worst case) | <30s | Benchmarks |
| Cost per success | $ spent / successes | Budget-dependent | Cost analyzer |
| Hallucination rate | % of outputs containing false claims | <2% | Manual review + automated checks |
| Error rate | % of requests that error out | <5% | Error tracking |
| Tool success rate | % of tool calls that succeed | ≥95% | Tool logs |

Sample Metrics Dashboard

class MetricsDashboard:
    """Collect and display all key metrics"""
    
    def __init__(self):
        self.metrics = {
            'task_completion': [],
            'latencies': [],
            'costs': [],
            'errors': [],
            'hallucinations': []
        }
    
    def record_task(self, success, latency, cost, error=None, hallucination=False):
        """Record one task execution"""
        self.metrics['task_completion'].append(success)
        if latency:
            self.metrics['latencies'].append(latency)
        if cost:
            self.metrics['costs'].append(cost)
        if error:
            self.metrics['errors'].append(error)
        if hallucination:
            self.metrics['hallucinations'].append(1)
    
    def display_dashboard(self):
        """Print current metrics"""
        completions = self.metrics['task_completion']
        latencies = self.metrics['latencies']
        costs = self.metrics['costs']
        
        print(f"\n{'='*50}")
        print(f"{'HARNESS METRICS DASHBOARD':^50}")
        print(f"{'='*50}")
        
        if completions:
            comp_rate = sum(completions) / len(completions)
            print(f"\nTask Completion: {comp_rate:.1%} ({sum(completions)}/{len(completions)})")
        
        if latencies:
            ordered = sorted(latencies)
            print(f"\nLatency:")
            print(f"  Median (p50): {ordered[len(ordered)//2]:.2f}s")
            print(f"  P95: {ordered[int(len(ordered)*0.95)]:.2f}s")
            print(f"  P99: {ordered[int(len(ordered)*0.99)]:.2f}s")
        
        if costs:
            total = sum(costs)
            print(f"\nCosts:")
            print(f"  Total: ${total:.2f}")
            print(f"  Mean/task: ${total/len(costs):.4f}")
            print(f"  Monthly (100 req/day): ${(total/len(costs)) * 100 * 30:.2f}")
        
        if self.metrics['errors']:
            error_rate = len(self.metrics['errors']) / len(completions)
            print(f"\nErrors: {error_rate:.1%}")
        
        print(f"\n{'='*50}\n")

Part 7: Pre-Deployment Checklist

Before shipping an agent to production:

Smoke Tests (Basic Functionality)

class SmokeTests:
    """Quick validation that agent works at all"""
    
    def run_all(self, agent):
        """Run smoke tests"""
        checks = [
            ("Agent initializes", self.test_initialization),
            ("Tools available", self.test_tools_available),
            ("Can process input", self.test_basic_task),
            ("Can handle error", self.test_error_handling),
            ("Memory persists", self.test_memory),
            ("Cost tracking works", self.test_cost_tracking)
        ]
        
        print("\n=== SMOKE TESTS ===\n")
        
        passed = 0
        for name, test_fn in checks:
            try:
                test_fn(agent)
                print(f"✓ {name}")
                passed += 1
            except Exception as e:
                print(f"✗ {name}: {e}")
        
        print(f"\nResult: {passed}/{len(checks)} passed")
        return passed == len(checks)
    
    def test_initialization(self, agent):
        assert agent is not None
        assert agent.model is not None
    
    def test_tools_available(self, agent):
        tools = agent.available_tools()
        assert len(tools) > 0
    
    def test_basic_task(self, agent):
        result = agent.run("Say hello")
        assert result is not None
        assert len(result) > 0
    
    def test_error_handling(self, agent):
        # Intentional error; should handle gracefully
        try:
            agent.run("undefined_command()")
        except Exception as e:
            # Expected: Error should be handled
            assert "error" in str(e).lower() or agent.last_error is not None
    
    def test_memory(self, agent):
        agent.run("Remember: test_value = 123")
        result = agent.run("What did I tell you to remember?")
        assert "123" in result or "test_value" in result.lower()
    
    def test_cost_tracking(self, agent):
        agent.run("test")
        cost = agent.total_cost()
        assert isinstance(cost, (int, float))
        assert cost >= 0

Performance Baseline Validation

class PreDeploymentValidation:
    """Comprehensive pre-deployment checks"""
    
    def validate_all(self, agent, baseline_manager):
        """Run complete pre-deployment validation"""
        checks = []
        
        # Smoke tests
        print("\n1. Running smoke tests...")
        smoke = SmokeTests()
        if not smoke.run_all(agent):
            print("✗ Smoke tests failed; do not deploy")
            return False
        
        # Quality metrics
        print("\n2. Validating quality metrics...")
        suite = AgentTestSuite(agent, num_runs=50)
        results = suite.run_test(
            "Main task",
            validation_fn=lambda x: True  # Placeholder; replace with real validation
        )
        metrics = suite.analyze_results(results, "Pre-deployment")
        
        if metrics['success_rate'] < 0.85:
            print(f"✗ Success rate {metrics['success_rate']:.1%} below 85% threshold")
            return False
        
        # Regression check
        print("\n3. Checking for regressions...")
        detector = RegressionDetector(baseline_manager)
        if not detector.test_for_regression("main_test", results):
            print("✗ Regression detected; investigate before deploying")
            return False
        
        # Cost validation
        print("\n4. Validating cost is within budget...")
        daily_budget = 100.0  # dollars
        estimated_daily_cost = metrics['cost_per_success'] * 1000  # Assume 1000 daily requests
        
        if estimated_daily_cost > daily_budget:
            print(f"✗ Estimated daily cost ${estimated_daily_cost:.2f} exceeds budget ${daily_budget}")
            return False
        
        print(f"✓ Estimated daily cost: ${estimated_daily_cost:.2f} (within budget)")
        
        # Security review
        print("\n5. Security review...")
        security_passed = self.validate_security(agent)
        if not security_passed:
            print("✗ Security issues found")
            return False
        
        # Logging configured
        print("\n6. Checking logging...")
        if not agent.logger:
            print("✗ Logging not configured")
            return False
        
        print("✓ All checks passed; ready to deploy")
        return True
    
    def validate_security(self, agent):
        """Check security constraints"""
        # These are examples; customize for your harness
        security_checks = [
            ("API keys not in code", lambda a: not self.has_hardcoded_secrets(a)),
            ("Input validation enabled", lambda a: a.validate_input),
            ("Output sanitization enabled", lambda a: a.sanitize_output),
            ("Rate limiting configured", lambda a: a.rate_limiter is not None)
        ]
        
        all_passed = True
        for check_name, check_fn in security_checks:
            try:
                if check_fn(agent):
                    print(f"  ✓ {check_name}")
                else:
                    print(f"  ✗ {check_name}")
                    all_passed = False
            except Exception as e:
                print(f"  ? {check_name}: {e}")
        
        return all_passed
    
    def has_hardcoded_secrets(self, agent):
        """Check for hardcoded API keys, tokens, etc."""
        import inspect
        source = inspect.getsource(agent)
        secret_patterns = [
            r'sk_live_',  # Stripe
            r'api[_-]?key\s*=\s*["\']',  # Generic API key
            r'password\s*=\s*["\']',  # Password
        ]
        
        import re
        for pattern in secret_patterns:
            if re.search(pattern, source):
                return True  # Found secret
        
        return False  # No secrets found

Part 8: Continuous Testing Strategy

Testing Frequency

| Test Type | Frequency | Purpose |
|---|---|---|
| Unit tests (tools) | Per commit | Catch regressions early |
| Smoke tests | Daily | Basic functionality |
| Quality metrics | Weekly | Track trends |
| Regression tests | After model update, per release | Detect degradation |
| Load tests | Monthly | Capacity planning |
| Full validation | Before deployment | Go/no-go decision |

class ContinuousTestingSchedule:
    """Automated testing schedule"""
    
    def run_daily_tests(self):
        """Run every day at 8 AM"""
        print("=== Daily Tests ===")
        
        # Smoke tests
        smoke = SmokeTests()
        smoke.run_all(agent)
        
        # Quick quality check (smaller sample)
        suite = AgentTestSuite(agent, num_runs=20)
        results = suite.run_test("task", validation_fn)
        metrics = suite.analyze_results(results, "Daily quality check")
    
    def run_weekly_tests(self):
        """Run every Monday"""
        print("=== Weekly Tests ===")
        
        # Full quality metrics (larger sample)
        suite = AgentTestSuite(agent, num_runs=100)
        results = suite.run_test("task", validation_fn)
        metrics = suite.analyze_results(results, "Weekly quality")
        
        # Alert if degraded
        if metrics['success_rate'] < 0.85:
            send_alert(f"Quality degradation: {metrics['success_rate']:.1%}")
    
    def run_pre_release_tests(self):
        """Before deploying new version"""
        print("=== Pre-Release Validation ===")
        
        validator = PreDeploymentValidation()
        go_nogo = validator.validate_all(agent, baseline_manager)
        
        if go_nogo:
            print("✓ APPROVED FOR RELEASE")
            return True
        else:
            print("✗ BLOCKED: Fix issues before release")
            return False

Part 9: Failure Case Testing

Common Failure Modes

class FailureModeTests:
    """Test how agent handles predictable failures"""
    
    def test_tool_failure_recovery(self):
        """Tool fails; agent should retry or fallback"""
        from unittest.mock import patch
        
        with patch('harness.tools.web_search', side_effect=Exception("Network error")):
            agent = MyAgent()
            result = agent.run("Search for something")
            
            # Agent should:
            # 1. Catch the error
            # 2. Log it
            # 3. Retry or fallback
            assert agent.has_logged_error
            assert agent.fallback_used or agent.retry_count > 0
    
    def test_model_hallucination(self):
        """Detect when model makes up facts"""
        from harness.validation import check_hallucination
        
        # Run agent on task where ground truth is known
        ground_truth = {"capital_of_france": "Paris"}
        result = agent.run("What is the capital of France?")
        
        # Validate factual claims
        hallucinated = check_hallucination(result, ground_truth)
        
        if hallucinated:
            print(f"Agent hallucinated: {hallucinated}")
            # Log incident
            agent.log_hallucination(result, hallucinated)
    
    def test_context_overflow(self):
        """Context window full; agent should handle"""
        # Fill context with large text
        large_prompt = "X" * 50000
        
        agent = MyAgent(context_limit=4096)
        result = agent.run(large_prompt)
        
        # Should either:
        # 1. Summarize to fit context
        # 2. Reject gracefully
        # 3. Chunk and process in parts
        
        assert result is not None or agent.error_type == "context_overflow"
    
    def test_budget_overrun_protection(self):
        """Cost exceeds budget; should stop"""
        agent = MyAgent(budget=10.0)  # $10 budget
        
        expensive_task = "Solve a very complex problem " * 1000
        
        with pytest.raises(BudgetExceededError):
            agent.run(expensive_task)
        
        # Verify agent stopped cleanly
        assert agent.total_cost() <= agent.budget * 1.05  # Allow 5% overage for the final request
    
    def test_iteration_limit(self):
        """Agent stuck in loop; iteration limit stops it"""
        # Create scenario where agent repeats same action
        agent = MyAgent(max_iterations=5)
        
        result = agent.run("Impossible task")
        
        # Should stop after 5 iterations
        assert agent.iteration_count <= 5
        assert "iteration limit" in agent.status.lower() or result is None

Part 10: Test Fixtures & Sample Test Suite

Creating Representative Test Cases

class TestDatasets:
    """Curated test cases for validation"""
    
    @staticmethod
    def get_code_review_tests():
        """Test cases for code review agent"""
        return [
            {
                "name": "Obvious bug",
                "code": "def divide(a, b):\n    return a / b  # Bug: no zero check",
                "expected_findings": ["zero", "divide", "error"]
            },
            {
                "name": "Logic error",
                "code": "if x > 0:\n    print('negative')",
                "expected_findings": ["logic", "contradiction"]
            },
            {
                "name": "Memory leak",
                "code": "class Handler:\n    def __init__(self):\n        self.listeners = []  # Never cleared",
                "expected_findings": ["leak", "memory", "clear"]
            }
        ]
    
    @staticmethod
    def get_writing_tests():
        """Test cases for writing/summary agent"""
        return [
            {
                "name": "Technical summary",
                "content": "Quantum computing uses quantum bits...",
                "validation": lambda out: len(out) > 50 and "quantum" in out.lower()
            },
            {
                "name": "Simplified explanation",
                "content": "Einstein's theory of relativity states...",
                "validation": lambda out: len(out) > 50 and "simple" not in out.lower()  # Should explain without labeling itself "simple"
            }
        ]

# Complete sample test suite
class SampleHarnessTestSuite:
    """Ready-to-use test suite for your harness"""
    
    def __init__(self, agent):
        self.agent = agent
        self.baseline_manager = BaselineManager()
    
    def run_all(self):
        """Run complete test suite"""
        print("\n" + "="*60)
        print("COMPREHENSIVE HARNESS TEST SUITE")
        print("="*60)
        
        # 1. Unit tests
        print("\n[1/5] Unit Tests (Tools)")
        self.run_unit_tests()
        
        # 2. Smoke tests
        print("\n[2/5] Smoke Tests")
        smoke = SmokeTests()
        smoke_passed = smoke.run_all(self.agent)
        
        if not smoke_passed:
            print("Smoke tests failed; aborting further tests")
            return False
        
        # 3. Quality metrics
        print("\n[3/5] Quality Metrics (50 runs)")
        suite = AgentTestSuite(self.agent, num_runs=50)
        results = suite.run_test(
            "Main task",
            validation_fn=self.main_validation
        )
        metrics = suite.analyze_results(results, "Main task")
        
        # 4. Regression detection
        print("\n[4/5] Regression Detection")
        detector = RegressionDetector(self.baseline_manager)
        regression_passed = detector.test_for_regression("main", results)
        
        # 5. Pre-deployment validation
        print("\n[5/5] Pre-Deployment Validation")
        validator = PreDeploymentValidation()
        deployment_ready = validator.validate_all(self.agent, self.baseline_manager)
        
        # Summary
        print("\n" + "="*60)
        print("TEST SUITE SUMMARY")
        print("="*60)
        print(f"Smoke tests: {'PASS' if smoke_passed else 'FAIL'}")
        print(f"Quality metrics: {metrics['success_rate']:.1%}")
        print(f"Regression check: {'PASS' if regression_passed else 'FAIL'}")
        print(f"Deployment ready: {'YES' if deployment_ready else 'NO'}")
        print("="*60 + "\n")
        
        return deployment_ready
    
    def run_unit_tests(self):
        """Run tool unit tests"""
        test_tools = TestWebSearchTool()
        test_tools.test_search_success()
        test_tools.test_search_input_validation()
        print("✓ Web search tool tests passed")
        
        test_code = TestCodeExecutionTool()
        test_code.test_execute_valid_python()
        test_code.test_execute_blocked_operations()
        print("✓ Code execution tool tests passed")
    
    def main_validation(self, output):
        """Customize this for your agent"""
        return output is not None and len(output) > 10

# Usage: Run complete test suite
suite = SampleHarnessTestSuite(my_agent)
ready_to_deploy = suite.run_all()

if ready_to_deploy:
    print("✓ Agent approved for production deployment")
else:
    print("✗ Agent requires fixes before deployment")

Key Takeaways

For Testing Non-Deterministic Systems

  • Don’t test individual runs: Collect statistics over N runs (typically 50–100)
  • Measure success rate, not pass/fail: “Agent succeeds 94% of time” is the metric
  • Track failure modes: Understand why runs fail, not just that they do
  • Use seeds strategically: Set temperature=0.0 only for debugging specific failures

For Regression Detection

  • Establish baselines first: Run test N times when creating it, save metrics
  • Test frequently: Daily smoke tests, weekly full tests, always before releases
  • Alert on degradation: Success rate drops 5%+, latency increases 20%+
  • Track causes: Log what changes triggered regressions (model update, config, tools)
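The degradation thresholds above (success rate down 5+ percentage points, latency up 20%+) can be encoded as a small check against the saved baseline; this is a sketch with hypothetical metric keys, not the RegressionDetector API:

```python
def check_degradation(baseline: dict, current: dict) -> list:
    """Flag regressions per the thresholds above (hypothetical metric keys).

    Alerts when success rate drops more than 5 percentage points, or
    p95 latency grows more than 20%, versus the saved baseline.
    """
    alerts = []
    if current["success_rate"] < baseline["success_rate"] - 0.05:
        alerts.append(
            f"success rate {current['success_rate']:.1%} "
            f"(baseline {baseline['success_rate']:.1%})"
        )
    if current["latency_p95"] > baseline["latency_p95"] * 1.20:
        alerts.append(
            f"p95 latency {current['latency_p95']:.1f}s "
            f"(baseline {baseline['latency_p95']:.1f}s)"
        )
    return alerts

print(check_degradation(
    {"success_rate": 0.94, "latency_p95": 12.0},
    {"success_rate": 0.86, "latency_p95": 13.0},
))  # flags the 8-point success drop; the ~8% latency increase is within threshold
```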

For Pre-Deployment

  • Run smoke tests: Verify basic functionality
  • Check quality metrics: Success rate ≥90%, cost within budget
  • Run regression tests: Compare to baseline, ensure no degradation
  • Security review: No hardcoded secrets, input validation, rate limiting
  • Load test: Verify handles expected concurrency

Tools & Patterns

| Purpose | Pattern | Example |
|---|---|---|
| Unit test tools | Mock LLM, test tool in isolation | TestWebSearchTool class |
| Integration test | End-to-end workflow, verify output quality | TestAgentEndToEnd class |
| Load testing | concurrent.futures, measure throughput | LoadTestAgent class |
| Regression detection | Baseline + statistical comparison | RegressionDetector class |
| Metrics dashboard | Collect and display key stats | MetricsDashboard class |
| Pre-deployment | Comprehensive checklist | PreDeploymentValidation class |

Related Documents

  • 06_harness_architecture.md — Component failure modes, recovery strategies
  • 05_ai_agents.md — Iteration limits, handling stuck agents
  • 08_claw_code_python.md — Instrumentation patterns, logging
  • 04_memory_systems.md — Memory validation, persistence testing

References & Further Reading

  • Paper: “Is RLHF Required for Human Preference Alignment?” (Zyg et al., 2024) — On measuring alignment quality statistically
  • Tool: lm-evaluation-harness (EleutherAI) — Standard benchmarking framework
  • Tool: pytest — Python testing framework used in examples
  • Pattern: “Chaos engineering for LLMs” — Intentionally failing components to test resilience
  • Resource: GitHub’s “Software Performance Testing” guide — Load testing strategies

Document Metadata

Created: April 2026
Scope: Testing and QA strategies for AI harnesses
Audience: Engineers building and deploying harnesses
Key concepts: Non-deterministic testing, regression detection, quality metrics, pre-deployment validation
Code examples: 25+ runnable Python patterns
Estimated reading time: 2–3 hours
Estimated implementation time: 1–2 weeks (build test infrastructure)


Next step: Pick one testing pattern that matters most to your harness (unit tests, regression detection, or pre-deployment checklist) and implement it this week. The infrastructure you build compounds; every test reduces deployment risk.


Part 11: Pre-Production Acceptance Criteria Template

Go/No-Go Decision Template

Copy this checklist into your release process. Every item must be checked before deploying to production. Fill in the measured values and compare against thresholds.

# Pre-Production Acceptance Criteria
# Date: ____________________
# Version: _________________
# Reviewer: ________________

## Quality Gates

| Gate                        | Threshold       | Measured   | Pass? |
|-----------------------------|-----------------|------------|-------|
| Task success rate           | >= 90%          | ____%      | [ ]   |
| Hallucination rate          | <= 2%           | ____%      | [ ]   |
| Tool call success rate      | >= 95%          | ____%      | [ ]   |
| Error rate (unhandled)      | <= 5%           | ____%      | [ ]   |
| Regression vs baseline      | No degradation  | +/- ____%  | [ ]   |

## Latency Thresholds

| Metric                      | Threshold       | Measured   | Pass? |
|-----------------------------|-----------------|------------|-------|
| Step latency p50            | < 5s            | ____s      | [ ]   |
| Step latency p95            | < 15s           | ____s      | [ ]   |
| End-to-end session p95      | < 60s           | ____s      | [ ]   |
| Tool call latency p95       | < 10s           | ____s      | [ ]   |

## Cost Ceilings

| Metric                      | Threshold       | Measured   | Pass? |
|-----------------------------|-----------------|------------|-------|
| Cost per successful task    | < $0.50         | $____      | [ ]   |
| Projected daily cost        | < $____budget   | $____      | [ ]   |
| Projected monthly cost      | < $____budget   | $____      | [ ]   |

## Security & Operations

| Check                                         | Pass? |
|-----------------------------------------------|-------|
| No hardcoded secrets in codebase              | [ ]   |
| Input validation enabled                      | [ ]   |
| Rate limiting configured                      | [ ]   |
| Structured logging active                     | [ ]   |
| Budget enforcement (hard limit) active        | [ ]   |
| Health check endpoints responding             | [ ]   |
| Alerting configured (see Doc 09)              | [ ]   |

## Decision

[ ] GO — All gates passed, approved for production
[ ] NO-GO — Failures listed below must be resolved

Blocking issues:
1. _______________________________________________
2. _______________________________________________

How Test Results Feed Into Monitoring (Doc 09)

The quality gates above establish baseline thresholds that production monitoring should enforce continuously:

# Map test acceptance criteria to production alerts
ACCEPTANCE_TO_ALERTS = {
    # Test gate             -> Production alert rule
    "success_rate >= 90%":    "alert if harness_success_rate < 0.90 for 5m",
    "p95_latency < 15s":     "alert if harness_latency_step_p95_ms > 15000 for 5m",
    "error_rate <= 5%":      "alert if harness_errors_rate > 0.05 for 5m",
    "cost_per_task < $0.50": "alert if harness_cost_per_session_usd > 0.50",
}

# After running acceptance tests, generate a Prometheus alerting config
def generate_alert_rules(acceptance_results: dict) -> str:
    """Convert acceptance criteria into Prometheus alert rules."""
    rules = []
    rules.append("groups:")
    rules.append("- name: harness_acceptance")
    rules.append("  rules:")

    if acceptance_results["success_rate"] >= 0.90:
        rules.append("  - alert: SuccessRateBelowBaseline")
        rules.append(f"    expr: harness_success_rate < {acceptance_results['success_rate'] - 0.05}")
        rules.append("    for: 5m")
        rules.append("    labels: { severity: critical }")

    if acceptance_results["p95_latency"]:
        threshold_ms = int(acceptance_results["p95_latency"] * 1000 * 1.2)  # 20% headroom
        rules.append("  - alert: LatencyAboveBaseline")
        rules.append(f"    expr: harness_latency_step_p95_ms > {threshold_ms}")
        rules.append("    for: 5m")
        rules.append("    labels: { severity: warning }")

    return "\n".join(rules)

How to Calculate Statistical Significance of Test Results

With non-deterministic systems, you need to know whether a measured success rate of 88% is genuinely below your 90% threshold or just noise. Use a binomial proportion confidence interval:

import math

def is_statistically_significant(
    successes: int,
    total_runs: int,
    threshold: float,
    confidence: float = 0.95
) -> dict:
    """
    Determine if measured success rate is significantly below threshold.

    Args:
        successes: Number of successful runs
        total_runs: Total number of runs
        threshold: Required success rate (e.g., 0.90)
        confidence: Confidence level (default 95%)

    Returns:
        Dict with measured rate, confidence interval, and verdict
    """
    p_hat = successes / total_runs

    # Z-score for confidence level (1.96 for 95%, 2.58 for 99%)
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.58}.get(confidence, 1.96)

    # Wilson score interval (better than normal approximation for small N)
    denominator = 1 + z**2 / total_runs
    centre = (p_hat + z**2 / (2 * total_runs)) / denominator
    margin = z * math.sqrt((p_hat * (1 - p_hat) + z**2 / (4 * total_runs)) / total_runs) / denominator

    lower_bound = centre - margin
    upper_bound = centre + margin

    # Verdict: fail only if upper bound of confidence interval is below threshold
    significant_failure = upper_bound < threshold

    return {
        "measured_rate": round(p_hat, 4),
        "confidence_interval": (round(lower_bound, 4), round(upper_bound, 4)),
        "threshold": threshold,
        "significant_failure": significant_failure,
        "verdict": "FAIL (significant)" if significant_failure else "PASS (within noise)",
        "recommendation": f"Run {max(100, total_runs * 2)} tests for tighter bounds" if not significant_failure and p_hat < threshold else None
    }

# Usage examples
print(is_statistically_significant(successes=44, total_runs=50, threshold=0.90))
# measured_rate=0.88, CI≈(0.762, 0.944), verdict="PASS (within noise)"
# 88% is below 90%, but the CI includes 90% — could be noise with only 50 runs

print(is_statistically_significant(successes=82, total_runs=100, threshold=0.90))
# measured_rate=0.82, CI≈(0.733, 0.883), verdict="FAIL (significant)"
# 82% with 100 runs — CI upper bound 88.3% is below 90%, a genuine regression

Rule of thumb for sample sizes:

  • 50 runs: Can detect 10%+ regressions with 95% confidence
  • 100 runs: Can detect 5%+ regressions with 95% confidence
  • 200 runs: Can detect 3%+ regressions with 95% confidence

If you need to distinguish 89% from 91%, you need at least 200 runs. For most go/no-go decisions, 50-100 runs is sufficient.
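The rule of thumb above follows from the margin of error of a proportion estimate; a quick normal-approximation sketch lets you check it for your own run count and expected success rate:

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the 95% normal-approximation CI for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

# Detectable regression size at various sample counts, assuming ~90% success
for n in (50, 100, 200):
    print(f"N={n}: ±{margin_of_error(0.90, n):.1%}")
# Prints roughly ±8% at N=50, ±6% at N=100, and ±4% at N=200,
# the same order as the 10% / 5% / 3% rule of thumb above
```

For go/no-go gates near a threshold, prefer the Wilson interval shown earlier; the normal approximation is only a quick sizing heuristic.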


See Also

  • Doc 09 (Operations & Observability) — Monitoring is continuous testing; quality metrics in production feed back into test baselines
  • Doc 06 (Harness Architecture) — Understand the components you’re testing; testing strategy differs per component
  • Doc 05 (AI Agents) — Understand agentic behavior for non-deterministic testing frameworks
  • Doc 16 (Evaluation & Benchmarking) — Formal evaluation methods for complex reasoning; extends testing to production metrics