
Building a Python Agent Harness

Building an AI agent harness in Python — dual-layer architecture, multi-provider LLM support, tool registry patterns, and production optimisation techniques.

This guide covers common production patterns for building a Python-based AI agent harness. These patterns are drawn from established open-source agent frameworks and represent proven approaches to agent orchestration, tool management, and multi-provider LLM integration.

Architecture at a Glance

Dual-layer design optimised for speed and maintainability:

┌─────────────────────────────────────────────────┐
│  Python Layer                                   │
│  - Agent orchestration & LLM integration        │
│  - Session management, command routing          │
│  - Tool registry & higher-level logic           │
│  → Easy to extend, modify, learn from          │
└─────────────────────────────────────────────────┘
          ↓ (calls for performance-critical ops)
┌─────────────────────────────────────────────────┐
│  Compiled Layer (Rust, C++, or Go)              │
│  - Terminal rendering, protocol handling        │
│  - Tool execution runtime, memory-safe ops      │
│  - Zero-dependency JSON parsing                 │
│  → Fast, safe, production-ready                │
└─────────────────────────────────────────────────┘

Why this split:

  • Python: Rapid iteration, readable, LLM-friendly
  • Compiled language: Performance critical, memory safety, zero dependencies
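As a sketch of how the boundary can work in practice, the Python layer can shell out to a compiled helper and fall back to pure Python when the binary is absent. `json_fastparse` here is a hypothetical name for a tool built from the compiled layer, not a real utility:

```python
# Sketch only: `json_fastparse` is a hypothetical binary built from the
# compiled layer; the Python side degrades gracefully without it.
import json
import subprocess

def parse_large_json(path: str) -> dict:
    """Prefer the compiled parser; fall back to the stdlib."""
    try:
        out = subprocess.run(
            ["json_fastparse", path],
            capture_output=True, text=True, check=True,
        )
        return json.loads(out.stdout)
    except (FileNotFoundError, subprocess.CalledProcessError):
        # Pure-Python fallback keeps the harness usable everywhere
        with open(path) as f:
            return json.load(f)
```

The fallback pattern means the compiled layer stays an optimisation, never a hard dependency.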

Core Components

A well-structured Python agent harness contains these key modules:

Module               Purpose                   Extend for…
main.py              CLI entry point           Adding new slash commands
agent_runtime.py     Main agent loop           Core reasoning patterns
agent_tools.py       Tool definitions          Adding new tools
agent_registry.py    Built-in tool registry    Custom tool discovery
agent_session.py     Session state             Memory management
models.py            LLM message structures    Custom message types

Multi-Provider LLM Support

A key advantage of building your own harness is abstracting the LLM provider:

# Choose your LLM with simple config
models = {
    "claude": "claude-sonnet-4",              # Anthropic (model IDs include date suffixes that change with releases)
    "openai": "gpt-4o",                       # OpenAI
    "gemini": "gemini-2.0-flash",             # Google
    "ollama": "qwen2.5:27b",                  # Local
    "huggingface": "meta-llama/Llama-3-70b"   # HF API
}
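One lightweight way to use such a table is a resolver that accepts either a bare provider key or an explicit `provider/model` spec. The names below follow the config above; adapt to your own:

```python
# Default model per provider (mirrors the config table above)
MODELS = {
    "claude": "claude-sonnet-4",
    "openai": "gpt-4o",
    "gemini": "gemini-2.0-flash",
    "ollama": "qwen2.5:27b",
}

def resolve_model(spec: str) -> tuple[str, str]:
    """Return (provider, model) from 'provider' or 'provider/model'."""
    if "/" in spec:
        provider, model = spec.split("/", 1)
        return provider, model
    # Bare provider key: fall back to its default model
    return spec, MODELS[spec]
```

For example, `resolve_model("ollama")` uses the provider's default, while `resolve_model("ollama/llama3:8b")` overrides it explicitly.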

Cost implications:

  • Claude 3.5 Sonnet: ~$3/1M input tokens (~$15/1M output)
  • GPT-4o: ~$2.50/1M input tokens (~$10/1M output)
  • Local Qwen 27B (4-bit): $0/token + electricity (~$10/month)

Common Tool Categories

A production agent harness typically includes tools across these categories:

File & Directory Operations

  • file_read — Read file contents
  • file_write — Create or update files
  • file_delete — Delete files
  • file_list — List directory contents
  • file_search — Grep across codebase

Code Execution

  • bash_exec — Run shell commands (with security checks)
  • git_operations — Clone, commit, push, pull
  • lsp_integration — Code intelligence (definitions, references, hover)

External APIs & Knowledge

  • web_scrape — Fetch & parse web content
  • web_search — Search engines (Bing/Google)
  • http_request — Make HTTP calls

Agent Spawning

  • spawn_agent — Launch subagents
  • agent_manage — List, stop, inspect agents

Advanced

  • mcp_server_list — List connected MCP servers
  • canvas_render — Terminal-based rich output
  • oauth_flow — Handle authentication (PKCE)
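However the categories are organised, dispatch usually reduces to a registry mapping tool names to handlers. A minimal sketch (the handlers shown are illustrative, covering two of the file tools above):

```python
from pathlib import Path

# Minimal registry sketch: tool name → handler taking a dict of arguments
TOOL_HANDLERS = {
    "file_read": lambda args: Path(args["path"]).read_text(),
    "file_list": lambda args: "\n".join(
        sorted(p.name for p in Path(args.get("path", ".")).iterdir())
    ),
}

def dispatch(name: str, args: dict) -> str:
    """Look up and run a tool; unknown names return an error string."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return f"Unknown tool: {name}"
    return handler(args)
```

Returning an error string (rather than raising) keeps tool failures inside the agent loop, where the model can react to them.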

Model Context Protocol (MCP) Integration

Extend tool capabilities via MCP servers. Common transport types:

# stdio: Local process
mcp_servers = [
    {
        "transport": "stdio",
        "command": "python",
        "args": ["-m", "my_mcp_server"]
    }
]

# HTTP: Remote service
mcp_servers = [
    {
        "transport": "http",
        "url": "http://localhost:3000/mcp"
    }
]

# WebSocket, SSE, SDK also supported

Benefit: Drop in any MCP server without modifying your harness — the harness discovers all of the server's tools automatically.
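Before connecting, it is worth validating entries, since each transport requires different fields. A small checker following the config shapes above:

```python
def validate_mcp_config(cfg: dict) -> list[str]:
    """Return a list of problems with one MCP server entry (empty = OK)."""
    problems = []
    transport = cfg.get("transport")
    if transport == "stdio":
        # Local process transport needs a command to launch
        if "command" not in cfg:
            problems.append("stdio transport requires 'command'")
    elif transport in {"http", "websocket", "sse"}:
        # Remote transports need an endpoint
        if "url" not in cfg:
            problems.append(f"{transport} transport requires 'url'")
    else:
        problems.append(f"unknown transport: {transport!r}")
    return problems
```

Running this at startup turns a confusing mid-session connection failure into an immediate, actionable error.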

Choosing Your Approach

Feature            Vendor CLI (e.g. Claude Code)    Custom Python Harness
Installation       Single command                   Git + Python setup
Model Support      Vendor-locked                    Any provider + local
IDE Integration    Deep (VS Code, JetBrains)        Terminal or custom
Extensibility      Limited                          Filesystem + MCP
Cost               Subscription                     Pay-as-you-go
Maturity           Production-tested                You control quality
Self-hosting       No                               Yes
Learning Value     Opaque                           Full transparency

Performance Characteristics

Speed (Tokens/Second):

  • Qwen 27B (RTX 4090, quantised): ~40 tokens/sec
  • Claude 3.5 Sonnet (API): 80-150 tokens/sec
  • Ollama 7B (M1 MacBook): 15-20 tokens/sec

Memory Usage:

  Recommendations by hardware:

  • 24GB+ VRAM: Safe to use 128K context
  • 16GB VRAM: Keep to 32K context (larger contexts degrade output quality)
  • 8GB VRAM: Stay at 8K context (severely constrained)

Cost Analysis (typical developer, 2M tokens/month):

Strategy                 Monthly Cost
Cloud only (Sonnet)      $60
Local only (Qwen)        $0 (+electricity $10-20)
Hybrid routing           $12-30 (up to 80-90% reduction when most requests route locally)

Hardware ROI:

  • RTX 4090: ~$1,500; at heavy API usage (~$150-200/month of cloud spend replaced) it pays for itself in roughly 8-10 months
  • Remains cost-effective over a realistic 3-5 year hardware lifespan
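The break-even arithmetic is simple enough to sanity-check against your own usage (the figures below are illustrative inputs, not claims):

```python
def breakeven_months(hardware_usd: float, cloud_usd_per_month: float,
                     local_usd_per_month: float) -> float:
    """Months until local hardware pays for itself vs ongoing cloud spend."""
    monthly_saving = cloud_usd_per_month - local_usd_per_month
    if monthly_saving <= 0:
        raise ValueError("local running cost must undercut cloud spend")
    return hardware_usd / monthly_saving
```

At $200/month of cloud spend replaced and ~$15/month electricity, `breakeven_months(1500, 200, 15)` is about 8 months; at the lighter $60/month profile in the cost table it stretches to about 33 months.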

Building Your Own Harness: Practical Patterns

Pattern 1: Basic Single-Agent Harness

# my_harness.py
from pathlib import Path
from anthropic import Anthropic

class MyHarness:
    def __init__(self, model: str = "claude-sonnet-4"):  # model IDs include date suffixes that change with releases
        self.client = Anthropic()
        self.model = model
        self.conversation_history = []
        self.tools = self._define_tools()
    
    def _define_tools(self):
        """Define available tools"""
        return [
            {
                "name": "read_file",
                "description": "Read contents of a file",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "path": {
                            "type": "string",
                            "description": "File path to read"
                        }
                    },
                    "required": ["path"]
                }
            },
            {
                "name": "write_file",
                "description": "Write contents to a file",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "path": {"type": "string"},
                        "content": {"type": "string"}
                    },
                    "required": ["path", "content"]
                }
            },
            {
                "name": "bash_exec",
                "description": "Execute bash command",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "command": {
                            "type": "string",
                            "description": "Command to execute"
                        }
                    },
                    "required": ["command"]
                }
            }
        ]
    
    async def run(self, task: str) -> str:
        """Execute task with agentic loop"""
        self.conversation_history.append({
            "role": "user",
            "content": task
        })
        
        while True:
            # Call LLM
            response = self.client.messages.create(
                model=self.model,
                max_tokens=4096,
                tools=self.tools,
                messages=self.conversation_history
            )
            
            # Collect every tool call in this turn (there may be several)
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": self._execute_tool(block.name, block.input)
                    })
            
            # Record the assistant turn exactly once
            self.conversation_history.append({
                "role": "assistant",
                "content": response.content
            })
            
            if not tool_results:
                # Final response: return the first text block
                return next(
                    block.text for block in response.content
                    if block.type == "text"
                )
            
            self.conversation_history.append({
                "role": "user",
                "content": tool_results
            })
    
    def _execute_tool(self, tool_name: str, tool_input: dict) -> str:
        """Execute a tool and return result"""
        try:
            if tool_name == "read_file":
                path = Path(tool_input["path"])
                return path.read_text()
            
            elif tool_name == "write_file":
                path = Path(tool_input["path"])
                path.write_text(tool_input["content"])
                return f"Wrote {len(tool_input['content'])} characters to {path}"
            
            elif tool_name == "bash_exec":
                import subprocess
                result = subprocess.run(
                    tool_input["command"],
                    shell=True,
                    capture_output=True,
                    text=True
                )
                return f"Exit code: {result.returncode}\n{result.stdout}{result.stderr}"
            
            # Fall through: the model asked for a tool we don't have
            return f"Unknown tool: {tool_name}"
        
        except Exception as e:
            return f"Error: {str(e)}"

# Usage
if __name__ == "__main__":
    import asyncio
    
    harness = MyHarness()
    task = "Create a Python script that generates the first 10 Fibonacci numbers"
    result = asyncio.run(harness.run(task))
    print(result)

Pattern 2: Hybrid Cloud/Local Routing

class HybridHarness(MyHarness):
    """Route to cloud or local based on complexity.

    Sketch only: MyHarness uses the Anthropic client, so routing to the
    local model also requires swapping in an Ollama-compatible client
    (e.g. an OpenAI-style client pointed at the local server).
    """
    
    def __init__(self):
        super().__init__()
        self.cloud_model = "claude-sonnet-4"  # model IDs include date suffixes that change with releases
        self.local_model = "ollama/qwen2.5:27b"
    
    async def run(self, task: str) -> str:
        """Route to appropriate model"""
        complexity = self._estimate_complexity(task)
        
        if complexity > 0.7:
            print(f"[Complex task, {complexity:.1%}] Using cloud model")
            self.model = self.cloud_model
        else:
            print(f"[Simple task, {complexity:.1%}] Using local model")
            self.model = self.local_model
        
        return await super().run(task)
    
    def _estimate_complexity(self, text: str) -> float:
        """Heuristic complexity scorer"""
        keywords = {
            "architecture": 0.8,
            "design": 0.7,
            "refactor": 0.6,
            "bug": 0.3,
            "implement": 0.7,
            "optimize": 0.5
        }
        
        score = 0.0
        for keyword, weight in keywords.items():
            if keyword in text.lower():
                score = max(score, weight)
        
        # Length adjustment
        score += min(len(text) / 1000, 0.2)
        
        return min(score, 1.0)

Pattern 3: Custom Tool Registry (Extensibility)

# custom_tools/slack_integration.py
class SlackTool:
    name = "send_slack_message"
    description = "Send message to Slack channel"
    
    schema = {
        "name": "send_slack_message",
        "description": description,
        "input_schema": {
            "type": "object",
            "properties": {
                "channel": {
                    "type": "string",
                    "description": "Slack channel ID or name"
                },
                "message": {
                    "type": "string",
                    "description": "Message to send"
                }
            },
            "required": ["channel", "message"]
        }
    }
    
    async def execute(self, channel: str, message: str) -> str:
        import os
        from slack_sdk import WebClient
        
        client = WebClient(token=os.environ.get("SLACK_BOT_TOKEN"))
        response = client.chat_postMessage(
            channel=channel,
            text=message
        )
        return f"Message sent to {channel} (ts={response['ts']})"

# custom_tools/github_integration.py
class GitHubTool:
    name = "create_github_issue"
    # ... similar pattern

# harness_with_custom_tools.py
import importlib.util
from pathlib import Path

class ExtensibleHarness(MyHarness):
    def __init__(self):
        super().__init__()
        self.custom_tool_impls = {}  # tool name → instance, for dispatch
        self.tools.extend(self._load_custom_tools())
    
    def _load_custom_tools(self):
        """Dynamically load tools from custom_tools/ directory"""
        tools = []
        
        for tool_file in Path("custom_tools").glob("*.py"):
            if tool_file.name.startswith("_"):
                continue
            
            spec = importlib.util.spec_from_file_location(
                tool_file.stem,
                tool_file
            )
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
            
            # Find classes exposing both a schema and an execute method
            for attr_name in dir(module):
                attr = getattr(module, attr_name)
                if hasattr(attr, "schema") and hasattr(attr, "execute"):
                    tool_instance = attr()
                    # Keep the instance so _execute_tool can dispatch to it
                    self.custom_tool_impls[tool_instance.name] = tool_instance
                    tools.append(tool_instance.schema)
        
        return tools

Common Challenges & Solutions

1. Context Window Exhaustion

Symptom: long sessions exceed the model's context window; on local models this can also surface as out-of-memory errors.

# Solution: Implement context trimming
def trim_old_messages(self, max_messages: int = 20):
    """Keep only recent messages in context"""
    if len(self.conversation_history) > max_messages:
        # Replace old messages with a placeholder; a production harness
        # would ask the LLM to generate a real summary here
        old_messages = self.conversation_history[:-max_messages]
        summary = f"[Previous conversation: {len(old_messages)} messages summarized]"
        self.conversation_history = [
            {"role": "user", "content": summary}
        ] + self.conversation_history[-max_messages:]
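Message-count trimming is crude, since messages vary wildly in length; a token-budget variant matches the actual failure mode better. A rough sketch using the common ~4-characters-per-token heuristic for English text:

```python
def rough_token_count(messages: list[dict]) -> int:
    """Crude estimate: ~4 characters per token for English text."""
    return sum(len(str(m.get("content", ""))) for m in messages) // 4

def trim_to_budget(messages: list[dict], budget_tokens: int) -> list[dict]:
    """Drop oldest messages until the estimate fits the budget."""
    trimmed = list(messages)
    while len(trimmed) > 1 and rough_token_count(trimmed) > budget_tokens:
        trimmed.pop(0)  # oldest first
    return trimmed
```

For precise accounting, swap the heuristic for the provider's tokenizer (e.g. `tiktoken` for OpenAI models).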

2. API Rate Limiting

Symptom: the provider returns HTTP 429 errors when request rates spike.

# Solution: Exponential backoff
import time
from random import uniform

def call_llm_with_backoff(self, max_retries: int = 3):
    for attempt in range(max_retries):
        try:
            return self.client.messages.create(...)
        except Exception as e:
            if "429" in str(e):
                wait_time = (2 ** attempt) + uniform(0, 1)
                print(f"Rate limited, waiting {wait_time:.1f}s...")
                time.sleep(wait_time)
            else:
                raise
    # Don't silently return None after exhausting retries
    raise RuntimeError("Rate limit retries exhausted")

3. Local Model Quality Degradation

  • Use AWQ 4-bit quantisation (~38% higher throughput, <2% quality loss)
  • Hybrid routing (use cloud for complex tasks)
  • Fine-tune adapter for domain-specific knowledge

Performance Optimisation Techniques

1. Prompt Caching

# Mark the system prompt as cacheable so the provider can reuse it
# across calls (Anthropic-style prompt caching; other providers differ)
system = [
    {
        "type": "text",
        "text": SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"}
    }
]
# Pass via client.messages.create(system=system, ...)

2. Token Compression

def compress_old_context(self, max_age_turns: int = 10):
    """Replace turns older than max_age_turns with a short summary"""
    if len(self.conversation_history) > max_age_turns:
        old = self.conversation_history[:-max_age_turns]
        # Placeholder; generate a real LLM summary in production
        summary = f"[Summarized {len(old)} old turns]"
        self.conversation_history = [
            {"role": "user", "content": summary}
        ] + self.conversation_history[-max_age_turns:]

3. Tool Result Caching

from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=128)
def cached_read_file(path: str) -> str:
    """Cache file reads within a session (stale if the file changes)"""
    return Path(path).read_text()
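An `lru_cache` keyed only on the path serves stale content if the file changes mid-session. Keying on the file's mtime as well makes the cache invalidate naturally:

```python
import os
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=128)
def _read_versioned(path: str, mtime: float) -> str:
    # mtime participates in the cache key but not the read itself
    return Path(path).read_text()

def cached_read(path: str) -> str:
    """Cache hit only while the file's mtime is unchanged on disk."""
    return _read_versioned(path, os.path.getmtime(path))
```

The trade-off is one `stat` call per read, which is far cheaper than re-reading large files.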

When to Use a Framework vs Build Custom

Use an existing framework if:

  • Want to learn agent architecture (clean, understandable code)
  • Need multi-provider LLM support (Claude, OpenAI, local)
  • Building a terminal-based agent tool
  • Want extensibility via filesystem + MCP
  • Need cost optimisation (hybrid routing)
  • Prefer open-source transparency

Build custom if:

  • Highly specialised domain (medical, finance, code-specific)
  • Need IDE integration
  • Require advanced memory patterns
  • Working with proprietary infrastructure
  • Want zero external dependencies

Next Steps

  1. Study the patterns above: Understand the agentic loop and tool registry (1-2 hours)
  2. Build a basic harness: Start with Pattern 1 — single agent, 3 tools
  3. Add custom tools: Create tools in a custom_tools/ directory
  4. Optimise: Implement hybrid routing, context compression
  5. Deploy: Package as Python module or Docker container

Validation Checklist

How do you know you got this right?

Performance Checks

  • Harness runs first command in <10 seconds (including startup)
  • Second+ commands run in <5 seconds (no repeated initialisation)
  • Tool execution latency <2 seconds per tool call
  • Context window management prevents OOM: trim or summarise past messages
  • Hybrid routing works: simple tasks run on local, complex on cloud

Implementation Checks

  • Installation completed: dependencies installed or Docker build succeeds
  • LLM provider configured: API key set, test query succeeds
  • Tool definitions loaded: all core tools available
  • MCP servers connected (if used): server registration and tool discovery working
  • Custom tools added to registry: 2+ custom tools discoverable
  • Session persistence working: conversation history maintained across runs

Integration Checks

  • Tool calls parse successfully: model output matches input schema
  • Error handling: failed tool doesn’t crash loop, agent tries next action
  • Multi-provider routing: can switch between Claude/OpenAI/Ollama without code change
  • Hybrid cost/quality: simple tasks fast+cheap, complex tasks accurate
  • Context trimming: conversation history doesn’t cause OOM even after 100+ turns

Common Failure Modes

  • Tool schema mismatch: Tool definition in registry doesn’t match implementation signature
  • MCP server not connecting: Protocol/transport type incorrect or server not running
  • Hybrid routing too conservative: Using cloud for every task; refine complexity heuristic
  • Context window fills too fast: Not trimming old messages; implement token budget
  • Local model quality degradation: Base model too small for task; upgrade to 27B+ or use cloud

Sign-Off Criteria

  • Basic harness running with 3+ tools
  • LLM provider working: generating code, executing tools, parsing responses
  • Custom tools added and tested (2+ examples)
  • Performance benchmarks met: latency, memory, cost targets
  • Migration path clear: understanding how to extend for your use case

See Also

  • Doc 03 (Hugging Face): Selecting local models for hybrid routing
  • Doc 06 (Harness Architecture): How these patterns map to the 7 components
  • Doc 09 (Operations): Monitoring harness instances in production