Building a Python Agent Harness
Building an AI agent harness in Python — dual-layer architecture, multi-provider LLM support, tool registry patterns, and production optimisation techniques.
This guide covers common production patterns for building a Python-based AI agent harness. These patterns are drawn from established open-source agent frameworks and represent proven approaches to agent orchestration, tool management, and multi-provider LLM integration.
Architecture at a Glance
Dual-layer design optimised for speed and maintainability:
┌──────────────────────────────────────────────────┐
│ Python Layer                                     │
│  - Agent orchestration & LLM integration         │
│  - Session management, command routing           │
│  - Tool registry & higher-level logic            │
│  → Easy to extend, modify, learn from            │
└──────────────────────────────────────────────────┘
          ↓ (calls for performance-critical ops)
┌──────────────────────────────────────────────────┐
│ Compiled Layer (Rust, C++, or Go)                │
│  - Terminal rendering, protocol handling         │
│  - Tool execution runtime, memory-safe ops       │
│  - Zero-dependency JSON parsing                  │
│  → Fast, safe, production-ready                  │
└──────────────────────────────────────────────────┘
Why this split:
- Python: Rapid iteration, readable, LLM-friendly
- Compiled language: Performance critical, memory safety, zero dependencies
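The split can be sketched in miniature: prefer a compiled backend when one is installed and fall back to pure Python otherwise. In this illustrative sketch, orjson (a Rust-backed JSON parser) stands in for the compiled layer; any compiled extension could play the same role.

```python
# Minimal sketch of the dual-layer idea: use a compiled backend when
# available, fall back to the pure-Python standard library otherwise.
try:
    import orjson  # Rust-backed JSON parser (optional dependency)

    def parse_json(data: bytes):
        # Fast path: compiled implementation
        return orjson.loads(data)
except ImportError:
    import json

    def parse_json(data: bytes):
        # Fallback: pure-Python standard library
        return json.loads(data.decode("utf-8"))
```

The caller never knows which layer served the request, which is exactly the property the architecture above relies on.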
Core Components
A well-structured Python agent harness contains these key modules:
| Module | Purpose | Extend for… |
|---|---|---|
| main.py | CLI entry point | Adding new slash commands |
| agent_runtime.py | Main agent loop | Core reasoning patterns |
| agent_tools.py | Tool definitions | Adding new tools |
| agent_registry.py | Built-in tool registry | Custom tool discovery |
| agent_session.py | Session state | Memory management |
| models.py | LLM message structures | Custom message types |
Multi-Provider LLM Support
A key advantage of building your own harness is abstracting the LLM provider:
# Choose your LLM with simple config
models = {
    "claude": "claude-sonnet-4",             # Anthropic (model IDs include date suffixes that change with releases)
    "openai": "gpt-4o",                      # OpenAI
    "gemini": "gemini-2.0-flash",            # Google
    "ollama": "qwen2.5:27b",                 # Local
    "huggingface": "meta-llama/Llama-3-70b"  # HF API
}
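Switching providers can be cheaper than it looks: several entries above (Ollama, many Hugging Face endpoints) speak the OpenAI-compatible API, so provider selection can often reduce to a (base_url, model) pair. A hedged sketch; the URLs are illustrative defaults.

```python
# Sketch: map provider names to the endpoint a generic OpenAI-compatible
# client would use. Only the two entries shown are assumed here.
BASE_URLS = {
    "openai": "https://api.openai.com/v1",
    "ollama": "http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
}

def resolve_provider(name: str, models: dict) -> dict:
    """Return the settings a generic client needs for this provider."""
    if name not in models:
        raise KeyError(f"unknown provider: {name}")
    return {"model": models[name], "base_url": BASE_URLS.get(name)}
```

Providers with bespoke APIs (e.g. Anthropic's Messages API) still need their own client, but the config shape stays the same.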
Cost implications:
- Claude 3.5 Sonnet: ~$3/1M input tokens ($15/1M output)
- GPT-4o: ~$2.50/1M input tokens ($10/1M output)
- Local Qwen 27B (4-bit): $0/token + electricity (~$10/month)
Common Tool Categories
A production agent harness typically includes tools across these categories:
File & Directory Operations
- file_read — Read file contents
- file_write — Create or update files
- file_delete — Delete files
- file_list — List directory contents
- file_search — Grep across codebase
Code Execution
- bash_exec — Run shell commands (with security checks)
- git_operations — Clone, commit, push, pull
- lsp_integration — Code intelligence (definitions, references, hover)
External APIs & Knowledge
- web_scrape — Fetch & parse web content
- web_search — Search engines (Bing/Google)
- http_request — Make HTTP calls
Agent Spawning
- spawn_agent — Launch subagents
- agent_manage — List, stop, inspect agents
Advanced
- mcp_server_list — List connected MCP servers
- canvas_render — Terminal-based rich output
- oauth_flow — Handle authentication (PKCE)
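Whatever the category, tool lookup usually reduces to a name-to-handler dispatch table. A minimal sketch covering part of the file category (names match the list above; per-tool validation and sandboxing are omitted):

```python
from pathlib import Path

def _file_write(path: str, content: str) -> str:
    Path(path).write_text(content)
    return f"wrote {len(content)} characters to {path}"

def _file_list(path: str = ".") -> str:
    return "\n".join(sorted(p.name for p in Path(path).iterdir()))

# Dispatch table: tool name -> callable. Keeps lookup O(1) and makes
# the tool surface easy to audit.
TOOL_HANDLERS = {
    "file_read": lambda path: Path(path).read_text(),
    "file_write": _file_write,
    "file_list": _file_list,
}

def dispatch_tool(name: str, **kwargs) -> str:
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return f"Error: unknown tool {name!r}"
    try:
        return handler(**kwargs)
    except Exception as e:
        # Return errors as strings so the model can see and react to them
        return f"Error: {e}"
```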
Model Context Protocol (MCP) Integration
Extend tool capabilities via MCP servers. Common transport types:
# stdio: Local process
mcp_servers = [
    {
        "transport": "stdio",
        "command": "python",
        "args": ["-m", "my_mcp_server"]
    }
]

# HTTP: Remote service
mcp_servers = [
    {
        "transport": "http",
        "url": "http://localhost:3000/mcp"
    }
]

# WebSocket, SSE, SDK also supported
Benefit: Drop in any MCP server without modifying your harness; the harness discovers and registers all of the server's tools automatically.
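Whichever MCP client library you use, it helps to validate the transport-specific fields before connecting. A minimal sketch; the field names follow the config examples above, and the supported-transport set is an assumption to adjust for your client:

```python
# Validate MCP server configs before handing them to a client.
SUPPORTED_TRANSPORTS = {"stdio", "http", "websocket", "sse"}

def validate_mcp_config(cfg: dict) -> dict:
    transport = cfg.get("transport")
    if transport not in SUPPORTED_TRANSPORTS:
        raise ValueError(f"unsupported transport: {transport!r}")
    if transport == "stdio" and "command" not in cfg:
        raise ValueError("stdio transport requires a 'command'")
    if transport in {"http", "websocket", "sse"} and "url" not in cfg:
        raise ValueError(f"{transport} transport requires a 'url'")
    return cfg
```

Failing fast here gives a clear error at startup instead of a confusing timeout mid-session.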
Choosing Your Approach
| Feature | Vendor CLI (e.g. Claude Code) | Custom Python Harness |
|---|---|---|
| Installation | Single command | Git + Python setup |
| Model Support | Vendor-locked | Any provider + local |
| IDE Integration | Deep (VS Code, JetBrains) | Terminal or custom |
| Extensibility | Limited | Filesystem + MCP |
| Cost | Subscription | Pay-as-you-go |
| Maturity | Production-tested | You control quality |
| Self-hosting | No | Yes |
| Learning Value | Opaque | Full transparency |
Performance Characteristics
Speed (Tokens/Second):
- Qwen 27B (RTX 4090, quantised): ~40 tokens/sec
- Claude 3.5 Sonnet (API): 80-150 tokens/sec
- Ollama 7B (M1 MacBook): 15-20 tokens/sec
Memory Usage:
- Recommendations by hardware:
- 24GB+ VRAM: Safe to use 128K context
- 16GB VRAM: Keep to 32K context (above causes quality drop)
- 8GB VRAM: Stay at 8K context (severely constrained)
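These recommendations can be encoded as a small guard when configuring a local model; the thresholds simply mirror the table above:

```python
def max_context_for_vram(vram_gb: float) -> int:
    """Map available VRAM to a safe context length (mirrors the guidance above)."""
    if vram_gb >= 24:
        return 128_000
    if vram_gb >= 16:
        return 32_000
    return 8_000
```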
Cost Analysis (typical developer, 2M tokens/month):
| Strategy | Monthly Cost |
|---|---|
| Cloud only (Sonnet) | $60 |
| Local only (Qwen) | $0 (+electricity $10-20) |
| Hybrid routing | $12-30 (up to 80-90% reduction when most requests route locally) |
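The hybrid row follows from simple arithmetic. Assuming the blended cloud rate implied by the table (about $30 per million tokens, from $60 for 2M) and a flat electricity cost for the local box:

```python
def hybrid_monthly_cost(tokens_millions: float, cloud_rate_per_m: float,
                        local_fraction: float, electricity: float) -> float:
    """Cloud cost on the non-local share, plus flat electricity for the local box."""
    cloud_tokens = tokens_millions * (1.0 - local_fraction)
    return cloud_tokens * cloud_rate_per_m + electricity

# 80% of requests routed locally at the example rates:
# hybrid_monthly_cost(2.0, 30.0, 0.8, 10.0) -> 22.0, inside the $12-30 band
```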
Hardware ROI:
- RTX 4090: ~$1,500; pays for itself in 7.5-10 months at $150-200/month of displaced cloud spend (closer to 2.5-3 years at the $60/month profile above)
- Profitable for 3-5 year lifespan
Building Your Own Harness: Practical Patterns
Pattern 1: Basic Single-Agent Harness
# my_harness.py
from pathlib import Path

from anthropic import Anthropic


class MyHarness:
    def __init__(self, model: str = "claude-sonnet-4"):  # model IDs include date suffixes that change with releases
        self.client = Anthropic()
        self.model = model
        self.conversation_history = []
        self.tools = self._define_tools()

    def _define_tools(self):
        """Define available tools"""
        return [
            {
                "name": "read_file",
                "description": "Read contents of a file",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "path": {
                            "type": "string",
                            "description": "File path to read"
                        }
                    },
                    "required": ["path"]
                }
            },
            {
                "name": "write_file",
                "description": "Write contents to a file",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "path": {"type": "string"},
                        "content": {"type": "string"}
                    },
                    "required": ["path", "content"]
                }
            },
            {
                "name": "bash_exec",
                "description": "Execute bash command",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "command": {
                            "type": "string",
                            "description": "Command to execute"
                        }
                    },
                    "required": ["command"]
                }
            }
        ]

    async def run(self, task: str) -> str:
        """Execute task with agentic loop"""
        self.conversation_history.append({
            "role": "user",
            "content": task
        })
        while True:
            # Call LLM
            response = self.client.messages.create(
                model=self.model,
                max_tokens=4096,
                tools=self.tools,
                messages=self.conversation_history
            )
            # Collect results for every tool_use block
            # (a single response may contain several)
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": self._execute_tool(block.name, block.input)
                    })
            if not tool_results:
                # Final response: no more tools requested
                return "".join(
                    block.text for block in response.content
                    if block.type == "text"
                )
            # Record the assistant turn once, then all tool results
            # in a single user turn
            self.conversation_history.append({
                "role": "assistant",
                "content": response.content
            })
            self.conversation_history.append({
                "role": "user",
                "content": tool_results
            })

    def _execute_tool(self, tool_name: str, tool_input: dict) -> str:
        """Execute a tool and return result"""
        try:
            if tool_name == "read_file":
                path = Path(tool_input["path"])
                return path.read_text()
            elif tool_name == "write_file":
                path = Path(tool_input["path"])
                path.write_text(tool_input["content"])
                return f"Wrote {len(tool_input['content'])} characters to {path}"
            elif tool_name == "bash_exec":
                import subprocess
                result = subprocess.run(
                    tool_input["command"],
                    shell=True,
                    capture_output=True,
                    text=True
                )
                return f"Exit code: {result.returncode}\n{result.stdout}{result.stderr}"
            else:
                return f"Error: unknown tool {tool_name!r}"
        except Exception as e:
            return f"Error: {str(e)}"


# Usage
if __name__ == "__main__":
    import asyncio

    harness = MyHarness()
    task = "Create a Python script that generates Fibonacci sequence up to 10 numbers"
    result = asyncio.run(harness.run(task))
    print(result)
Pattern 2: Hybrid Cloud/Local Routing
class HybridHarness(MyHarness):
    """Route to cloud or local based on complexity"""

    def __init__(self):
        super().__init__()
        self.cloud_model = "claude-sonnet-4"  # model IDs include date suffixes that change with releases
        self.local_model = "ollama/qwen2.5:27b"
        self.use_ollama = True

    async def run(self, task: str) -> str:
        """Route to appropriate model"""
        complexity = self._estimate_complexity(task)
        if complexity > 0.7:
            print(f"[Complex task, {complexity:.1%}] Using cloud model")
            self.model = self.cloud_model
            self.use_ollama = False
        else:
            print(f"[Simple task, {complexity:.1%}] Using local model")
            self.model = self.local_model
            self.use_ollama = True
        return await super().run(task)

    def _estimate_complexity(self, text: str) -> float:
        """Heuristic complexity scorer"""
        keywords = {
            "architecture": 0.8,
            "design": 0.7,
            "refactor": 0.6,
            "bug": 0.3,
            "implement": 0.7,
            "optimize": 0.5
        }
        score = 0.0
        for keyword, weight in keywords.items():
            if keyword in text.lower():
                score = max(score, weight)
        # Length adjustment
        score += min(len(text) / 1000, 0.2)
        return min(score, 1.0)
Pattern 3: Custom Tool Registry (Extensibility)
# custom_tools/slack_integration.py
class SlackTool:
    name = "send_slack_message"
    description = "Send message to Slack channel"
    schema = {
        "name": "send_slack_message",
        "description": description,
        "input_schema": {
            "type": "object",
            "properties": {
                "channel": {
                    "type": "string",
                    "description": "Slack channel ID or name"
                },
                "message": {
                    "type": "string",
                    "description": "Message to send"
                }
            },
            "required": ["channel", "message"]
        }
    }

    async def execute(self, channel: str, message: str) -> str:
        import os

        from slack_sdk import WebClient

        client = WebClient(token=os.environ.get("SLACK_BOT_TOKEN"))
        response = client.chat_postMessage(
            channel=channel,
            text=message
        )
        return f"Message sent to {channel} (ts={response['ts']})"


# custom_tools/github_integration.py
class GitHubTool:
    name = "create_github_issue"
    # ... similar pattern


# harness_with_custom_tools.py
import importlib.util
from pathlib import Path


class ExtensibleHarness(MyHarness):
    def __init__(self):
        super().__init__()
        self.custom_executors = {}  # tool name -> instance, so the harness can dispatch calls
        self.custom_tools = self._load_custom_tools()
        self.tools.extend(self.custom_tools)

    def _load_custom_tools(self):
        """Dynamically load tools from custom_tools/ directory"""
        tools = []
        for tool_file in Path("custom_tools").glob("*.py"):
            if tool_file.name.startswith("_"):
                continue
            spec = importlib.util.spec_from_file_location(
                tool_file.stem,
                tool_file
            )
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
            # Find classes exposing both a schema and an execute method
            for attr_name in dir(module):
                attr = getattr(module, attr_name)
                if (hasattr(attr, "schema") and
                        hasattr(attr, "execute")):
                    tool_instance = attr()
                    self.custom_executors[tool_instance.name] = tool_instance
                    tools.append(tool_instance.schema)
        return tools
Common Challenges & Solutions
1. Context Window Exhaustion
Context window exhaustion (OOM errors):
# Solution: Implement context trimming
def trim_old_messages(self, max_messages: int = 20):
    """Keep only recent messages in context"""
    if len(self.conversation_history) > max_messages:
        # Summarize old messages
        old_messages = self.conversation_history[:-max_messages]
        summary = f"[Previous conversation: {len(old_messages)} messages summarized]"
        self.conversation_history = [
            {"role": "user", "content": summary}
        ] + self.conversation_history[-max_messages:]
2. API Rate Limiting
API rate limiting (429 errors):
# Solution: Exponential backoff
import time
from random import uniform

def call_llm_with_backoff(self, max_retries: int = 3):
    for attempt in range(max_retries):
        try:
            return self.client.messages.create(...)
        except Exception as e:
            if "429" in str(e) and attempt < max_retries - 1:
                wait_time = (2 ** attempt) + uniform(0, 1)
                print(f"Rate limited, waiting {wait_time:.1f}s...")
                time.sleep(wait_time)
            else:
                # Re-raise non-rate-limit errors and the final failed attempt
                raise
3. Local Model Quality Degradation
- Use AWQ 4-bit quantisation (~38% higher throughput, <2% quality loss)
- Hybrid routing (use cloud for complex tasks)
- Fine-tune an adapter (e.g. LoRA) for domain-specific knowledge
Performance Optimisation Techniques
1. Prompt Caching
# Cache the large, stable system prompt so the provider can reuse it
# across calls. With the Anthropic API this is a cache_control marker
# on the system content block (SYSTEM_PROMPT defined elsewhere):
system = [
    {
        "type": "text",
        "text": SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"}
    }
]
2. Token Compression
def compress_old_context(self, max_age_turns: int = 10):
    """Summarize old turns to preserve context"""
    if len(self.conversation_history) > max_age_turns:
        old = self.conversation_history[:-max_age_turns]
        summary = f"[Summarized {len(old)} old turns]"
        self.conversation_history = [
            {"role": "user", "content": summary}
        ] + self.conversation_history[-max_age_turns:]
3. Tool Result Caching
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=128)
def cached_read_file(path: str) -> str:
    """Cache file reads within a session"""
    # Call cached_read_file.cache_clear() after writes to avoid stale reads
    return Path(path).read_text()
When to Use a Framework vs Build Custom
Use an existing framework if:
- Want to learn agent architecture (clean, understandable code)
- Need multi-provider LLM support (Claude, OpenAI, local)
- Building a terminal-based agent tool
- Want extensibility via filesystem + MCP
- Need cost optimisation (hybrid routing)
- Prefer open-source transparency
Build custom if:
- Highly specialised domain (medical, finance, code-specific)
- Need IDE integration
- Require advanced memory patterns
- Working with proprietary infrastructure
- Want zero external dependencies
Next Steps
- Study the patterns above: Understand the agentic loop and tool registry (1-2 hours)
- Build a basic harness: Start with Pattern 1 (single agent, 3 tools)
- Add custom tools: Create tools in a custom_tools/ directory
- Optimise: Implement hybrid routing, context compression
- Deploy: Package as a Python module or Docker container
Validation Checklist
How do you know you got this right?
Performance Checks
- Harness runs first command in <10 seconds (including startup)
- Second+ commands run in <5 seconds (no repeated initialisation)
- Tool execution latency <2 seconds per tool call
- Context window management prevents OOM: trim or summarise past messages
- Hybrid routing works: simple tasks run on local, complex on cloud
Implementation Checks
- Installation completed: dependencies installed or Docker build succeeds
- LLM provider configured: API key set, test query succeeds
- Tool definitions loaded: all core tools available
- MCP servers connected (if used): server registration and tool discovery working
- Custom tools added to registry: 2+ custom tools discoverable
- Session persistence working: conversation history maintained across runs
Integration Checks
- Tool calls parse successfully: model output matches input schema
- Error handling: a failed tool doesn't crash the loop; the agent tries the next action
- Multi-provider routing: can switch between Claude/OpenAI/Ollama without code change
- Hybrid cost/quality: simple tasks fast+cheap, complex tasks accurate
- Context trimming: conversation history doesn’t cause OOM even after 100+ turns
Common Failure Modes
- Tool schema mismatch: Tool definition in registry doesn’t match implementation signature
- MCP server not connecting: Protocol/transport type incorrect or server not running
- Hybrid routing too conservative: Using cloud for every task; refine complexity heuristic
- Context window fills too fast: Not trimming old messages; implement token budget
- Local model quality degradation: Base model too small for task; upgrade to 27B+ or use cloud
Sign-Off Criteria
- Basic harness running with 3+ tools
- LLM provider working: generating code, executing tools, parsing responses
- Custom tools added and tested (2+ examples)
- Performance benchmarks met: latency, memory, cost targets
- Migration path clear: understanding how to extend for your use case
See Also
- Doc 03 (Hugging Face): Selecting local models for hybrid routing
- Doc 06 (Harness Architecture): How these patterns map to the 7 components
- Doc 09 (Operations): Monitoring harness instances in production