The Harness Handbook — Start Here
Master navigation: learning paths, role-based guides, and goal-based workflows for the complete AI/ML Engineering Handbook.
A comprehensive guide to AI/ML engineering across foundations, hardware, agent harnesses, and real-world applications. Whether you’re building models, choosing hardware, designing agents, or deploying to production, this handbook provides the knowledge to succeed.
Status: April 2026 | Covers KV cache optimization (GQA, PagedAttention, TurboQuant), Karpathy’s LLM Wiki pattern, and production harness architecture
Handbook Structure: Four Parts
The handbook is organized into four major parts, each serving different needs:
Part 1: AI/ML Foundations (Docs 21-22)
Understand how AI models work at a fundamental level.
- Doc 21: Model Fundamentals — Weights, parameters, neural networks, transformers, training
- Doc 22: Knowledge Transfer Methods — Distillation, fine-tuning, LoRA, RAG
Part 2: Hardware & Systems (Docs 23-24, 26, 28)
Choose the right hardware and understand system-level concerns.
- Doc 24: Hardware Landscape — CPU vs GPU, NVIDIA, Apple Silicon, mobile chips, cost/performance trade-offs
- Doc 23: Apple Intelligence & CoreML — on-device inference on Apple hardware
- Doc 26: Cross-platform inference — ONNX export, TensorRT, TF Lite
- Doc 28: Unified Memory Economics — Apple M-series advantages
Part 3: Agent Harnesses (Docs 01-20)
Design, build, and deploy production AI agent systems.
- Foundation models, agents, memory, security, testing, deployment
- Reference implementations and patterns
- This is the original core curriculum
Part 4: Real-World Applications (Docs 25, 27)
Learn how to apply harnesses to specific domains and use cases.
- Doc 25: Edge & Physical AI
- Doc 27: Real-World AI Applications
What Happened? Handbook Improvements (April 18, 2026)
This handbook was restructured and expanded from the original “Harness Corpus” into a comprehensive AI/ML handbook. Key improvements:
- Expanded glossary (glossary.md) — 125+ terms covering foundations, hardware, and applications
- New Part 1 & 2 — Model fundamentals and hardware landscape added
- Three critical foundation documents — Operations & Observability (09), Security & Safety (10), Testing & QA (11)
- Deployment guide (12_deployment_patterns.md) — Docker, Kubernetes, CI/CD patterns
- Single primary path — Clearer “Recommended Path” with alternatives clearly marked
- Phase labeling — Documents marked as Phase 1 (Critical) or Phase 2 (Important)
- Production deployment checklist — Clear pre-deployment validation steps
- Consolidated performance metrics — One reference table for all benchmarks
- “Next Steps After Week 6” — Advanced topics and continuous learning
- New “Quick Reference by Domain” — Find answers by role and goal
Quick Navigation
New to these terms? → See glossary.md for definitions of Agent, Harness, LLM, Token, ReAct, and 120+ other key concepts.
Recommended Primary Path
Choose based on your background:
Option A: Fast Track (For Practitioners, 5-6 weeks)
1. 01_foundation_models.md — What models exist and when to use them
2. 05_ai_agents.md — How agents think and decide (ReAct, Chain-of-Thought, Tree of Thoughts, etc.)
3. 06_harness_architecture.md — Seven components of a complete harness
4. 08_claw_code_python.md — Reference implementation in Python
5. 04_memory_systems.md — Memory systems and the LLM Wiki pattern (compiled markdown knowledge)
6. 02_kv_cache_optimization.md — Optimization for long contexts
Then proceed to “Production Deployment” below.
Option B: Foundations-First (For Researchers, 7-8 weeks)
1. 21_model_fundamentals.md — How neural networks, transformers, and weights actually work
2. 22_knowledge_transfer_methods.md — Distillation, fine-tuning, LoRA, RAG (theoretical foundations)
3. 01_foundation_models.md — Practical model selection
4. 05_ai_agents.md — Agentic reasoning frameworks
5. 06_harness_architecture.md — Harness design
6. 24_hardware_landscape.md — Hardware understanding (CPU, GPU, Neural Engines, cost/performance)
7. Rest of Option A
Then, choose your path based on your needs:
For Production Deployment (After step 4 in Option A or step 6 in Option B):
→ 11_testing_and_qa.md (establish quality baselines) [PHASE 1 - CRITICAL]
→ 09_operations_and_observability.md (monitoring & debugging) [PHASE 1 - CRITICAL]
→ 10_security_and_safety.md (security hardening) [PHASE 1 - CRITICAL]
→ 12_deployment_patterns.md (containerization & orchestration) [PHASE 2 - Important]
Alternative Starting Points
If you’re building in Python right now → Jump directly to 08_claw_code_python.md, then revisit foundational docs as needed
If you only care about optimization → 02_kv_cache_optimization.md + 03_huggingface_ecosystem.md (quantization specifics)
If you’re researching agents → 05_ai_agents.md → 04_memory_systems.md → then reference implementations in 08
Quick lookup by topic:
- What are weights and how do neural networks work? → 21_model_fundamentals.md (weights, parameters, neurons, layers, transformers)
- What’s the difference between distillation and fine-tuning? → 22_knowledge_transfer_methods.md
- Should I buy RTX 4090 or MacBook Pro? → 24_hardware_landscape.md (GPU vs CPU, Apple Silicon, unified memory, cost/performance)
- How do I choose a model? → 01_foundation_models.md + 03_huggingface_ecosystem.md
- How do I optimize inference? → 02_kv_cache_optimization.md (GQA, PagedAttention, KV cache quantization, TurboQuant)
- How do I build memory systems? → 04_memory_systems.md (includes Karpathy’s LLM Wiki pattern for compiled markdown knowledge)
- How do agents work? → 05_ai_agents.md (ReAct, ToT, Reflexion frameworks)
- What are the components of a harness? → 06_harness_architecture.md
- I’m building in Python, where do I start? → 08_claw_code_python.md (installation + patterns)
- How do I monitor & debug production harnesses? → 09_operations_and_observability.md (logging, metrics, cost tracking, debugging) [PHASE 1]
- How do I protect my harness from attack? → 10_security_and_safety.md (injection, validation, rate limiting, compliance) [PHASE 1]
- How do I test my harness before production? → 11_testing_and_qa.md (non-deterministic testing, regression, quality metrics) [PHASE 1]
- How do I deploy my harness to production? → 12_deployment_patterns.md (Docker, Kubernetes, CI/CD, scaling) [PHASE 2]
- How do I embed a harness into my application? → 20_integration_patterns.md (REST API, background jobs, events, GraphQL, WebSocket, Slack/Discord bots) [PHASE 2]
By role…
Software Engineer Building Python Harness (Use Option A: Fast Track)
- Phase 1 (Critical): 08 (Claw-Code), 05 (Agents), 06 (Architecture), 04 (Memory), 11 (Testing), 09 (Monitoring), 10 (Security)
- Phase 2: 12 (Deployment), 20 (Integration), 03 (HF ecosystem), 02 (Optimization)
- Optional Reference: 01 (Models), 21 (Fundamentals) — read when you need theoretical grounding
ML Engineer / ML Researcher (Use Option B: Foundations-First)
- Phase 1 Foundations: 21 (Model fundamentals), 22 (Knowledge transfer methods), 24 (Hardware landscape)
- Phase 2 Applications: 01 (Foundation models), 03 (Hugging Face), 04 (Memory systems), 02 (KV cache)
- Deep dive topics: KV cache techniques in doc 02, LLM Wiki pattern in 04, quantization details in 03
- Build production: 05 (Agents) → 06 (Architecture) → 08 (Implementation) → 09-11 (Quality gates)
Hardware Engineer / Systems Builder
- Week 1: 24 (Hardware landscape) — CPU vs GPU, Apple Silicon, mobile chips, unified memory, cost/performance analysis
- Week 2: 21 (Model fundamentals), 02 (KV cache), 01 (Model selection) — understand how hardware impacts inference
- Week 3: 28 (Unified memory economics) — deep dive into Apple M-series advantages
- Week 4: 09 (Observability), 12 (Deployment) — production concerns
- Focus areas: VRAM requirements, thermal management, cost per inference, memory bandwidth
Roboticist / Physical AI Engineer
- Foundations: 21 (Model fundamentals), 24 (Hardware) — especially mobile/edge sections
- Agentic systems: 05 (Agents), 06 (Harness architecture)
- On-device inference: 23 (Apple Intelligence & CoreML), 25 (Edge & Physical AI)
- Real-world applications: 27 (Real-world AI applications — section on robotics/autonomous vehicles)
- Production setup: 09 (Observability for robot telemetry), 12 (Deployment)
- Then build: 08 (Implementation), 04 (Memory for robot decision-making)
Learning Agentic AI from Scratch (Complete Beginner)
- Start with fundamentals: Option B path (21 → 22 → 01 → 05 → 06 → 08)
- Then master architecture: 04 (Memory) → 02 (Optimization) → 03 (Model ecosystem)
- Then build for production: 11 (Testing) → 09 (Monitoring) → 10 (Security) → 12 (Deployment)
- Skip until ready: 07 (Open-source agent architectures) — only needed after understanding core concepts
- Reference: glossary.md for any unfamiliar terms
Product Manager / Architecture Decision Maker
- Understand capabilities: 01 (Foundation models), 05 (Reasoning frameworks) — why harnesses make sense
- Understand architecture: 06 (Seven harness components), 08 (Python implementation)
- Cost decisions: 13 (Cost management) + 24 (Hardware landscape) — understand potential cost savings with hybrid routing
- Risk management: 10 (Security & safety), 17 (Regulatory & ethics)
- Key insights: Cost (SLMs 10-30× cheaper), Speed (agent loops 100-1000× faster), Capability (LLMs for verification)
Wave 4 Documents: When to Read These
Wave 4 (Docs 21-28) are the “AI/ML Foundations & Hardware” section. They’re recommended for everyone eventually, but timing matters:
| Your Background | When to Read Wave 4 | Priority |
|---|---|---|
| Software engineer (practicing) | After understanding harnesses (docs 01-08), read 21-22 when you want deeper model knowledge | Optional |
| ML engineer (doing research) | Read first (21-22, 24), before diving into harness-specific docs | Critical |
| Hardware specialist | Start with doc 24, then 21-22, then understand inference implications | Critical |
| Roboticist | Read 21 (fundamentals) and 24 (hardware), then read 25 for applications | Important |
| DevOps/SRE | Read 24 (hardware) and 13 (cost management); skip 21-22 unless optimizing models | Optional |
| Complete beginner | Follow “Option B: Foundations-First” path (uses all Wave 4 sequentially) | Critical |
Quick decisions:
- “I want to understand how models work” → 21 (Model Fundamentals) + 22 (Knowledge Transfer)
- “I want to choose hardware” → 24 (Hardware Landscape) + 28 (Unified Memory Economics)
- “I want to apply to real domains” → 25 (Edge & Physical AI) + 27 (Real-World Applications)
- “I want everything” → Follow Option B path in “Recommended Primary Path” above
Common Workflows: “I want to…”
Goal-based navigation for specific outcomes. Each workflow shows the doc sequence, estimated time, and what you’ll achieve.
“I want to build a customer support bot” (2-3 weeks)
Week 1: Understand + Design
01 → Choose SLM for triage, LLM for escalation (2h)
05 → Learn ReAct loop, build intent classifier (3h)
06 → Design harness: tools (ticket API, KB search), memory (session), loop (ReAct) (3h)
08 → Clone starter harness, install dependencies, first working loop (4h)
Week 2: Build + Test
04 → Implement session memory + persistent FAQ knowledge base (3h)
15 → Design system prompt: "You are a support agent. Classify, answer, or escalate." (2h)
11 → Test with 50+ sample conversations, measure success rate (4h)
10 → Add input validation, rate limiting, PII filtering (3h)
Week 3: Deploy + Monitor
09 → Add structured logging: ticket_id, intent, resolution, cost (3h)
12 → Dockerize, deploy to staging, run integration tests (4h)
13 → Set up cost tracking: cost per ticket, daily budget alerts (2h)
18 → Create runbook: "Agent stuck in loop", "High cost alert" (1h)
Result: Production-ready support bot with monitoring, security, and cost controls.
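The ReAct loop at the heart of this workflow can be sketched in a few lines of Python. Everything below is illustrative: `call_model`, the `ACTION:`/`FINAL:` message protocol, and the single `search_kb` tool are stand-ins for a real LLM client and tool registry, not APIs from doc 08.

```python
def search_kb(query: str) -> str:
    """Hypothetical knowledge-base search tool."""
    return f"KB results for: {query}"

TOOLS = {"search_kb": search_kb}

def call_model(history: list[str]) -> str:
    """Stand-in for an LLM call; a real harness would hit a model API here."""
    if any("KB results" in entry for entry in history):
        return "FINAL: Reset your password via the account page."
    return "ACTION: search_kb: password reset"

def react_loop(question: str, max_iterations: int = 5) -> str:
    """Reason → act → observe until the model emits a final answer."""
    history = [f"QUESTION: {question}"]
    for _ in range(max_iterations):  # iteration cap doubles as loop detection
        step = call_model(history)
        if step.startswith("FINAL:"):
            return step.removeprefix("FINAL:").strip()
        _, tool_name, tool_arg = (part.strip() for part in step.split(":", 2))
        history.append(f"OBSERVATION: {TOOLS[tool_name](tool_arg)}")
    return "Escalated: iteration limit reached."
```

The iteration cap is the same "loop detection" safeguard the production checklist later calls for.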
“I want to reduce my harness costs by 50%” (1 week)
Day 1: Understand Current Costs
13 → Implement token counting if not done, measure baseline (3h)
09 → Review logs: which operations burn the most tokens? (2h)
Day 2: Quick Wins (typically saves 30-50%)
02 → Enable KV cache quantization (GQA, INT8/INT4) (1h)
01 → Switch to SLM for simple tasks (classification, routing) (2h)
15 → Shorten system prompts, remove redundant instructions (1h)
Day 3: Deeper Optimizations
03 → Quantize model (INT4 or INT8) for faster, cheaper inference (2h)
14 → Add caching for repeated queries, memoize tool results (3h)
04 → Trim memory: only load what's needed per session (1h)
Day 4-5: Validate + Monitor
13 → Compare new vs old cost per request, set budget alerts (2h)
11 → Regression test: did quality drop? If >5% drop, roll back that change (3h)
Result: 40-70% cost reduction while maintaining 90%+ quality. Typical savings: $2K-$10K/month.
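The token-counting step in Day 1 amounts to simple arithmetic over per-token rates. A minimal sketch, using the per-1M-token prices from the cost table later in this handbook (check live pricing before relying on these numbers):

```python
# Per-request cost tracker sketch; rates mirror this handbook's cost table.
RATES = {  # (input $/1M tokens, output $/1M tokens)
    "claude-3.5-sonnet": (3.00, 15.00),
    "gpt-4o": (5.00, 15.00),
    "local-slm": (0.00, 0.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request given token counts and per-1M rates."""
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A 2,000-token prompt with a 500-token reply on Claude 3.5 Sonnet:
cost = request_cost("claude-3.5-sonnet", 2_000, 500)  # → $0.0135
```

Summing these per-request costs per day gives the baseline the budget alerts compare against.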
“I want to deploy to production safely” (1-2 weeks)
Pre-Flight (Days 1-3)
11 → Run quality tests: 50+ test cases, success rate ≥90% (4h)
10 → Security audit: input validation, prompt injection defense, rate limiting (3h)
09 → Implement structured logging: every request, every error, every cost (3h)
17 → Compliance check: GDPR data handling, audit trail, user consent (2h)
Deployment (Days 4-7)
12 → Dockerize application, write K8s manifests or serverless config (4h)
12 → Deploy to staging, run smoke tests (2h)
09 → Connect monitoring: dashboards, alerts, on-call rotation (3h)
13 → Set production cost budgets and alerts (1h)
Go-Live (Days 8-10)
12 → Deploy to production with canary (10% traffic) (2h)
18 → Prepare runbook: common failures and response procedures (2h)
09 → Monitor first 48 hours: latency, errors, cost, quality (ongoing)
13 → Review first week costs vs projections (1h)
Result: Production deployment with monitoring, security, cost controls, and incident procedures.
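The 10% canary in the go-live step can be implemented with deterministic hash-based bucketing, so each user consistently sees one version across a session. A sketch under assumed names (`assign_version` is illustrative, not an API from doc 12):

```python
import hashlib

def assign_version(user_id: str, canary_fraction: float = 0.10) -> str:
    """Deterministically bucket a user into 'canary' or 'stable'."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_fraction * 100 else "stable"

# The same user always lands in the same bucket, so sessions stay consistent
# while roughly canary_fraction of all users hit the new version.
```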
“I want to deploy AI on edge devices” (2-3 weeks)
Week 1: Foundations
21 → Understand model architecture, what can be compressed (3h)
24 → Choose hardware: phone chip, Raspberry Pi, custom board (3h)
22 → Learn distillation (shrink cloud model → edge model) (3h)
03 → Find quantized models (GGUF, INT4) for your hardware (2h)
Week 2: Implementation
25 → Edge deployment patterns, latency budgets, power constraints (4h)
23 → If Apple: CoreML conversion, Neural Engine optimization (4h)
26 → If cross-platform: ONNX export, TensorRT or TF Lite (4h)
02 → KV cache optimization for limited memory (2h)
Week 3: Integration + Testing
06 → Design harness for edge: minimal memory, fast loop, local tools (3h)
28 → Unified memory math: how much model fits on your device? (2h)
11 → Test on actual hardware: latency, accuracy, battery life (4h)
27 → Real-world deployment patterns from robotics, automotive, IoT (3h)
Result: Working model on edge hardware with optimized latency and power consumption.
“I want to understand agentic AI from scratch” (4-6 weeks)
Week 1-2: Theory
21 → Neural networks, transformers, weights, training fundamentals (5h)
22 → Knowledge transfer: distillation, fine-tuning, RAG (4h)
01 → Foundation models: LLM vs SLM, when to use each (2h)
05 → Agent frameworks: CoT, ReAct, Tree of Thoughts, Reflexion (4h)
Week 3: Architecture
06 → Seven components of a harness (3h)
04 → Memory systems: four layers, RAG, LLM Wiki pattern (3h)
02 → KV cache and inference optimization (2h)
03 → Hugging Face ecosystem: finding and evaluating models (2h)
Week 4-5: Build
08 → Clone Claw-Code, build your first agent (6h)
14 → Advanced patterns: tool composition, state machines, caching (4h)
15 → Prompt engineering: make your agent smarter (3h)
24 → Hardware: understand GPU/CPU trade-offs for your setup (2h)
Week 6: Production
11 → Testing non-deterministic systems (3h)
09 → Operations and observability (3h)
10 → Security hardening (2h)
12 → Deployment patterns (2h)
Result: Complete understanding of agentic AI from theory through production deployment.
“I want to improve my agent’s quality” (3-5 days)
Day 1: Measure Current Quality
16 → Set up evaluation framework: accuracy, relevance, coherence metrics (3h)
11 → Run baseline tests: 50+ cases, record success rate (2h)
Day 2: Improve Prompts
15 → Rewrite system prompt with few-shot examples (2h)
05 → Check: is your reasoning framework right for the task? (1h)
14 → Add self-correction loop or verification step (2h)
Day 3: Improve Knowledge
04 → Review memory system: is the right context loaded? (2h)
22 → Consider fine-tuning if domain-specific accuracy < 85% (3h)
Day 4-5: Validate
16 → Re-run evaluation, compare to baseline (2h)
11 → Regression test on original tasks (still works?) (2h)
13 → Check: did quality improvements increase cost? Worth it? (1h)
Result: Measured quality improvement with clear before/after metrics.
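The validation step, like the cost workflow’s “>5% drop, roll back” rule, reduces to comparing the re-run evaluation against the baseline. A minimal sketch with illustrative thresholds:

```python
def should_roll_back(baseline_rate: float, new_rate: float,
                     max_drop: float = 0.05) -> bool:
    """Flag a change whose success rate fell more than max_drop below baseline."""
    return (baseline_rate - new_rate) > max_drop

# 0.95 → 0.88 is a 7-point drop: roll back.
# 0.95 → 0.92 is a 3-point drop: within tolerance, keep the change.
```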
Document Guide
Phase 1: Critical Documents
08_claw_code_python.md (START HERE IF USING PYTHON)
“How do I build a harness in Python?”
Guide to building a Python-based AI agent harness using common production patterns from open-source agent frameworks.
Core concepts:
- Dual-layer architecture: Python orchestration + compiled runtime
- Multi-provider LLM support (Claude, OpenAI, Gemini, local Ollama)
- Tool registry patterns + extensible via filesystem
- Model Context Protocol (MCP) integration
Read this if: You’re building in Python, want transparent architecture, need cost optimization, or want to learn agent design.
Estimated time: 1-2 hours to understand + install, 4-6 hours to build basic harness
09_operations_and_observability.md (CRITICAL FOR PRODUCTION)
“How do I monitor, debug, and operate production harnesses?”
This is the missing operations manual. When agents are live, you need visibility into what’s happening. Covers structured logging, metrics, cost tracking, debugging stuck agents, health checks, and graceful degradation.
Read this if: Taking harness to production, setting up monitoring, implementing cost controls, debugging live agents
Estimated time: 2-3 hours to understand, 4-6 hours to implement for your harness
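As one concrete example of the structured logging doc 09 covers, each request can emit a single machine-parseable JSON record. The field names here are illustrative, not a schema from the handbook:

```python
import json
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
logger = logging.getLogger("harness")

def request_log_line(request_id: str, intent: str, latency_ms: float,
                     cost_usd: float, outcome: str) -> str:
    """Build one JSON log record for a completed agent request."""
    return json.dumps({
        "request_id": request_id,
        "intent": intent,
        "latency_ms": round(latency_ms, 1),
        "cost_usd": round(cost_usd, 6),
        "outcome": outcome,
    })

logger.info(request_log_line("req-123", "password_reset", 842.0, 0.0135, "resolved"))
```

Because every line is valid JSON, cost and latency analysis becomes a query over logs rather than a parsing project.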
10_security_and_safety.md (BEFORE PRODUCTION)
“How do I protect my harness from attack and ensure compliance?”
Essential defensive strategies separating production harnesses from prototypes. Covers prompt injection prevention, input/output validation, rate limiting, sandboxing, PII handling, audit logging, and regulatory compliance (GDPR, HIPAA, SOC 2).
Read this if: Building production harnesses, handling sensitive data, meeting regulatory requirements, defending against adversarial attacks
Estimated time: 2-3 hours to understand, 4-8 hours to implement security controls for your harness
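Rate limiting, one of the controls listed above, can be as small as a per-user sliding window. A sketch with assumed names (`RateLimiter` is illustrative, not an API from doc 10):

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most max_requests per user within a sliding time window."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits: dict[str, deque] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = time.monotonic()
        q = self.hits[user_id]
        while q and now - q[0] > self.window:  # drop expired timestamps
            q.popleft()
        if len(q) >= self.max_requests:
            return False
        q.append(now)
        return True
```

A production version would back this with shared storage (e.g. Redis) so limits hold across replicas.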
11_testing_and_qa.md (CRITICAL BEFORE PRODUCTION)
“How do I test harnesses that produce different outputs each time?”
Testing manual for non-deterministic AI systems. LLMs don’t have “pass/fail” tests—they have success rates. Covers non-deterministic testing, regression detection, quality metrics, and pre-deployment validation.
Read this if: Testing your harness before production, detecting quality regressions, ensuring reliability, validating cost projections
Estimated time: 2–3 hours to understand, 1–2 weeks to implement full test infrastructure
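The “success rates, not pass/fail” idea can be expressed directly in a test: run each case several times and assert on the aggregate. The stub `agent` below stands in for a real harness call; the 97% success probability and 90% bar are illustrative:

```python
import random

def agent(prompt: str) -> str:
    """Stub agent with a simulated 97% success probability."""
    return "refund processed" if random.random() < 0.97 else "error"

def success_rate(cases: list[str], runs_per_case: int = 10) -> float:
    """Fraction of (case, run) pairs that succeed across repeated runs."""
    outcomes = [
        agent(case) != "error"
        for case in cases
        for _ in range(runs_per_case)
    ]
    return sum(outcomes) / len(outcomes)

random.seed(0)  # make this demo of a non-deterministic test reproducible
rate = success_rate([f"case-{i}" for i in range(50)])
assert rate >= 0.90, f"success rate {rate:.0%} is below the 90% deployment bar"
```

The same aggregate, recorded per release, is what regression detection compares against.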
Phase 2: Important Documents
12_deployment_patterns.md (FROM TESTING TO PRODUCTION)
“How do I deploy my harness to production at scale?”
Operations manual for taking harnesses from local testing to reliable, scalable production. Covers Docker containerization, Kubernetes orchestration, serverless patterns, CI/CD pipelines, configuration management, scaling strategies, and health monitoring.
Read this if: Deploying harness to production, setting up CI/CD, containerizing Python code, scaling horizontally, implementing health checks
Estimated time: 2–4 hours to understand, 1–2 weeks to implement full deployment pipeline
20_integration_patterns.md (INTEGRATING WITH EXISTING SYSTEMS)
“How do I embed a harness into my application?”
Comprehensive guide to integrating harnesses into production systems. Covers 12 integration patterns: harness as library, REST API, async background jobs, event-driven (Kafka/Pub-Sub), GraphQL, WebSocket/streaming, third-party bots (Slack/Discord/Telegram), database integration, file system access, and monitoring/observability/authentication layers.
Read this if: Embedding harness into existing applications, building APIs, connecting to databases, creating chatbots, implementing event-driven architectures
Estimated time: 2-3 hours to understand, 1-2 weeks to implement integration for your architecture
Foundational Reference Documents
01_foundation_models.md
“What kinds of models exist and when to use each?” [PHASE 1]
LLM vs SLM, multimodal models, training vs inference costs, and when to use each. Key decision: 7B–13B SLM for agent loops, 70B+ LLM for verification.
02_kv_cache_optimization.md
“How do we run longer context efficiently?” [PHASE 1]
KV cache fundamentals and modern optimization techniques: Grouped Query Attention (GQA), PagedAttention (vLLM), INT8/INT4 KV cache quantization, and TurboQuant (3-bit, 6x memory reduction, zero accuracy loss — ICLR 2026).
03_huggingface_ecosystem.md
“Where do I find models and how do I evaluate them?” [PHASE 1]
Finding models on Hugging Face, quantization options (AWQ, GPTQ, 8-bit), and performance trade-offs. Includes decision tree for model selection.
04_memory_systems.md
“How do agents remember, learn, and maintain knowledge?” [PHASE 1]
Four-layer memory architecture, RAG, and Karpathy’s LLM Wiki Pattern (compiled markdown knowledge for bases under ~100 sources). Includes Claude Code’s proven pattern.
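One way to read the “compiled markdown knowledge” idea: at startup, concatenate a small set of topic files into the context instead of running retrieval per query. The file layout and size budget below are illustrative assumptions, not the pattern’s exact mechanics:

```python
from pathlib import Path

def load_wiki_context(wiki_dir: str, max_chars: int = 40_000) -> str:
    """Concatenate markdown topic files until a rough size budget is hit."""
    parts: list[str] = []
    total = 0
    for path in sorted(Path(wiki_dir).glob("*.md")):
        text = path.read_text(encoding="utf-8")
        if total + len(text) > max_chars:
            break  # stay under the startup-token budget
        parts.append(f"## {path.stem}\n{text}")
        total += len(text)
    return "\n\n".join(parts)
```

This trades retrieval infrastructure for a hard cap on knowledge size, which is why the pattern is scoped to small source counts.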
05_ai_agents.md
“How do agents think and make decisions?” [PHASE 1]
Agentic loop definition, reasoning frameworks (ReAct, Tree of Thoughts, Reflexion, etc.), and recommendations for harnesses.
06_harness_architecture.md
“What is a complete harness and how do I build one?” [PHASE 1]
Definition, seven essential components, proven patterns (Single-Agent, Initializer-Executor, Multi-Agent), implementation checklists, and performance optimizations.
07_openclaw_reference.md
“What can we learn from open-source agent architectures?” [REFERENCE ONLY]
Deep dive into common patterns from open-source agent frameworks (file-based tool registry, skill composition, multi-agent coordination). Read after understanding core concepts in 06.
Quick Reference by Domain
For ML Engineers & AI Researchers
“What are weights? How do transformers work? How do I transfer knowledge?”
- Start: 21_model_fundamentals.md — Complete foundations on weights, parameters, neural networks, transformers
- Then: 22_knowledge_transfer_methods.md — Distillation, fine-tuning, LoRA, when to use each
- Systems: 24_hardware_landscape.md — Hardware choices affect training/inference
- Advanced: 02_kv_cache_optimization.md (KV cache techniques), 04_memory_systems.md (knowledge systems)
For Hardware & Systems Engineers
“Should I buy RTX 4090 or MacBook? What’s unified memory? GPU vs CPU vs TPU?”
- Start: 24_hardware_landscape.md — CPU, GPU, TPU, Apple Silicon, mobile chips, unified memory, cost/performance
- Context: 21_model_fundamentals.md (understand what hardware runs)
- Optimization: 02_kv_cache_optimization.md (how hardware acceleration works)
- Production: 09_operations_and_observability.md, 12_deployment_patterns.md
For Roboticists & Embodied AI Engineers
“How do I run AI on robots? What’s the full AI stack for physical systems?”
- Foundations: 21_model_fundamentals.md (how models work)
- Agents: 05_ai_agents.md (agentic loop, decision-making)
- Hardware: 24_hardware_landscape.md (edge inference, mobile chips, power constraints)
- Real-world: Docs 25 & 27 (applications, case studies)
- Systems: 06_harness_architecture.md (orchestration), 09_operations_and_observability.md (telemetry)
For Data Scientists Moving to Production
“I have a model. How do I deploy it and keep it working?”
- Architecture: 06_harness_architecture.md (7 essential components)
- Testing: 11_testing_and_qa.md (non-deterministic testing, quality metrics)
- Ops: 09_operations_and_observability.md (logging, monitoring, cost tracking)
- Security: 10_security_and_safety.md (input validation, PII, compliance)
- Deploy: 12_deployment_patterns.md (Docker, Kubernetes, CI/CD)
For Platform/DevOps Engineers
“How do I operationalize and scale AI systems?”
- Architecture: 06_harness_architecture.md (components, patterns)
- Testing: 11_testing_and_qa.md (quality assurance for non-deterministic systems)
- Ops: 09_operations_and_observability.md (structured logging, metrics, cost tracking)
- Security: 10_security_and_safety.md (hardening, compliance)
- Deploy: 12_deployment_patterns.md (Docker, Kubernetes, CI/CD, scaling)
- Integration: 20_integration_patterns.md (API patterns, event-driven, GraphQL, WebSocket)
Hardware Economics: Why Unified Memory Matters
The unified memory advantage (Apple M-series vs traditional GPUs) is a game-changer for inference:
| Aspect | NVIDIA GPU | Apple M-series |
|---|---|---|
| Memory Architecture | Separate CPU/GPU memory + PCIe bus | CPU + GPU share same memory |
| Data Transfer Overhead | Copy CPU→GPU (slow), compute, copy GPU→CPU | No copying, instant access |
| Practical Impact | 20–40% slower for memory-bound workloads | 20–40% faster for many AI tasks |
| Inference Latency | Higher due to data movement | Lower, especially streaming |
| Best Use Case | High-throughput batch inference | Interactive/streaming inference |
In practice: A 13B model on a MacBook M3 Max (unified memory) can outperform an RTX 4070 for interactive inference despite lower raw TFLOPS, because there is no PCIe bottleneck.
For production decision-making: Treat unified memory as roughly a +20% throughput advantage for inference workloads.
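The “does it fit in memory” question behind this comparison is back-of-envelope arithmetic: weights take roughly params × bytes-per-weight, plus headroom for KV cache and activations. The 1.2× overhead factor below is an assumption for illustration, not a figure from this handbook:

```python
def model_memory_gb(params_billions: float, bits_per_weight: int,
                    overhead: float = 1.2) -> float:
    """Approximate memory needed to serve a model, in GB."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# 13B at 4-bit: ~7.8 GB → fits in 16 GB unified memory with room to spare.
# 13B at FP16 (16-bit): ~31 GB → does not fit on a 16 GB machine.
int4_gb = model_memory_gb(13, 4)
fp16_gb = model_memory_gb(13, 16)
```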
Production Deployment Checklist
Before Week 6 (Production Readiness)
From 11_testing_and_qa.md:
- Baseline established (success rate, latency, cost measured)
- Regression tests configured (comparative metrics vs baseline)
- Smoke test suite passes (basic functionality verified)
- Load tests pass (concurrent request handling verified)
- Pre-deployment security review complete
From 09_operations_and_observability.md:
- Structured JSON logging implemented
- Key metrics configured (latency p50/p95/p99, throughput, cost)
- Cost tracking active (real-time budget enforcement)
- Health checks implemented (model, memory, tools)
- Dashboard/alerting configured
- Loop detection in place (iteration limits + escape strategies)
From 10_security_and_safety.md:
- Input validation for all untrusted sources
- Output filtering (no PII leaks, no dangerous commands)
- Rate limiting configured (per-user/global)
- Audit logging set up (immutable, compliant)
- Secret scanning complete (no hardcoded API keys)
- Compliance review done (GDPR, HIPAA, FTC AI guidance if applicable)
From 12_deployment_patterns.md (optional but recommended):
- Dockerfile created and tested locally
- Docker image builds successfully
- Kubernetes manifests written (if using K8s)
- Health check probes configured (liveness, readiness, startup)
- CI/CD pipeline automated (lint → test → build → deploy)
- Secrets management configured (no env vars in git)
- Canary or blue-green deployment strategy selected
Consolidated Performance Reference
Model Performance Metrics
| Metric | Typical Value | Context |
|---|---|---|
| SLM (7B–13B) Cost | 10–30× cheaper than LLM | Preferred for agent loops |
| Phi-4 7B Throughput | ~40 tokens/sec (RTX 4090) | Instruction-tuned, fast |
| Mistral 7B Throughput | ~35 tokens/sec (RTX 4090) | Good balance of speed/quality |
| LLM (70B+) Speed | 1–5 tokens/sec (RTX 4090) | Use for verification steps |
| First token latency | 50–200ms | Initial computation time |
| Streaming latency | 1–5ms per token | Subsequent tokens (with KV cache) |
Quantization Impact
| Technique | Speedup | Memory Reduction | Accuracy Loss |
|---|---|---|---|
| AWQ 4-bit | 3–4× | ~75% | <0.5% |
| GPTQ 4-bit | 3–4× | ~75% | <0.5% |
| 8-bit Quantization | 2–2.5× | ~50% | <0.1% |
| GQA (KV cache) | 2–4× (attention) | 2–4× (KV cache) | Minimal |
| TurboQuant (3-bit KV) | Up to 8× (attention) | 6× (KV cache) | Zero loss |
| INT4 KV cache | ~3× (attention) | 4× (KV cache) | Small |
Cost Comparison (April 2026 Pricing)
| Model | Cost per 1M tokens | Best For |
|---|---|---|
| Claude 3.5 Sonnet | $3 input, $15 output | Verification, complex reasoning |
| GPT-4o | $5 input, $15 output | Reasoning, multimodal |
| Llama 3.1 70B (API) | $0.75 input, $0.90 output | Fast reasoning, cost-effective |
| Local SLM (self-hosted) | ~$0 (hardware cost) | Cost optimization, privacy |
| Hybrid (80% local, 20% cloud) | Up to 80-90% cheaper than pure cloud (when most requests route locally) | Recommended harness pattern |
Note: Prices approximate as of April 2026. Check provider websites for current rates.
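The hybrid row is straightforward arithmetic: only the cloud-routed fraction of requests costs anything, since the local SLM runs at roughly zero marginal cost. A sketch with an illustrative per-request cloud cost:

```python
def blended_cost_per_1k_requests(cloud_cost_per_request: float,
                                 local_fraction: float) -> float:
    """Cost of 1,000 requests when local_fraction of them run locally at ~$0."""
    return 1_000 * (1.0 - local_fraction) * cloud_cost_per_request

# With an illustrative $0.0135 per cloud request:
pure_cloud = blended_cost_per_1k_requests(0.0135, 0.0)  # $13.50 per 1k requests
hybrid = blended_cost_per_1k_requests(0.0135, 0.8)      # $2.70 → 80% cheaper
```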
Key Insights & Takeaways
For Building Your Harness
- Model choice → 7B–13B SLM for loops, 70B+ LLM for verification, quantize to AWQ 4-bit
- Memory architecture → Four-layer (context/working/persistent/auto-consolidation), LLM Wiki for <400K words, <10K startup tokens
- Agentic loop → Start with ReAct, use Plan-and-Execute for long tasks, add Reflexion for quality
- Harness pattern → Single-agent for bounded tasks, Initializer-Executor for long-running, Multi-agent for complex
- Performance & cost → GQA + KV cache quantization for memory savings, SLMs 10-30x cheaper, AWQ quantization 3-4x speedup
2026 Trends Affecting Harness Design
| Trend | Impact | Action |
|---|---|---|
| SLMs dominate agentic AI | Speed critical | Build loop for speed; verify with LLM |
| KV cache quantization | Longer contexts on same hardware | Use GQA models + INT8/INT4 cache; TurboQuant for 3-bit/6x savings |
| LLM Wiki pattern | Alternative to RAG | Use Karpathy’s markdown wiki for ~100 sources |
| Quantization mainstream | 4-bit standard | Default to AWQ quantization |
| Multi-agent orchestration | Specialization via delegation | Consider hierarchical pattern |
| Auto-dream consolidation | Remember across sessions | Implement auto-consolidation |
Building Your First Harness: Recommended Sequence
Week 1: Foundation & Hardware
- Read 21_model_fundamentals.md (1 hour) — understand weights, parameters, transformers
- Read 24_hardware_landscape.md (1 hour) — understand hardware trade-offs
- Hardware Decision: Choose your development platform:
  - Local development: MacBook M3 (16GB) or RTX 4070 desktop ($600)
  - Cloud: Use appropriate GPU size for your model
  - Edge/Inference: Consider Apple M-series for edge deployment
- Read 01_foundation_models.md (1–2 hours)
- Choose a model from HF (15 min)
- Read 05_ai_agents.md — pick ReAct as framework (1–2 hours)
Week 2: Core Architecture
- Read 06_harness_architecture.md (1–2 hours)
- Implement minimum viable harness with 3–5 tools and ReAct loop
Week 3: Memory & Optimization
- Read 04_memory_systems.md (1–2 hours)
- Implement memory layers (CLAUDE.md, MEMORY.md, topic files)
- Read 02_kv_cache_optimization.md (30 min)
- Enable quantization and KV cache optimization
Week 4: Long-Running Harness (If needed)
- Implement Initializer-Executor split
- Add feature list, progress file, self-verification loop
- Test with realistic long-running task
Week 5: Testing & Quality Assurance
- Read 11_testing_and_qa.md (2–3 hours)
- Build test infrastructure (unit, integration, load tests)
- Establish baselines and regression detection
- Run pre-deployment validation (success rate ≥90%?)
Week 6: Production Readiness (Before Deploy)
- Read 09_operations_and_observability.md (2–3 hours)
- Implement monitoring & logging (JSON logs, metrics, cost tracking)
- Set up cost controls and health checks
- Final deployment checks (tests, logs, alerts, security audit)
Next Steps After Week 6: Advanced Topics & Continuous Learning
Immediate Post-Deployment (First Month)
- Iterate on Observability → Review metrics vs projections, adjust alerting, optimize dashboards
- Security Hardening → Run adversarial testing, review attack patterns, refine rate limiting
- Cost Optimization → Measure actual cost/task, identify expensive patterns, experiment with routing
- Quality Baseline Refinement → Compare actual vs projected success rate, identify failure patterns, tune LLM parameters
Month 2-3: Feature Expansion & Optimization
Choose based on priorities:
- For Cost & Performance: Experiment with KV cache quantization, profile tools, implement caching
- For Reliability at Scale: Implement doc 12 deployment patterns, set up canary deployments, build rollback automation
- For Advanced Reasoning: Try alternative frameworks (ToT, Reflexion), multi-agent patterns, confidence scoring
- For Knowledge Management: Evaluate wiki pattern scaling, hybrid RAG, knowledge versioning
Month 3+: Production Maturity
- Operational Excellence: Runbooks, decision trees, post-mortem process, automated remediation
- Monitoring & Analytics: Continuous benchmarking, quality dashboards, long-term trend tracking
- Advanced Security: Red-teaming, explainability, compliance automation
- Team & Process: Knowledge transfer, playbooks, on-call runbooks
Quarterly Updates
- Review CORPUS_AUDIT.md for improvements and new patterns
- Check model landscape (new SLMs, quantization techniques)
- Re-run baselines on latest models
- Review spending trends and optimization opportunities
Reference: Recommended Tools & Frameworks
| Purpose | Recommendation | Why |
|---|---|---|
| LLM | Claude (Anthropic) | Best safety, reasoning, tool use |
| Open Model | Llama 3 (7B–70B) | Proven, widely deployed |
| SLM | Phi-4 or Mistral 7B | Optimized for instruction-following |
| Quantization | AWQ (4-bit) | Best quality/speed trade-off |
| Memory | Markdown files + git | Human-readable, version-controlled |
| Reasoning Loop | ReAct | Simplest, fastest, proven |
| Testing | pytest + custom harness | Multiple-run tests for non-deterministic systems |
| Monitoring | Prometheus/Datadog | Metrics collection and alerting |
| Logging | Structured JSON | Cost, errors, performance analysis |
| Deployment | Docker + K8s/Serverless | Depends on scale and complexity |
Questions & Next Steps
For terminology help
- Term unclear? → See `glossary.md` for 125+ definitions covering foundations, hardware, and applications with usage context
For implementation help
- Implementation checklist? → `06_harness_architecture.md`
- Tool integration? → `05_ai_agents.md` Tools section
- Memory architecture? → `04_memory_systems.md`
For understanding gaps
- Concept unclear? → Links at end of each document
- Model selection stuck? → Flow chart in `03_huggingface_ecosystem.md`
- Reasoning framework choice? → Comparison in `05_ai_agents.md`
To validate your harness
- Has all 7 components? → `06_harness_architecture.md`
- Memory properly layered? → `04_memory_systems.md`
- Using proven pattern? → `05_ai_agents.md`
- Optimal model size? → `01_foundation_models.md` + `03_huggingface_ecosystem.md`
Before production deployment
- Testing complete? → `11_testing_and_qa.md` checklist
- Observability implemented? → `09_operations_and_observability.md`
- Cost tracking working? → Cost section in `09`
- Health checks ready? → Health checks section in `09`
- Security hardened? → `10_security_and_safety.md` checklist
- Deployment automated? → `12_deployment_patterns.md` (Docker, K8s, CI/CD)
For specific production scenarios
- Agent stuck in loop? → Debugging in `09_operations_and_observability.md`
- Cost exceeding budget? → Cost section in `09`
- Security concern? → Attack vectors in `10_security_and_safety.md`
- Test results inconsistent? → Non-deterministic testing in `11_testing_and_qa.md`
Changelog & Source Attribution
- April 2026: Expanded to AI/ML Engineering Handbook
- Handbook restructure: Renamed from “Harness Corpus” to “AI/ML Engineering Handbook (With Harness Focus)”
- New Parts structure: Foundations (21-22), Hardware (23-24, 26, 28), Harnesses (01-20), Applications (25, 27)
- New documents: 21 (Model Fundamentals), 22 (Knowledge Transfer), 24 (Hardware Landscape)
- Expanded glossary: 75+ → 125+ terms covering foundations, hardware, systems
- New quick reference: Domain-specific learning paths (ML engineers, hardware engineers, roboticists, platform engineers)
- Hardware decision guide: How to choose development/deployment hardware
- Hardware economics section: Unified memory advantage explanation
- New role-based learning paths: Hardware engineers, ML engineers, roboticists
- Updated building sequence: Week 1 now includes hardware selection
- Original April 2026: Created corpus + improvements
- Original 8 documents (01–08)
- CORPUS_AUDIT.md: Comprehensive gap analysis
- New documents: 09 (Operations), 10 (Security), 11 (Testing), 12 (Deployment)
- New glossary: 75+ terms defined with context
- Index improvements: Phase labeling, consolidated metrics, clearer paths, post-deployment guidance
- KV cache optimization techniques (GQA, PagedAttention, INT8/INT4, TurboQuant — Google Research Blog, ICLR 2026)
- LLM Wiki pattern (compiled markdown knowledge — Karpathy’s LLM Wiki Gist, April 2026)
- Claude Code memory architecture (Anthropic)
- Open-source agent framework patterns
For citations and detailed sources, see individual document footers.
See Also
- Doc 01 (Foundation Models) — Understand what models are available and when to use each; essential context for building your harness
- Doc 06 (Harness Architecture) — Learn the seven components of a complete system; start here after understanding models
- Doc 09 (Operations & Observability) — Master monitoring and debugging before deploying anything to production; part of critical Phase 1
- Doc 21 (Model Fundamentals) — Dive deeper into how neural networks and transformers actually work; for researchers and those wanting deeper understanding