The Harness Handbook — Start Here
Master navigation: learning paths, role-based guides, and goal-based workflows for the complete AI/ML Engineering Handbook.
A comprehensive guide to AI/ML engineering across foundations, hardware, agent harnesses, and real-world applications. Whether you’re building models, choosing hardware, designing agents, or deploying to production, this handbook provides the knowledge to succeed.
Status: April 2026 | Covers KV cache optimization (GQA, PagedAttention, TurboQuant), Karpathy’s LLM Wiki pattern, and production harness architecture
Handbook Structure: Four Parts
The handbook is organized into four major parts, each serving different needs:
Part 1: AI/ML Foundations (Docs 21-22)
Understand how AI models work at a fundamental level.
- Doc 21: Model Fundamentals — Weights, parameters, neural networks, transformers, training
- Doc 22: Knowledge Transfer Methods — Distillation, fine-tuning, LoRA, RAG
Part 2: Hardware & Systems (Docs 23-24, 26, 28)
Choose the right hardware and understand system-level concerns.
- Doc 24: Hardware Landscape — CPU vs GPU, NVIDIA, Apple Silicon, mobile chips, cost/performance trade-offs
- Doc 23: Apple Intelligence & CoreML — on-device inference on Apple hardware
- Doc 26: Cross-platform inference — ONNX export, TensorRT, TF Lite
- Doc 28: Unified Memory Economics — Apple M-series advantages
Part 3: Agent Harnesses (Docs 01-20)
Design, build, and deploy production AI agent systems.
- Foundation models, agents, memory, security, testing, deployment
- Reference implementations and patterns
- This is the original core curriculum
Part 4: Real-World Applications (Docs 25, 27)
Learn how to apply harnesses to specific domains and use cases.
- Doc 25: Edge & Physical AI
- Doc 27: Real-World AI Applications
What Happened? Handbook Improvements (April 18, 2026)
This handbook was restructured and expanded from the original “Harness Corpus” into a comprehensive AI/ML handbook. Key improvements:
- Expanded glossary (glossary.md) — 125+ terms covering foundations, hardware, and applications
- New Part 1 & 2 — Model fundamentals and hardware landscape added
- Three critical foundation documents — Operations & Observability (09), Security & Safety (10), Testing & QA (11)
- Deployment guide (12_deployment_patterns.md) — Docker, Kubernetes, CI/CD patterns
- Single primary path — Clearer “Recommended Path” with alternatives clearly marked
- Phase labeling — Documents marked as Phase 1 (Critical) or Phase 2 (Important)
- Production deployment checklist — Clear pre-deployment validation steps
- Consolidated performance metrics — One reference table for all benchmarks
- “Next Steps After Week 6” — Advanced topics and continuous learning
- New “Quick Reference by Domain” — Find answers by role and goal
Quick Navigation
New to these terms? → See glossary.md for definitions of Agent, Harness, LLM, Token, ReAct, and 120+ other key concepts.
Recommended Primary Path
Choose based on your background:
Option A: Fast Track (For Practitioners, 5-6 weeks)
1. 01_foundation_models.md — What models exist and when to use them
2. 05_ai_agents.md — How agents think and decide (ReAct, Chain-of-Thought, Tree of Thoughts, etc.)
3. 06_harness_architecture.md — Seven components of a complete harness
4. 08_claw_code_python.md — Reference implementation in Python
5. 04_memory_systems.md — Memory systems and the LLM Wiki pattern (compiled markdown knowledge)
6. 02_kv_cache_optimization.md — Optimization for long contexts
Then proceed to “Production Deployment” below.
Option B: Foundations-First (For Researchers, 7-8 weeks)
1. 21_model_fundamentals.md — How neural networks, transformers, and weights actually work
2. 22_knowledge_transfer_methods.md — Distillation, fine-tuning, LoRA, RAG (theoretical foundations)
3. 01_foundation_models.md — Practical model selection
4. 05_ai_agents.md — Agentic reasoning frameworks
5. 06_harness_architecture.md — Harness design
6. 24_hardware_landscape.md — Hardware understanding (CPU, GPU, Neural Engines, cost/performance)
7. Rest of Option A
Then, choose your path based on your needs:
For Production Deployment (After step 4 in Option A or step 6 in Option B):
→ 11_testing_and_qa.md (establish quality baselines) [PHASE 1 - CRITICAL]
→ 09_operations_and_observability.md (monitoring & debugging) [PHASE 1 - CRITICAL]
→ 10_security_and_safety.md (security hardening) [PHASE 1 - CRITICAL]
→ 12_deployment_patterns.md (containerization & orchestration) [PHASE 2 - Important]
Alternative Starting Points
If you’re building in Python right now → Jump directly to 08_claw_code_python.md, then revisit foundational docs as needed
If you only care about optimization → 02_kv_cache_optimization.md + 03_huggingface_ecosystem.md (quantization specifics)
If you’re researching agents → 05_ai_agents.md → 04_memory_systems.md → then reference implementations in 08
Quick lookup by topic:
- What are weights and how do neural networks work? → 21_model_fundamentals.md (weights, parameters, neurons, layers, transformers)
- What’s the difference between distillation and fine-tuning? → 22_knowledge_transfer_methods.md
- Should I buy RTX 4090 or MacBook Pro? → 24_hardware_landscape.md (GPU vs CPU, Apple Silicon, unified memory, cost/performance)
- How do I choose a model? → 01_foundation_models.md + 03_huggingface_ecosystem.md
- How do I optimize inference? → 02_kv_cache_optimization.md (GQA, PagedAttention, KV cache quantization, TurboQuant)
- How do I build memory systems? → 04_memory_systems.md (includes Karpathy’s LLM Wiki pattern for compiled markdown knowledge)
- How do agents work? → 05_ai_agents.md (ReAct, ToT, Reflexion frameworks)
- What are the components of a harness? → 06_harness_architecture.md
- I’m building in Python, where do I start? → 08_claw_code_python.md (installation + patterns)
- How do I monitor & debug production harnesses? → 09_operations_and_observability.md (logging, metrics, cost tracking, debugging) [PHASE 1]
- How do I protect my harness from attack? → 10_security_and_safety.md (injection, validation, rate limiting, compliance) [PHASE 1]
- How do I test my harness before production? → 11_testing_and_qa.md (non-deterministic testing, regression, quality metrics) [PHASE 1]
- How do I deploy my harness to production? → 12_deployment_patterns.md (Docker, Kubernetes, CI/CD, scaling) [PHASE 2]
- How do I embed a harness into my application? → 20_integration_patterns.md (REST API, background jobs, events, GraphQL, WebSocket, Slack/Discord bots) [PHASE 2]
By role…
Software Engineer Building Python Harness (Use Option A: Fast Track)
- Phase 1 (Critical): 08 (Claw-Code), 05 (Agents), 06 (Architecture), 04 (Memory), 11 (Testing), 09 (Monitoring), 10 (Security)
- Phase 2: 12 (Deployment), 20 (Integration), 03 (HF ecosystem), 02 (Optimization)
- Optional Reference: 01 (Models), 21 (Fundamentals) — read when you need theoretical grounding
ML Engineer / ML Researcher (Use Option B: Foundations-First)
- Phase 1 Foundations: 21 (Model fundamentals), 22 (Knowledge transfer methods), 24 (Hardware landscape)
- Phase 2 Applications: 01 (Foundation models), 03 (Hugging Face), 04 (Memory systems), 02 (KV cache)
- Deep dive topics: KV cache techniques in doc 02, LLM Wiki pattern in 04, quantization details in 03
- Build production: 05 (Agents) → 06 (Architecture) → 08 (Implementation) → 09-11 (Quality gates)
Hardware Engineer / Systems Builder
- Week 1: 24 (Hardware landscape) — CPU vs GPU, Apple Silicon, mobile chips, unified memory, cost/performance analysis
- Week 2: 21 (Model fundamentals), 02 (KV cache), 01 (Model selection) — understand how hardware impacts inference
- Week 3: 28 (Unified memory economics) — deep dive into Apple M-series advantages
- Week 4: 09 (Observability), 12 (Deployment) — production concerns
- Focus areas: VRAM requirements, thermal management, cost per inference, memory bandwidth
Roboticist / Physical AI Engineer
- Foundations: 21 (Model fundamentals), 24 (Hardware) — especially mobile/edge sections
- Agentic systems: 05 (Agents), 06 (Harness architecture)
- On-device inference: 23 (Apple Intelligence & CoreML), 25 (Edge & Physical AI)
- Real-world applications: 27 (Real-world AI applications — section on robotics/autonomous vehicles)
- Production setup: 09 (Observability for robot telemetry), 12 (Deployment)
- Then build: 08 (Implementation), 04 (Memory for robot decision-making)
Learning Agentic AI from Scratch (Complete Beginner)
- Start with fundamentals: Option B path (21 → 22 → 01 → 05 → 06 → 08)
- Then master architecture: 04 (Memory) → 02 (Optimization) → 03 (Model ecosystem)
- Then build for production: 11 (Testing) → 09 (Monitoring) → 10 (Security) → 12 (Deployment)
- Skip until ready: 07 (Open-source agent architectures) — only needed after understanding core concepts
- Reference: glossary.md for any unfamiliar terms
Product Manager / Architecture Decision Maker
- Understand capabilities: 01 (Foundation models), 05 (Reasoning frameworks) — why harnesses make sense
- Understand architecture: 06 (Seven harness components), 08 (Python implementation)
- Cost decisions: 13 (Cost management) + 24 (Hardware landscape) — understand potential cost savings with hybrid routing
- Risk management: 10 (Security & safety), 17 (Regulatory & ethics)
- Key insights: Cost (SLMs 10-30× cheaper), Speed (agent loops 100-1000× faster), Capability (LLMs for verification)
Wave 4 Documents: When to Read These
Wave 4 (Docs 21-28) are the “AI/ML Foundations & Hardware” section. They’re recommended for everyone eventually, but timing matters:
| Your Background | When to Read Wave 4 | Priority |
|---|---|---|
| Software engineer (practicing) | After understanding harnesses (docs 01-08), read 21-22 when you want deeper model knowledge | Optional |
| ML engineer (doing research) | Read first (21-22, 24), before diving into harness-specific docs | Critical |
| Hardware specialist | Start with doc 24, then 21-22, then understand inference implications | Critical |
| Roboticist | Read 21 (fundamentals) and 24 (hardware), then read 25 for applications | Important |
| DevOps/SRE | Read 24 (hardware) and 13 (cost management); skip 21-22 unless optimizing models | Optional |
| Complete beginner | Follow “Option B: Foundations-First” path (uses all Wave 4 sequentially) | Critical |
Quick decisions:
- “I want to understand how models work” → 21 (Model Fundamentals) + 22 (Knowledge Transfer)
- “I want to choose hardware” → 24 (Hardware Landscape) + 28 (Unified Memory Economics)
- “I want to apply to real domains” → 25 (Edge & Physical AI) + 27 (Real-World Applications)
- “I want everything” → Follow Option B path in “Recommended Primary Path” above
Common Workflows: “I want to…”
Goal-based navigation for specific outcomes. Each workflow shows the doc sequence, estimated time, and what you’ll achieve.
“I want to build a customer support bot” (2-3 weeks)
Week 1: Understand + Design
01 → Choose SLM for triage, LLM for escalation (2h)
05 → Learn ReAct loop, build intent classifier (3h)
06 → Design harness: tools (ticket API, KB search), memory (session), loop (ReAct) (3h)
08 → Clone starter harness, install dependencies, first working loop (4h)
Week 2: Build + Test
04 → Implement session memory + persistent FAQ knowledge base (3h)
15 → Design system prompt: "You are a support agent. Classify, answer, or escalate." (2h)
11 → Test with 50+ sample conversations, measure success rate (4h)
10 → Add input validation, rate limiting, PII filtering (3h)
Week 3: Deploy + Monitor
09 → Add structured logging: ticket_id, intent, resolution, cost (3h)
12 → Dockerize, deploy to staging, run integration tests (4h)
13 → Set up cost tracking: cost per ticket, daily budget alerts (2h)
18 → Create runbook: "Agent stuck in loop", "High cost alert" (1h)
Result: Production-ready support bot with monitoring, security, and cost controls.
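The ReAct loop at the heart of this workflow can be sketched in a few lines of Python. Everything below is illustrative: `call_model`, the `ACTION:`/`FINAL:` message protocol, and the single `search_kb` tool are stand-ins for a real LLM client and tool registry, not APIs from doc 08.

```python
def search_kb(query: str) -> str:
    """Hypothetical knowledge-base search tool."""
    return f"KB results for: {query}"

TOOLS = {"search_kb": search_kb}

def call_model(history: list[str]) -> str:
    """Stand-in for an LLM call; a real harness would hit a model API here."""
    if any("KB results" in entry for entry in history):
        return "FINAL: Reset your password via the account page."
    return "ACTION: search_kb: password reset"

def react_loop(question: str, max_iterations: int = 5) -> str:
    """Reason → act → observe until the model emits a final answer."""
    history = [f"QUESTION: {question}"]
    for _ in range(max_iterations):  # iteration cap doubles as loop detection
        step = call_model(history)
        if step.startswith("FINAL:"):
            return step.removeprefix("FINAL:").strip()
        _, tool_name, tool_arg = (part.strip() for part in step.split(":", 2))
        history.append(f"OBSERVATION: {TOOLS[tool_name](tool_arg)}")
    return "Escalated: iteration limit reached."
```

The iteration cap is the same "loop detection" safeguard the production checklist later calls for.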
“I want to reduce my harness costs by 50%” (1 week)
Day 1: Understand Current Costs
13 → Implement token counting if not done, measure baseline (3h)
09 → Review logs: which operations burn the most tokens? (2h)
Day 2: Quick Wins (typically saves 30-50%)
02 → Enable KV cache quantization (GQA, INT8/INT4) (1h)
01 → Switch to SLM for simple tasks (classification, routing) (2h)
15 → Shorten system prompts, remove redundant instructions (1h)
Day 3: Deeper Optimizations
03 → Quantize model (INT4 or INT8) for faster, cheaper inference (2h)
14 → Add caching for repeated queries, memoize tool results (3h)
04 → Trim memory: only load what's needed per session (1h)
Day 4-5: Validate + Monitor
13 → Compare new vs old cost per request, set budget alerts (2h)
11 → Regression test: did quality drop? If >5% drop, roll back that change (3h)
Result: 40-70% cost reduction while maintaining 90%+ quality. Typical savings: $2K-$10K/month.
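The token-counting step in Day 1 amounts to simple arithmetic over per-token rates. A minimal sketch, using the per-1M-token prices from the cost table later in this handbook (check live pricing before relying on these numbers):

```python
# Per-request cost tracker sketch; rates mirror this handbook's cost table.
RATES = {  # (input $/1M tokens, output $/1M tokens)
    "claude-3.5-sonnet": (3.00, 15.00),
    "gpt-4o": (5.00, 15.00),
    "local-slm": (0.00, 0.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request given token counts and per-1M rates."""
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A 2,000-token prompt with a 500-token reply on Claude 3.5 Sonnet:
cost = request_cost("claude-3.5-sonnet", 2_000, 500)  # → $0.0135
```

Summing these per-request costs per day gives the baseline the budget alerts compare against.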
“I want to deploy to production safely” (1-2 weeks)
Pre-Flight (Days 1-3)
11 → Run quality tests: 50+ test cases, success rate ≥90% (4h)
10 → Security audit: input validation, prompt injection defense, rate limiting (3h)
09 → Implement structured logging: every request, every error, every cost (3h)
17 → Compliance check: GDPR data handling, audit trail, user consent (2h)
Deployment (Days 4-7)
12 → Dockerize application, write K8s manifests or serverless config (4h)
12 → Deploy to staging, run smoke tests (2h)
09 → Connect monitoring: dashboards, alerts, on-call rotation (3h)
13 → Set production cost budgets and alerts (1h)
Go-Live (Days 8-10)
12 → Deploy to production with canary (10% traffic) (2h)
18 → Prepare runbook: common failures and response procedures (2h)
09 → Monitor first 48 hours: latency, errors, cost, quality (ongoing)
13 → Review first week costs vs projections (1h)
Result: Production deployment with monitoring, security, cost controls, and incident procedures.
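The 10% canary in the go-live step can be implemented with deterministic hash-based bucketing, so each user consistently sees one version across a session. A sketch under assumed names (`assign_version` is illustrative, not an API from doc 12):

```python
import hashlib

def assign_version(user_id: str, canary_fraction: float = 0.10) -> str:
    """Deterministically bucket a user into 'canary' or 'stable'."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_fraction * 100 else "stable"

# The same user always lands in the same bucket, so sessions stay consistent
# while roughly canary_fraction of all users hit the new version.
```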
“I want to deploy AI on edge devices” (2-3 weeks)
Week 1: Foundations
21 → Understand model architecture, what can be compressed (3h)
24 → Choose hardware: phone chip, Raspberry Pi, custom board (3h)
22 → Learn distillation (shrink cloud model → edge model) (3h)
03 → Find quantized models (GGUF, INT4) for your hardware (2h)
Week 2: Implementation
25 → Edge deployment patterns, latency budgets, power constraints (4h)
23 → If Apple: CoreML conversion, Neural Engine optimization (4h)
26 → If cross-platform: ONNX export, TensorRT or TF Lite (4h)
02 → KV cache optimization for limited memory (2h)
Week 3: Integration + Testing
06 → Design harness for edge: minimal memory, fast loop, local tools (3h)
28 → Unified memory math: how much model fits on your device? (2h)
11 → Test on actual hardware: latency, accuracy, battery life (4h)
27 → Real-world deployment patterns from robotics, automotive, IoT (3h)
Result: Working model on edge hardware with optimized latency and power consumption.
“I want to understand agentic AI from scratch” (4-6 weeks)
Week 1-2: Theory
21 → Neural networks, transformers, weights, training fundamentals (5h)
22 → Knowledge transfer: distillation, fine-tuning, RAG (4h)
01 → Foundation models: LLM vs SLM, when to use each (2h)
05 → Agent frameworks: CoT, ReAct, Tree of Thoughts, Reflexion (4h)
Week 3: Architecture
06 → Seven components of a harness (3h)
04 → Memory systems: four layers, RAG, LLM Wiki pattern (3h)
02 → KV cache and inference optimization (2h)
03 → Hugging Face ecosystem: finding and evaluating models (2h)
Week 4-5: Build
08 → Clone Claw-Code, build your first agent (6h)
14 → Advanced patterns: tool composition, state machines, caching (4h)
15 → Prompt engineering: make your agent smarter (3h)
24 → Hardware: understand GPU/CPU trade-offs for your setup (2h)
Week 6: Production
11 → Testing non-deterministic systems (3h)
09 → Operations and observability (3h)
10 → Security hardening (2h)
12 → Deployment patterns (2h)
Result: Complete understanding of agentic AI from theory through production deployment.
“I want to improve my agent’s quality” (3-5 days)
Day 1: Measure Current Quality
16 → Set up evaluation framework: accuracy, relevance, coherence metrics (3h)
11 → Run baseline tests: 50+ cases, record success rate (2h)
Day 2: Improve Prompts
15 → Rewrite system prompt with few-shot examples (2h)
05 → Check: is your reasoning framework right for the task? (1h)
14 → Add self-correction loop or verification step (2h)
Day 3: Improve Knowledge
04 → Review memory system: is the right context loaded? (2h)
22 → Consider fine-tuning if domain-specific accuracy < 85% (3h)
Day 4-5: Validate
16 → Re-run evaluation, compare to baseline (2h)
11 → Regression test on original tasks (still works?) (2h)
13 → Check: did quality improvements increase cost? Worth it? (1h)
Result: Measured quality improvement with clear before/after metrics.
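The validation step, like the cost workflow’s “>5% drop, roll back” rule, reduces to comparing the re-run evaluation against the baseline. A minimal sketch with illustrative thresholds:

```python
def should_roll_back(baseline_rate: float, new_rate: float,
                     max_drop: float = 0.05) -> bool:
    """Flag a change whose success rate fell more than max_drop below baseline."""
    return (baseline_rate - new_rate) > max_drop

# 0.95 → 0.88 is a 7-point drop: roll back.
# 0.95 → 0.92 is a 3-point drop: within tolerance, keep the change.
```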
Document Guide
Phase 1: Critical Documents
08_claw_code_python.md (START HERE IF USING PYTHON)
“How do I build a harness in Python?”
Guide to building a Python-based AI agent harness using common production patterns from open-source agent frameworks.
Core concepts:
- Dual-layer architecture: Python orchestration + compiled runtime
- Multi-provider LLM support (Claude, OpenAI, Gemini, local Ollama)
- Tool registry patterns + extensible via filesystem
- Model Context Protocol (MCP) integration
Read this if: You’re building in Python, want transparent architecture, need cost optimization, or want to learn agent design.
Estimated time: 1-2 hours to understand + install, 4-6 hours to build basic harness
09_operations_and_observability.md (CRITICAL FOR PRODUCTION)
“How do I monitor, debug, and operate production harnesses?”
This is the missing operations manual. When agents are live, you need visibility into what’s happening. Covers structured logging, metrics, cost tracking, debugging stuck agents, health checks, and graceful degradation.
Read this if: Taking harness to production, setting up monitoring, implementing cost controls, debugging live agents
Estimated time: 2-3 hours to understand, 4-6 hours to implement for your harness
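As one concrete example of the structured logging doc 09 covers, each request can emit a single machine-parseable JSON record. The field names here are illustrative, not a schema from the handbook:

```python
import json
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
logger = logging.getLogger("harness")

def request_log_line(request_id: str, intent: str, latency_ms: float,
                     cost_usd: float, outcome: str) -> str:
    """Build one JSON log record for a completed agent request."""
    return json.dumps({
        "request_id": request_id,
        "intent": intent,
        "latency_ms": round(latency_ms, 1),
        "cost_usd": round(cost_usd, 6),
        "outcome": outcome,
    })

logger.info(request_log_line("req-123", "password_reset", 842.0, 0.0135, "resolved"))
```

Because every line is valid JSON, cost and latency analysis becomes a query over logs rather than a parsing project.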
10_security_and_safety.md (BEFORE PRODUCTION)
“How do I protect my harness from attack and ensure compliance?”
Essential defensive strategies separating production harnesses from prototypes. Covers prompt injection prevention, input/output validation, rate limiting, sandboxing, PII handling, audit logging, and regulatory compliance (GDPR, HIPAA, SOC 2).
Read this if: Building production harnesses, handling sensitive data, meeting regulatory requirements, defending against adversarial attacks
Estimated time: 2-3 hours to understand, 4-8 hours to implement security controls for your harness
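Rate limiting, one of the controls listed above, can be as small as a per-user sliding window. A sketch with assumed names (`RateLimiter` is illustrative, not an API from doc 10):

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most max_requests per user within a sliding time window."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits: dict[str, deque] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = time.monotonic()
        q = self.hits[user_id]
        while q and now - q[0] > self.window:  # drop expired timestamps
            q.popleft()
        if len(q) >= self.max_requests:
            return False
        q.append(now)
        return True
```

A production version would back this with shared storage (e.g. Redis) so limits hold across replicas.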
11_testing_and_qa.md (CRITICAL BEFORE PRODUCTION)
“How do I test harnesses that produce different outputs each time?”
Testing manual for non-deterministic AI systems. LLMs don’t have “pass/fail” tests—they have success rates. Covers non-deterministic testing, regression detection, quality metrics, and pre-deployment validation.
Read this if: Testing your harness before production, detecting quality regressions, ensuring reliability, validating cost projections
Estimated time: 2–3 hours to understand, 1–2 weeks to implement full test infrastructure
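The “success rates, not pass/fail” idea can be expressed directly in a test: run each case several times and assert on the aggregate. The stub `agent` below stands in for a real harness call; the 97% success probability and 90% bar are illustrative:

```python
import random

def agent(prompt: str) -> str:
    """Stub agent with a simulated 97% success probability."""
    return "refund processed" if random.random() < 0.97 else "error"

def success_rate(cases: list[str], runs_per_case: int = 10) -> float:
    """Fraction of (case, run) pairs that succeed across repeated runs."""
    outcomes = [
        agent(case) != "error"
        for case in cases
        for _ in range(runs_per_case)
    ]
    return sum(outcomes) / len(outcomes)

random.seed(0)  # make this demo of a non-deterministic test reproducible
rate = success_rate([f"case-{i}" for i in range(50)])
assert rate >= 0.90, f"success rate {rate:.0%} is below the 90% deployment bar"
```

The same aggregate, recorded per release, is what regression detection compares against.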
Phase 2: Important Documents
12_deployment_patterns.md (FROM TESTING TO PRODUCTION)
“How do I deploy my harness to production at scale?”
Operations manual for taking harnesses from local testing to reliable, scalable production. Covers Docker containerization, Kubernetes orchestration, serverless patterns, CI/CD pipelines, configuration management, scaling strategies, and health monitoring.
Read this if: Deploying harness to production, setting up CI/CD, containerizing Python code, scaling horizontally, implementing health checks
Estimated time: 2–4 hours to understand, 1–2 weeks to implement full deployment pipeline
20_integration_patterns.md (INTEGRATING WITH EXISTING SYSTEMS)
“How do I embed a harness into my application?”
Comprehensive guide to integrating harnesses into production systems. Covers 12 integration patterns: harness as library, REST API, async background jobs, event-driven (Kafka/Pub-Sub), GraphQL, WebSocket/streaming, third-party bots (Slack/Discord/Telegram), database integration, file system access, and monitoring/observability/authentication layers.
Read this if: Embedding harness into existing applications, building APIs, connecting to databases, creating chatbots, implementing event-driven architectures
Estimated time: 2-3 hours to understand, 1-2 weeks to implement integration for your architecture
Foundational Reference Documents
01_foundation_models.md
“What kinds of models exist and when to use each?” [PHASE 1]
LLM vs SLM, multimodal models, training vs inference costs, and when to use each. Key decision: 7B–13B SLM for agent loops, 70B+ LLM for verification.
02_kv_cache_optimization.md
“How do we run longer context efficiently?” [PHASE 1]
KV cache fundamentals and modern optimization techniques: Grouped Query Attention (GQA), PagedAttention (vLLM), INT8/INT4 KV cache quantization, and TurboQuant (3-bit, 6x memory reduction, zero accuracy loss — ICLR 2026).
03_huggingface_ecosystem.md
“Where do I find models and how do I evaluate them?” [PHASE 1]
Finding models on Hugging Face, quantization options (AWQ, GPTQ, 8-bit), and performance trade-offs. Includes decision tree for model selection.
04_memory_systems.md
“How do agents remember, learn, and maintain knowledge?” [PHASE 1]
Four-layer memory architecture, RAG, and Karpathy’s LLM Wiki Pattern (compiled markdown knowledge for bases under ~100 sources). Includes Claude Code’s proven pattern.
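One way to read the “compiled markdown knowledge” idea: at startup, concatenate a small set of topic files into the context instead of running retrieval per query. The file layout and size budget below are illustrative assumptions, not the pattern’s exact mechanics:

```python
from pathlib import Path

def load_wiki_context(wiki_dir: str, max_chars: int = 40_000) -> str:
    """Concatenate markdown topic files until a rough size budget is hit."""
    parts: list[str] = []
    total = 0
    for path in sorted(Path(wiki_dir).glob("*.md")):
        text = path.read_text(encoding="utf-8")
        if total + len(text) > max_chars:
            break  # stay under the startup-token budget
        parts.append(f"## {path.stem}\n{text}")
        total += len(text)
    return "\n\n".join(parts)
```

This trades retrieval infrastructure for a hard cap on knowledge size, which is why the pattern is scoped to small source counts.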
05_ai_agents.md
“How do agents think and make decisions?” [PHASE 1]
Agentic loop definition, reasoning frameworks (ReAct, Tree of Thoughts, Reflexion, etc.), and recommendations for harnesses.
06_harness_architecture.md
“What is a complete harness and how do I build one?” [PHASE 1]
Definition, seven essential components, proven patterns (Single-Agent, Initializer-Executor, Multi-Agent), implementation checklists, and performance optimizations.
07_openclaw_reference.md
“What can we learn from open-source agent architectures?” [REFERENCE ONLY]
Deep dive into common patterns from open-source agent frameworks (file-based tool registry, skill composition, multi-agent coordination). Read after understanding core concepts in 06.
Quick Reference by Domain
For ML Engineers & AI Researchers
“What are weights? How do transformers work? How do I transfer knowledge?”
- Start: 21_model_fundamentals.md — Complete foundations on weights, parameters, neural networks, transformers
- Then: 22_knowledge_transfer_methods.md — Distillation, fine-tuning, LoRA, when to use each
- Systems: 24_hardware_landscape.md — Hardware choices affect training/inference
- Advanced: 02_kv_cache_optimization.md (KV cache techniques), 04_memory_systems.md (knowledge systems)
For Hardware & Systems Engineers
“Should I buy RTX 4090 or MacBook? What’s unified memory? GPU vs CPU vs TPU?”
- Start: 24_hardware_landscape.md — CPU, GPU, TPU, Apple Silicon, mobile chips, unified memory, cost/performance
- Context: 21_model_fundamentals.md (understand what hardware runs)
- Optimization: 02_kv_cache_optimization.md (how hardware acceleration works)
- Production: 09_operations_and_observability.md, 12_deployment_patterns.md
For Roboticists & Embodied AI Engineers
“How do I run AI on robots? What’s the full AI stack for physical systems?”
- Foundations: 21_model_fundamentals.md (how models work)
- Agents: 05_ai_agents.md (agentic loop, decision-making)
- Hardware: 24_hardware_landscape.md (edge inference, mobile chips, power constraints)
- Real-world: Docs 25 & 27 (applications, case studies)
- Systems: 06_harness_architecture.md (orchestration), 09_operations_and_observability.md (telemetry)
For Data Scientists Moving to Production
“I have a model. How do I deploy it and keep it working?”
- Architecture: 06_harness_architecture.md (7 essential components)
- Testing: 11_testing_and_qa.md (non-deterministic testing, quality metrics)
- Ops: 09_operations_and_observability.md (logging, monitoring, cost tracking)
- Security: 10_security_and_safety.md (input validation, PII, compliance)
- Deploy: 12_deployment_patterns.md (Docker, Kubernetes, CI/CD)
For Platform/DevOps Engineers
“How do I operationalize and scale AI systems?”
- Architecture: 06_harness_architecture.md (components, patterns)
- Testing: 11_testing_and_qa.md (quality assurance for non-deterministic systems)
- Ops: 09_operations_and_observability.md (structured logging, metrics, cost tracking)
- Security: 10_security_and_safety.md (hardening, compliance)
- Deploy: 12_deployment_patterns.md (Docker, Kubernetes, CI/CD, scaling)
- Integration: 20_integration_patterns.md (API patterns, event-driven, GraphQL, WebSocket)
Hardware Economics: Why Unified Memory Matters
The unified memory advantage (Apple M-series vs traditional GPUs) is a game-changer for inference:
| Aspect | NVIDIA GPU | Apple M-series |
|---|---|---|
| Memory Architecture | Separate CPU/GPU memory + PCIe bus | CPU + GPU share same memory |
| Data Transfer Overhead | Copy CPU→GPU (slow), compute, copy GPU→CPU | No copying, instant access |
| Practical Impact | 20–40% slower for memory-bound workloads | 20–40% faster for many AI tasks |
| Inference Latency | Higher due to data movement | Lower, especially streaming |
| Best Use Case | High-throughput batch inference | Interactive/streaming inference |
In practice: A 13B model on a MacBook M3 Max (unified memory) can outperform an RTX 4070 for interactive inference despite lower raw TFLOPS, because there is no PCIe bottleneck.
For production decision-making: Treat unified memory as roughly a +20% throughput advantage for inference workloads.
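The “does it fit in memory” question behind this comparison is back-of-envelope arithmetic: weights take roughly params × bytes-per-weight, plus headroom for KV cache and activations. The 1.2× overhead factor below is an assumption for illustration, not a figure from this handbook:

```python
def model_memory_gb(params_billions: float, bits_per_weight: int,
                    overhead: float = 1.2) -> float:
    """Approximate memory needed to serve a model, in GB."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# 13B at 4-bit: ~7.8 GB → fits in 16 GB unified memory with room to spare.
# 13B at FP16 (16-bit): ~31 GB → does not fit on a 16 GB machine.
int4_gb = model_memory_gb(13, 4)
fp16_gb = model_memory_gb(13, 16)
```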
Production Deployment Checklist
Before Week 6 (Production Readiness)
From 11_testing_and_qa.md:
- Baseline established (success rate, latency, cost measured)
- Regression tests configured (comparative metrics vs baseline)
- Smoke test suite passes (basic functionality verified)
- Load tests pass (concurrent request handling verified)
- Pre-deployment security review complete
From 09_operations_and_observability.md:
- Structured JSON logging implemented
- Key metrics configured (latency p50/p95/p99, throughput, cost)
- Cost tracking active (real-time budget enforcement)
- Health checks implemented (model, memory, tools)
- Dashboard/alerting configured
- Loop detection in place (iteration limits + escape strategies)
From 10_security_and_safety.md:
- Input validation for all untrusted sources
- Output filtering (no PII leaks, no dangerous commands)
- Rate limiting configured (per-user/global)
- Audit logging set up (immutable, compliant)
- Secret scanning complete (no hardcoded API keys)
- Compliance review done (GDPR, HIPAA, FTC AI guidance if applicable)
From 12_deployment_patterns.md (optional but recommended):
- Dockerfile created and tested locally
- Docker image builds successfully
- Kubernetes manifests written (if using K8s)
- Health check probes configured (liveness, readiness, startup)
- CI/CD pipeline automated (lint → test → build → deploy)
- Secrets management configured (no env vars in git)
- Canary or blue-green deployment strategy selected
Consolidated Performance Reference
Model Performance Metrics
| Metric | Typical Value | Context |
|---|---|---|
| SLM (7B–13B) Cost | 10–30× cheaper than LLM | Preferred for agent loops |
| Phi-4 7B Throughput | ~40 tokens/sec (RTX 4090) | Instruction-tuned, fast |
| Mistral 7B Throughput | ~35 tokens/sec (RTX 4090) | Good balance of speed/quality |
| LLM (70B+) Speed | 1–5 tokens/sec (RTX 4090) | Use for verification steps |
| First token latency | 50–200ms | Initial computation time |
| Streaming latency | 1–5ms per token | Subsequent tokens (with KV cache) |
Quantization Impact
| Technique | Speedup | Memory Reduction | Accuracy Loss |
|---|---|---|---|
| AWQ 4-bit | 3–4× | ~75% | <0.5% |
| GPTQ 4-bit | 3–4× | ~75% | <0.5% |
| 8-bit Quantization | 2–2.5× | ~50% | <0.1% |
| GQA (KV cache) | 2–4× (attention) | 2–4× (KV cache) | Minimal |
| TurboQuant (3-bit KV) | Up to 8× (attention) | 6× (KV cache) | Zero loss |
| INT4 KV cache | ~3× (attention) | 4× (KV cache) | Small |
Cost Comparison (April 2026 Pricing)
| Model | Cost per 1M tokens | Best For |
|---|---|---|
| Claude 3.5 Sonnet | $3 input, $15 output | Verification, complex reasoning |
| GPT-4o | $5 input, $15 output | Reasoning, multimodal |
| Llama 3.1 70B (API) | $0.75 input, $0.90 output | Fast reasoning, cost-effective |
| Local SLM (self-hosted) | ~$0 (hardware cost) | Cost optimization, privacy |
| Hybrid (80% local, 20% cloud) | Up to 80-90% cheaper than pure cloud (when most requests route locally) | Recommended harness pattern |
Note: Prices approximate as of April 2026. Check provider websites for current rates.
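The hybrid row is straightforward arithmetic: only the cloud-routed fraction of requests costs anything, since the local SLM runs at roughly zero marginal cost. A sketch with an illustrative per-request cloud cost:

```python
def blended_cost_per_1k_requests(cloud_cost_per_request: float,
                                 local_fraction: float) -> float:
    """Cost of 1,000 requests when local_fraction of them run locally at ~$0."""
    return 1_000 * (1.0 - local_fraction) * cloud_cost_per_request

# With an illustrative $0.0135 per cloud request:
pure_cloud = blended_cost_per_1k_requests(0.0135, 0.0)  # $13.50 per 1k requests
hybrid = blended_cost_per_1k_requests(0.0135, 0.8)      # $2.70 → 80% cheaper
```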
Key Insights & Takeaways
For Building Your Harness
- Model choice → 7B–13B SLM for loops, 70B+ LLM for verification, quantize to AWQ 4-bit
- Memory architecture → Four-layer (context/working/persistent/auto-consolidation), LLM Wiki for <400K words, <10K startup tokens
- Agentic loop → Start with ReAct, use Plan-and-Execute for long tasks, add Reflexion for quality
- Harness pattern → Single-agent for bounded tasks, Initializer-Executor for long-running, Multi-agent for complex
- Performance & cost → GQA + KV cache quantization for memory savings, SLMs 10-30x cheaper, AWQ quantization 3-4x speedup
2026 Trends Affecting Harness Design
| Trend | Impact | Action |
|---|---|---|
| SLMs dominate agentic AI | Speed critical | Build loop for speed; verify with LLM |
| KV cache quantization | Longer contexts on same hardware | Use GQA models + INT8/INT4 cache; TurboQuant for 3-bit/6x savings |
| LLM Wiki pattern | Alternative to RAG | Use Karpathy’s markdown wiki for ~100 sources |
| Quantization mainstream | 4-bit standard | Default to AWQ quantization |
| Multi-agent orchestration | Specialization via delegation | Consider hierarchical pattern |
| Auto-dream consolidation | Remember across sessions | Implement auto-consolidation |
Building Your First Harness: Recommended Sequence
Week 1: Foundation & Hardware
- Read 21_model_fundamentals.md (1 hour) — understand weights, parameters, transformers
- Read 24_hardware_landscape.md (1 hour) — understand hardware trade-offs
- Hardware Decision: Choose your development platform:
  - Local development: MacBook M3 (16GB) or RTX 4070 desktop ($600)
  - Cloud: Use appropriate GPU size for your model
  - Edge/Inference: Consider Apple M-series for edge deployment
- Read 01_foundation_models.md (1–2 hours)
- Choose a model from HF (15 min)
- Read 05_ai_agents.md — pick ReAct as framework (1–2 hours)
Week 2: Core Architecture
- Read 06_harness_architecture.md (1–2 hours)
- Implement minimum viable harness with 3–5 tools and ReAct loop
Week 3: Memory & Optimization
- Read 04_memory_systems.md (1–2 hours)
- Implement memory layers (CLAUDE.md, MEMORY.md, topic files)
- Read 02_kv_cache_optimization.md (30 min)
- Enable quantization and KV cache optimization
Week 4: Long-Running Harness (If needed)
- Implement Initializer-Executor split
- Add feature list, progress file, self-verification loop
- Test with realistic long-running task
Week 5: Testing & Quality Assurance
- Read 11_testing_and_qa.md (2–3 hours)
- Build test infrastructure (unit, integration, load tests)
- Establish baselines and regression detection
- Run pre-deployment validation (success rate ≥90%?)
Week 6: Production Readiness (Before Deploy)
- Read 09_operations_and_observability.md (2–3 hours)
- Implement monitoring & logging (JSON logs, metrics, cost tracking)
- Set up cost controls and health checks
- Final deployment checks (tests, logs, alerts, security audit)
Next Steps After Week 6: Advanced Topics & Continuous Learning
Immediate Post-Deployment (First Month)
- Iterate on Observability → Review metrics vs projections, adjust alerting, optimize dashboards
- Security Hardening → Run adversarial testing, review attack patterns, refine rate limiting
- Cost Optimization → Measure actual cost/task, identify expensive patterns, experiment with routing
- Quality Baseline Refinement → Compare actual vs projected success rate, identify failure patterns, tune LLM parameters
Month 2-3: Feature Expansion & Optimization
Choose based on priorities:
- For Cost & Performance: Experiment with KV cache quantization, profile tools, implement caching
- For Reliability at Scale: Implement doc 12 deployment patterns, set up canary deployments, build rollback automation
- For Advanced Reasoning: Try alternative frameworks (ToT, Reflexion), multi-agent patterns, confidence scoring
- For Knowledge Management: Evaluate wiki pattern scaling, hybrid RAG, knowledge versioning
Month 3+: Production Maturity
- Operational Excellence: Runbooks, decision trees, post-mortem process, automated remediation
- Monitoring & Analytics: Continuous benchmarking, quality dashboards, long-term trend tracking
- Advanced Security: Red-teaming, explainability, compliance automation
- Team & Process: Knowledge transfer, playbooks, on-call runbooks
Quarterly Updates
- Review CORPUS_AUDIT.md for improvements and new patterns
- Check model landscape (new SLMs, quantization techniques)
- Re-run baselines on latest models
- Review spending trends and optimization opportunities
Reference: Recommended Tools & Frameworks
| Purpose | Recommendation | Why |
|---|---|---|
| LLM | Claude (Anthropic) | Best safety, reasoning, tool use |
| Open Model | Llama 3 (7B–70B) | Proven, widely deployed |
| SLM | Phi-4 or Mistral 7B | Optimized for instruction-following |
| Quantization | AWQ (4-bit) | Best quality/speed trade-off |
| Memory | Markdown files + git | Human-readable, version-controlled |
| Reasoning Loop | ReAct | Simplest, fastest, proven |
| Testing | pytest + custom harness | Multiple-run tests for non-deterministic systems |
| Monitoring | Prometheus/Datadog | Metrics collection and alerting |
| Logging | Structured JSON | Cost, errors, performance analysis |
| Deployment | Docker + K8s/Serverless | Depends on scale and complexity |
Questions & Next Steps
For terminology help
- Term unclear? → See `glossary.md` for 125+ definitions covering foundations, hardware, and applications with usage context
For implementation help
- Implementation checklist? → `06_harness_architecture.md`
- Tool integration? → `05_ai_agents.md` Tools section
- Memory architecture? → `04_memory_systems.md`
For understanding gaps
- Concept unclear? → Links at end of each document
- Model selection stuck? → Flow chart in `03_huggingface_ecosystem.md`
- Reasoning framework choice? → Comparison in `05_ai_agents.md`
To validate your harness
- Has all 7 components? → `06_harness_architecture.md`
- Memory properly layered? → `04_memory_systems.md`
- Using proven pattern? → `05_ai_agents.md`
- Optimal model size? → `01_foundation_models.md` + `03_huggingface_ecosystem.md`
Before production deployment
- Testing complete? → `11_testing_and_qa.md` checklist
- Observability implemented? → `09_operations_and_observability.md`
- Cost tracking working? → Cost section in `09`
- Health checks ready? → Health checks section in `09`
- Security hardened? → `10_security_and_safety.md` checklist
- Deployment automated? → `12_deployment_patterns.md` (Docker, K8s, CI/CD)
For specific production scenarios
- Agent stuck in loop? → Debugging in `09_operations_and_observability.md`
- Cost exceeding budget? → Cost section in `09`
- Security concern? → Attack vectors in `10_security_and_safety.md`
- Test results inconsistent? → Non-deterministic testing in `11_testing_and_qa.md`
Changelog & Source Attribution
- April 2026: Expanded to AI/ML Engineering Handbook
- Handbook restructure: Renamed from “Harness Corpus” to “AI/ML Engineering Handbook (With Harness Focus)”
- New Parts structure: Foundations (21-22), Hardware (23-24, 26, 28), Harnesses (01-20), Applications (25, 27)
- New documents: 21 (Model Fundamentals), 22 (Knowledge Transfer), 24 (Hardware Landscape)
- Expanded glossary: 75+ → 125+ terms covering foundations, hardware, systems
- New quick reference: Domain-specific learning paths (ML engineers, hardware engineers, roboticists, platform engineers)
- Hardware decision guide: How to choose development/deployment hardware
- Hardware economics section: Unified memory advantage explanation
- New role-based learning paths: Hardware engineers, ML engineers, roboticists
- Updated building sequence: Week 1 now includes hardware selection
- Original April 2026: Created corpus + improvements
- Original 8 documents (01–08)
- CORPUS_AUDIT.md: Comprehensive gap analysis
- New documents: 09 (Operations), 10 (Security), 11 (Testing), 12 (Deployment)
- New glossary: 75+ terms defined with context
- Index improvements: Phase labeling, consolidated metrics, clearer paths, post-deployment guidance
- KV cache optimization techniques (GQA, PagedAttention, INT8/INT4, TurboQuant — Google Research Blog, ICLR 2026)
- LLM Wiki pattern (compiled markdown knowledge — Karpathy’s LLM Wiki Gist, April 2026)
- Claude Code memory architecture (Anthropic)
- Open-source agent framework patterns
For citations and detailed sources, see individual document footers.
See Also
- Doc 01 (Foundation Models) — Understand what models are available and when to use each; essential context for building your harness
- Doc 06 (Harness Architecture) — Learn the seven components of a complete system; start here after understanding models
- Doc 09 (Operations & Observability) — Master monitoring and debugging before deploying anything to production; part of critical Phase 1
- Doc 21 (Model Fundamentals) — Dive deeper into how neural networks and transformers actually work; for researchers and those wanting deeper understanding