
The Harness Handbook — Start Here

Master navigation, learning paths, role-based guides, and goal-based workflows for the complete AI/ML Engineering Handbook.

A comprehensive guide to AI/ML engineering across foundations, hardware, agent harnesses, and real-world applications. Whether you’re building models, choosing hardware, designing agents, or deploying to production, this handbook provides the knowledge to succeed.

Status: April 2026 | Covers KV cache optimization (GQA, PagedAttention, TurboQuant), Karpathy’s LLM Wiki pattern, and production harness architecture


Handbook Structure: Four Parts

The handbook is organized into four major parts, each serving different needs:

Part 1: AI/ML Foundations (Docs 21-22)

Understand how AI models work at a fundamental level.

  • Doc 21: Model Fundamentals — Weights, parameters, neural networks, transformers, training
  • Doc 22: Knowledge Transfer Methods — Distillation, fine-tuning, LoRA, RAG

Part 2: Hardware & Systems (Docs 23-24, 26, 28)

Choose the right hardware and understand system-level concerns.

  • Doc 24: Hardware Landscape — CPU vs GPU, NVIDIA, Apple Silicon, mobile chips, cost/performance trade-offs
  • Docs 23, 26, 28: (Additional hardware and systems topics)

Part 3: Agent Harnesses (Docs 01-20)

Design, build, and deploy production AI agent systems.

  • Foundation models, agents, memory, security, testing, deployment
  • Reference implementations and patterns
  • This is the original core curriculum

Part 4: Real-World Applications (Docs 25, 27)

Learn how to apply harnesses to specific domains and use cases.


What Happened? Handbook Improvements (April 18, 2026)

This handbook was restructured and expanded from the “Harness Corpus” into a comprehensive AI/ML handbook. Key improvements:

  • Expanded glossary (glossary.md) — 125+ terms covering foundations, hardware, and applications
  • New Part 1 & 2 — Model fundamentals and hardware landscape added
  • Three critical foundation documents — Operations & Observability (09), Security & Safety (10), Testing & QA (11)
  • Deployment guide (12_deployment_patterns.md) — Docker, Kubernetes, CI/CD patterns
  • Single primary path — A clearer “Recommended Path” with alternatives explicitly marked
  • Phase labeling — Documents marked as Phase 1 (Critical) or Phase 2 (Important)
  • Production deployment checklist — Clear pre-deployment validation steps
  • Consolidated performance metrics — One reference table for all benchmarks
  • “Next Steps After Week 6” — Advanced topics and continuous learning
  • New “Quick Reference by Domain” — Find answers by role and goal

Quick Navigation

New to these terms? → See glossary.md for definitions of Agent, Harness, LLM, Token, ReAct, and 120+ other key concepts.

Choose based on your background:

Option A: Fast Track (For Practitioners, 5-6 weeks)

  1. 01_foundation_models.md — What models exist and when to use them
  2. 05_ai_agents.md — How agents think and decide (ReAct, Chain-of-Thought, Tree of Thoughts, etc.)
  3. 06_harness_architecture.md — Seven components of a complete harness
  4. 08_claw_code_python.md — Reference implementation in Python
  5. 04_memory_systems.md — Memory systems and the LLM Wiki pattern (compiled markdown knowledge)
  6. 02_kv_cache_optimization.md — Optimization for long contexts

Then proceed to “Production Deployment” below.

Option B: Foundations-First (For Researchers, 7-8 weeks)

  1. 21_model_fundamentals.md — How neural networks, transformers, and weights actually work
  2. 22_knowledge_transfer_methods.md — Distillation, fine-tuning, LoRA, RAG (theoretical foundations)
  3. 01_foundation_models.md — Practical model selection
  4. 05_ai_agents.md — Agentic reasoning frameworks
  5. 06_harness_architecture.md — Harness design
  6. 24_hardware_landscape.md — Hardware understanding (CPU, GPU, Neural Engines, cost/performance)
  7. Rest of Option A

Then, choose your path based on your needs:

For Production Deployment (after step 4 in Option A or step 6 in Option B): 11_testing_and_qa.md (establish quality baselines) [PHASE 1 - CRITICAL] → 09_operations_and_observability.md (monitoring & debugging) [PHASE 1 - CRITICAL] → 10_security_and_safety.md (security hardening) [PHASE 1 - CRITICAL] → 12_deployment_patterns.md (containerization & orchestration) [PHASE 2 - Important]

Alternative Starting Points

If you’re building in Python right now → Jump directly to 08_claw_code_python.md, then revisit foundational docs as needed

If you only care about optimization → 02_kv_cache_optimization.md + 03_huggingface_ecosystem.md (quantization specifics)

If you’re researching agents → 05_ai_agents.md → 04_memory_systems.md → then reference implementations in 08

Quick lookup by topic:

  • What are weights and how do neural networks work? → 21_model_fundamentals.md (weights, parameters, neurons, layers, transformers)
  • What’s the difference between distillation and fine-tuning? → 22_knowledge_transfer_methods.md
  • Should I buy RTX 4090 or MacBook Pro? → 24_hardware_landscape.md (GPU vs CPU, Apple Silicon, unified memory, cost/performance)
  • How do I choose a model? → 01_foundation_models.md + 03_huggingface_ecosystem.md
  • How do I optimize inference? → 02_kv_cache_optimization.md (GQA, PagedAttention, KV cache quantization, TurboQuant)
  • How do I build memory systems? → 04_memory_systems.md (includes Karpathy’s LLM Wiki pattern for compiled markdown knowledge)
  • How do agents work? → 05_ai_agents.md (ReAct, ToT, Reflexion frameworks)
  • What are the components of a harness? → 06_harness_architecture.md
  • I’m building in Python, where do I start? → 08_claw_code_python.md (installation + patterns)
  • How do I monitor & debug production harnesses? → 09_operations_and_observability.md (logging, metrics, cost tracking, debugging) [PHASE 1]
  • How do I protect my harness from attack? → 10_security_and_safety.md (injection, validation, rate limiting, compliance) [PHASE 1]
  • How do I test my harness before production? → 11_testing_and_qa.md (non-deterministic testing, regression, quality metrics) [PHASE 1]
  • How do I deploy my harness to production? → 12_deployment_patterns.md (Docker, Kubernetes, CI/CD, scaling) [PHASE 2]
  • How do I embed a harness into my application? → 20_integration_patterns.md (REST API, background jobs, events, GraphQL, WebSocket, Slack/Discord bots) [PHASE 2]

By role…

Software Engineer Building Python Harness (Use Option A: Fast Track)

  • Phase 1 (Critical): 08 (Claw-Code), 05 (Agents), 06 (Architecture), 04 (Memory), 11 (Testing), 09 (Monitoring), 10 (Security)
  • Phase 2: 12 (Deployment), 20 (Integration), 03 (HF ecosystem), 02 (Optimization)
  • Optional Reference: 01 (Models), 21 (Fundamentals) — read when you need theoretical grounding

ML Engineer / ML Researcher (Use Option B: Foundations-First)

  • Phase 1 Foundations: 21 (Model fundamentals), 22 (Knowledge transfer methods), 24 (Hardware landscape)
  • Phase 2 Applications: 01 (Foundation models), 03 (Hugging Face), 04 (Memory systems), 02 (KV cache)
  • Deep dive topics: KV cache techniques in doc 02, LLM Wiki pattern in 04, quantization details in 03
  • Build production: 05 (Agents) → 06 (Architecture) → 08 (Implementation) → 09-11 (Quality gates)

Hardware Engineer / Systems Builder

  • Week 1: 24 (Hardware landscape) — CPU vs GPU, Apple Silicon, mobile chips, unified memory, cost/performance analysis
  • Week 2: 21 (Model fundamentals), 02 (KV cache), 01 (Model selection) — understand how hardware impacts inference
  • Week 3: 28 (Unified memory economics) — deep dive into Apple M-series advantages
  • Week 4: 09 (Observability), 12 (Deployment) — production concerns
  • Focus areas: VRAM requirements, thermal management, cost per inference, memory bandwidth

Roboticist / Physical AI Engineer

  • Foundations: 21 (Model fundamentals), 24 (Hardware) — especially mobile/edge sections
  • Agentic systems: 05 (Agents), 06 (Harness architecture)
  • On-device inference: 23 (Apple Intelligence & CoreML), 25 (Edge & Physical AI)
  • Real-world applications: 27 (Real-world AI applications — section on robotics/autonomous vehicles)
  • Production setup: 09 (Observability for robot telemetry), 12 (Deployment)
  • Then build: 08 (Implementation), 04 (Memory for robot decision-making)

Learning Agentic AI from Scratch (Complete Beginner)

  • Start with fundamentals: Option B path (21 → 22 → 01 → 05 → 06 → 08)
  • Then master architecture: 04 (Memory) → 02 (Optimization) → 03 (Model ecosystem)
  • Then build for production: 11 (Testing) → 09 (Monitoring) → 10 (Security) → 12 (Deployment)
  • Skip until ready: 07 (Open-source agent architectures) — only needed after understanding core concepts
  • Reference: glossary.md for any unfamiliar terms

Product Manager / Architecture Decision Maker

  • Understand capabilities: 01 (Foundation models), 05 (Reasoning frameworks) — why harnesses make sense
  • Understand architecture: 06 (Seven harness components), 08 (Python implementation)
  • Cost decisions: 13 (Cost management) + 24 (Hardware landscape) — understand potential cost savings with hybrid routing
  • Risk management: 10 (Security & safety), 17 (Regulatory & ethics)
  • Key insights: Cost (SLMs 10-30× cheaper), Speed (agent loops 100-1000× faster), Capability (LLMs for verification)

Wave 4 Documents: When to Read These

Wave 4 (Docs 21-28) are the “AI/ML Foundations & Hardware” section. They’re recommended for everyone eventually, but timing matters:

| Your Background | When to Read Wave 4 | Priority |
|---|---|---|
| Software engineer (practicing) | After understanding harnesses (docs 01-08), read 21-22 when you want deeper model knowledge | Optional |
| ML engineer (doing research) | Read first (21-22, 24), before diving into harness-specific docs | Critical |
| Hardware specialist | Start with doc 24, then 21-22, then understand inference implications | Critical |
| Roboticist | Read 21 (fundamentals) and 24 (hardware), then read 25 for applications | Important |
| DevOps/SRE | Read 24 (hardware) and 13 (cost management); skip 21-22 unless optimizing models | Optional |
| Complete beginner | Follow “Option B: Foundations-First” path (uses all Wave 4 sequentially) | Critical |

Quick decisions:

  • “I want to understand how models work” → 21 (Model Fundamentals) + 22 (Knowledge Transfer)
  • “I want to choose hardware” → 24 (Hardware Landscape) + 28 (Unified Memory Economics)
  • “I want to apply to real domains” → 25 (Edge & Physical AI) + 27 (Real-World Applications)
  • “I want everything” → Follow Option B path in “Recommended Primary Path” above

Common Workflows: “I want to…”

Goal-based navigation for specific outcomes. Each workflow shows the doc sequence, estimated time, and what you’ll achieve.

“I want to build a customer support bot” (2-3 weeks)

Week 1: Understand + Design
  01 → Choose SLM for triage, LLM for escalation (2h)
  05 → Learn ReAct loop, build intent classifier (3h)
  06 → Design harness: tools (ticket API, KB search), memory (session), loop (ReAct) (3h)
  08 → Clone starter harness, install dependencies, first working loop (4h)

Week 2: Build + Test
  04 → Implement session memory + persistent FAQ knowledge base (3h)
  15 → Design system prompt: "You are a support agent. Classify, answer, or escalate." (2h)
  11 → Test with 50+ sample conversations, measure success rate (4h)
  10 → Add input validation, rate limiting, PII filtering (3h)

Week 3: Deploy + Monitor
  09 → Add structured logging: ticket_id, intent, resolution, cost (3h)
  12 → Dockerize, deploy to staging, run integration tests (4h)
  13 → Set up cost tracking: cost per ticket, daily budget alerts (2h)
  18 → Create runbook: "Agent stuck in loop", "High cost alert" (1h)

Result: Production-ready support bot with monitoring, security, and cost controls.
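The Week 3 structured-logging step can be sketched as one JSON line per handled ticket. The field names here (ticket_id, intent, resolution, cost_usd) follow the example in the workflow above but are illustrative, not a schema the handbook mandates:

```python
import json
import logging
import sys

# Plain-message formatter so each log record is exactly one JSON object per line.
logger = logging.getLogger("support_bot")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def log_ticket_event(ticket_id: str, intent: str,
                     resolution: str, cost_usd: float) -> str:
    """Emit one structured JSON log line per handled ticket."""
    entry = json.dumps({
        "ticket_id": ticket_id,
        "intent": intent,
        "resolution": resolution,
        "cost_usd": round(cost_usd, 4),
    })
    logger.info(entry)
    return entry

line = log_ticket_event("T-1042", "billing_question", "answered", 0.0031)
```

Because every line is valid JSON, cost-per-ticket and intent breakdowns become simple `jq` or log-aggregator queries later.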


“I want to reduce my harness costs by 50%” (1 week)

Day 1: Understand Current Costs
  13 → Implement token counting if not done, measure baseline (3h)
  09 → Review logs: which operations burn the most tokens? (2h)

Day 2: Quick Wins (typically saves 30-50%)
  02 → Enable KV cache quantization (GQA, INT8/INT4) (1h)
  01 → Switch to SLM for simple tasks (classification, routing) (2h)
  15 → Shorten system prompts, remove redundant instructions (1h)

Day 3: Deeper Optimizations
  03 → Quantize model (INT4 or INT8) for faster, cheaper inference (2h)
  14 → Add caching for repeated queries, memoize tool results (3h)
  04 → Trim memory: only load what's needed per session (1h)

Day 4-5: Validate + Monitor
  13 → Compare new vs old cost per request, set budget alerts (2h)
  11 → Regression test: did quality drop? If >5% drop, roll back that change (3h)

Result: 40-70% cost reduction while maintaining 90%+ quality. Typical savings: $2K-$10K/month.
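The Day 3 “memoize tool results” step can be as simple as caching deterministic tool calls in-process. `kb_search` is a hypothetical knowledge-base tool used only to illustrate the pattern; it only works when identical queries return identical results:

```python
import functools

call_count = 0  # counts real (uncached) tool invocations

# Hypothetical tool: a knowledge-base lookup whose results rarely change,
# so repeated queries can be served from cache at zero token cost.
@functools.lru_cache(maxsize=1024)
def kb_search(query: str) -> str:
    global call_count
    call_count += 1  # each real call would cost tokens and latency
    return f"results for: {query}"

kb_search("refund policy")    # miss: hits the real tool
kb_search("refund policy")    # hit: served from cache
kb_search("shipping times")   # miss
```

For tools whose results go stale (ticket status, inventory), swap in a TTL cache instead of `lru_cache`.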


“I want to deploy to production safely” (1-2 weeks)

Pre-Flight (Days 1-3)
  11 → Run quality tests: 50+ test cases, success rate ≥90% (4h)
  10 → Security audit: input validation, prompt injection defense, rate limiting (3h)
  09 → Implement structured logging: every request, every error, every cost (3h)
  17 → Compliance check: GDPR data handling, audit trail, user consent (2h)

Deployment (Days 4-7)
  12 → Dockerize application, write K8s manifests or serverless config (4h)
  12 → Deploy to staging, run smoke tests (2h)
  09 → Connect monitoring: dashboards, alerts, on-call rotation (3h)
  13 → Set production cost budgets and alerts (1h)

Go-Live (Days 8-10)
  12 → Deploy to production with canary (10% traffic) (2h)
  18 → Prepare runbook: common failures and response procedures (2h)
  09 → Monitor first 48 hours: latency, errors, cost, quality (ongoing)
  13 → Review first week costs vs projections (1h)

Result: Production deployment with monitoring, security, cost controls, and incident procedures.


“I want to deploy AI on edge devices” (2-3 weeks)

Week 1: Foundations
  21 → Understand model architecture, what can be compressed (3h)
  24 → Choose hardware: phone chip, Raspberry Pi, custom board (3h)
  22 → Learn distillation (shrink cloud model → edge model) (3h)
  03 → Find quantized models (GGUF, INT4) for your hardware (2h)

Week 2: Implementation
  25 → Edge deployment patterns, latency budgets, power constraints (4h)
  23 → If Apple: CoreML conversion, Neural Engine optimization (4h)
  26 → If cross-platform: ONNX export, TensorRT or TF Lite (4h)
  02 → KV cache optimization for limited memory (2h)

Week 3: Integration + Testing
  06 → Design harness for edge: minimal memory, fast loop, local tools (3h)
  28 → Unified memory math: how much model fits on your device? (2h)
  11 → Test on actual hardware: latency, accuracy, battery life (4h)
  27 → Real-world deployment patterns from robotics, automotive, IoT (3h)

Result: Working model on edge hardware with optimized latency and power consumption.
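The Week 3 “unified memory math” boils down to weights plus KV cache. A rough back-of-envelope estimator, assuming an illustrative GQA-style layer/head layout (real models vary, and activations plus runtime overhead are ignored):

```python
def model_memory_gb(params_b: float, bits_per_weight: int,
                    layers: int, kv_heads: int, head_dim: int,
                    context_len: int, kv_bits: int = 16) -> float:
    """Rough memory estimate in GB: quantized weights + KV cache.

    KV cache = 2 (K and V) * layers * kv_heads * head_dim * context_len
               * bytes per element.
    """
    weights_bytes = params_b * 1e9 * bits_per_weight / 8
    kv_bytes = 2 * layers * kv_heads * head_dim * context_len * kv_bits / 8
    return (weights_bytes + kv_bytes) / 1e9

# Illustrative numbers: a 7B model at INT4 with 32 layers,
# 8 KV heads of dim 128 (GQA), 8K context, FP16 KV cache.
print(round(model_memory_gb(7, 4, 32, 8, 128, 8192), 2))  # → 4.57
```

If the result exceeds your device’s usable memory, the levers are (in order of pain): shorter context, lower KV cache precision, lower weight precision, smaller model.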


“I want to understand agentic AI from scratch” (4-6 weeks)

Week 1-2: Theory
  21 → Neural networks, transformers, weights, training fundamentals (5h)
  22 → Knowledge transfer: distillation, fine-tuning, RAG (4h)
  01 → Foundation models: LLM vs SLM, when to use each (2h)
  05 → Agent frameworks: CoT, ReAct, Tree of Thoughts, Reflexion (4h)

Week 3: Architecture
  06 → Seven components of a harness (3h)
  04 → Memory systems: four layers, RAG, LLM Wiki pattern (3h)
  02 → KV cache and inference optimization (2h)
  03 → Hugging Face ecosystem: finding and evaluating models (2h)

Week 4-5: Build
  08 → Clone Claw-Code, build your first agent (6h)
  14 → Advanced patterns: tool composition, state machines, caching (4h)
  15 → Prompt engineering: make your agent smarter (3h)
  24 → Hardware: understand GPU/CPU trade-offs for your setup (2h)

Week 6: Production
  11 → Testing non-deterministic systems (3h)
  09 → Operations and observability (3h)
  10 → Security hardening (2h)
  12 → Deployment patterns (2h)

Result: Complete understanding of agentic AI from theory through production deployment.


“I want to improve my agent’s quality” (3-5 days)

Day 1: Measure Current Quality
  16 → Set up evaluation framework: accuracy, relevance, coherence metrics (3h)
  11 → Run baseline tests: 50+ cases, record success rate (2h)

Day 2: Improve Prompts
  15 → Rewrite system prompt with few-shot examples (2h)
  05 → Check: is your reasoning framework right for the task? (1h)
  14 → Add self-correction loop or verification step (2h)

Day 3: Improve Knowledge
  04 → Review memory system: is the right context loaded? (2h)
  22 → Consider fine-tuning if domain-specific accuracy < 85% (3h)

Day 4-5: Validate
  16 → Re-run evaluation, compare to baseline (2h)
  11 → Regression test on original tasks (still works?) (2h)
  13 → Check: did quality improvements increase cost? Worth it? (1h)

Result: Measured quality improvement with clear before/after metrics.


Document Guide

Phase 1: Critical Documents

08_claw_code_python.md (START HERE IF USING PYTHON)

“How do I build a harness in Python?”

Guide to building a Python-based AI agent harness using common production patterns from open-source agent frameworks.

Core concepts:

  • Dual-layer architecture: Python orchestration + compiled runtime
  • Multi-provider LLM support (Claude, OpenAI, Gemini, local Ollama)
  • Tool registry patterns + extensible via filesystem
  • Model Context Protocol (MCP) integration

Read this if: You’re building in Python, want transparent architecture, need cost optimization, or want to learn agent design.

Estimated time: 1-2 hours to understand + install, 4-6 hours to build basic harness


09_operations_and_observability.md (CRITICAL FOR PRODUCTION)

“How do I monitor, debug, and operate production harnesses?”

This is the missing operations manual. When agents are live, you need visibility into what’s happening. Covers structured logging, metrics, cost tracking, debugging stuck agents, health checks, and graceful degradation.

Read this if: Taking harness to production, setting up monitoring, implementing cost controls, debugging live agents

Estimated time: 2-3 hours to understand, 4-6 hours to implement for your harness


10_security_and_safety.md (BEFORE PRODUCTION)

“How do I protect my harness from attack and ensure compliance?”

Essential defensive strategies separating production harnesses from prototypes. Covers prompt injection prevention, input/output validation, rate limiting, sandboxing, PII handling, audit logging, and regulatory compliance (GDPR, HIPAA, SOC 2).

Read this if: Building production harnesses, handling sensitive data, meeting regulatory requirements, defending against adversarial attacks

Estimated time: 2-3 hours to understand, 4-8 hours to implement security controls for your harness


11_testing_and_qa.md (CRITICAL BEFORE PRODUCTION)

“How do I test harnesses that produce different outputs each time?”

Testing manual for non-deterministic AI systems. LLMs don’t have “pass/fail” tests—they have success rates. Covers non-deterministic testing, regression detection, quality metrics, and pre-deployment validation.

Read this if: Testing your harness before production, detecting quality regressions, ensuring reliability, validating cost projections

Estimated time: 2–3 hours to understand, 1–2 weeks to implement full test infrastructure
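The success-rate idea can be sketched as a repeated-run test with a gate instead of a single pass/fail assertion. `run_agent` here is a deterministic stub standing in for a real harness call:

```python
_calls = 0

def run_agent(case: str) -> bool:
    """Stub agent: fails on every 20th invocation (~95% success rate).
    Replace this with a real call into your harness."""
    global _calls
    _calls += 1
    return _calls % 20 != 0

def success_rate(cases: list[str], runs_per_case: int = 3) -> float:
    """Run every case multiple times; non-determinism means one run proves little."""
    results = [run_agent(c) for c in cases for _ in range(runs_per_case)]
    return sum(results) / len(results)

rate = success_rate([f"case-{i}" for i in range(50)])
assert rate >= 0.90, f"success rate {rate:.2%} below the 90% gate"
```

In a pytest suite this becomes one parametrized test per case family, with the threshold stored alongside the baseline so regressions are detected as a drop against it.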


Phase 2: Important Documents

12_deployment_patterns.md (FROM TESTING TO PRODUCTION)

“How do I deploy my harness to production at scale?”

Operations manual for taking harnesses from local testing to reliable, scalable production. Covers Docker containerization, Kubernetes orchestration, serverless patterns, CI/CD pipelines, configuration management, scaling strategies, and health monitoring.

Read this if: Deploying harness to production, setting up CI/CD, containerizing Python code, scaling horizontally, implementing health checks

Estimated time: 2–4 hours to understand, 1–2 weeks to implement full deployment pipeline


20_integration_patterns.md (INTEGRATING WITH EXISTING SYSTEMS)

“How do I embed a harness into my application?”

Comprehensive guide to integrating harnesses into production systems. Covers 12 integration patterns: harness as library, REST API, async background jobs, event-driven (Kafka/Pub-Sub), GraphQL, WebSocket/streaming, third-party bots (Slack/Discord/Telegram), database integration, file system access, and monitoring/observability/authentication layers.

Read this if: Embedding harness into existing applications, building APIs, connecting to databases, creating chatbots, implementing event-driven architectures

Estimated time: 2-3 hours to understand, 1-2 weeks to implement integration for your architecture


Foundational Reference Documents

01_foundation_models.md

“What kinds of models exist and when to use each?” [PHASE 1]

LLM vs SLM, multimodal models, training vs inference costs, and when to use each. Key decision: 7B–13B SLM for agent loops, 70B+ LLM for verification.

02_kv_cache_optimization.md

“How do we run longer context efficiently?” [PHASE 1]

KV cache fundamentals and modern optimization techniques: Grouped Query Attention (GQA), PagedAttention (vLLM), INT8/INT4 KV cache quantization, and TurboQuant (3-bit, 6x memory reduction, zero accuracy loss — ICLR 2026).

03_huggingface_ecosystem.md

“Where do I find models and how do I evaluate them?” [PHASE 1]

Finding models on Hugging Face, quantization options (AWQ, GPTQ, 8-bit), and performance trade-offs. Includes decision tree for model selection.

04_memory_systems.md

“How do agents remember, learn, and maintain knowledge?” [PHASE 1]

Four-layer memory architecture, RAG, and Karpathy’s LLM Wiki Pattern (compiled markdown knowledge for bases under ~100 sources). Includes Claude Code’s proven pattern.
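As described, the wiki pattern amounts to concatenating a small set of markdown pages into one context block under a token budget. A minimal sketch, where the ~4 chars/token heuristic and the inline page contents are assumptions (in practice you would read CLAUDE.md, MEMORY.md, and topic files from disk):

```python
def compile_wiki(pages: dict[str, str], max_tokens: int = 10_000) -> str:
    """Concatenate markdown pages into one context block, stopping
    before a rough character budget (~4 chars per token) is exceeded."""
    blocks = []
    budget = max_tokens * 4
    for name, text in pages.items():
        block = f"## {name}\n{text.strip()}\n"
        if budget - len(block) < 0:
            break  # budget exhausted; remaining pages are left out
        budget -= len(block)
        blocks.append(block)
    return "\n".join(blocks)

wiki = compile_wiki({
    "CLAUDE.md": "Project conventions and style rules.",
    "MEMORY.md": "User prefers concise answers.",
})
```

This is why the pattern is framed as an alternative to RAG only for small knowledge bases: past roughly 100 sources the compiled block no longer fits a sane startup budget.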

05_ai_agents.md

“How do agents think and make decisions?” [PHASE 1]

Agentic loop definition, reasoning frameworks (ReAct, Tree of Thoughts, Reflexion, etc.), and recommendations for harnesses.

06_harness_architecture.md

“What is a complete harness and how do I build one?” [PHASE 1]

Definition, seven essential components, proven patterns (Single-Agent, Initializer-Executor, Multi-Agent), implementation checklists, and performance optimizations.

07_openclaw_reference.md

“What can we learn from open-source agent architectures?” [REFERENCE ONLY]

Deep dive into common patterns from open-source agent frameworks (file-based tool registry, skill composition, multi-agent coordination). Read after understanding core concepts in 06.


Quick Reference by Domain

For ML Engineers & AI Researchers

“What are weights? How do transformers work? How do I transfer knowledge?”

  1. Start: 21_model_fundamentals.md — Complete foundations on weights, parameters, neural networks, transformers
  2. Then: 22_knowledge_transfer_methods.md — Distillation, fine-tuning, LoRA, when to use each
  3. Systems: 24_hardware_landscape.md — Hardware choices affect training/inference
  4. Advanced: 02_kv_cache_optimization.md (KV cache techniques), 04_memory_systems.md (knowledge systems)

For Hardware & Systems Engineers

“Should I buy RTX 4090 or MacBook? What’s unified memory? GPU vs CPU vs TPU?”

  1. Start: 24_hardware_landscape.md — CPU, GPU, TPU, Apple Silicon, mobile chips, unified memory, cost/performance
  2. Context: 21_model_fundamentals.md (understand what hardware runs)
  3. Optimization: 02_kv_cache_optimization.md (how hardware acceleration works)
  4. Production: 09_operations_and_observability.md, 12_deployment_patterns.md

For Roboticists & Embodied AI Engineers

“How do I run AI on robots? What’s the full AI stack for physical systems?”

  1. Foundations: 21_model_fundamentals.md (how models work)
  2. Agents: 05_ai_agents.md (agentic loop, decision-making)
  3. Hardware: 24_hardware_landscape.md (edge inference, mobile chips, power constraints)
  4. Real-world: Docs 25 & 27 (applications, case studies)
  5. Systems: 06_harness_architecture.md (orchestration), 09_operations_and_observability.md (telemetry)

For Data Scientists Moving to Production

“I have a model. How do I deploy it and keep it working?”

  1. Architecture: 06_harness_architecture.md (7 essential components)
  2. Testing: 11_testing_and_qa.md (non-deterministic testing, quality metrics)
  3. Ops: 09_operations_and_observability.md (logging, monitoring, cost tracking)
  4. Security: 10_security_and_safety.md (input validation, PII, compliance)
  5. Deploy: 12_deployment_patterns.md (Docker, Kubernetes, CI/CD)

For Platform/DevOps Engineers

“How do I operationalize and scale AI systems?”

  1. Architecture: 06_harness_architecture.md (components, patterns)
  2. Testing: 11_testing_and_qa.md (quality assurance for non-deterministic systems)
  3. Ops: 09_operations_and_observability.md (structured logging, metrics, cost tracking)
  4. Security: 10_security_and_safety.md (hardening, compliance)
  5. Deploy: 12_deployment_patterns.md (Docker, Kubernetes, CI/CD, scaling)
  6. Integration: 20_integration_patterns.md (API patterns, event-driven, GraphQL, WebSocket)

Hardware Economics: Why Unified Memory Matters

The unified memory advantage (Apple M-series vs traditional GPUs) is a game-changer for inference:

| Aspect | NVIDIA GPU | Apple M-series |
|---|---|---|
| Memory Architecture | Separate CPU/GPU memory + PCIe bus | CPU + GPU share same memory |
| Data Transfer Overhead | Copy CPU→GPU (slow), compute, copy GPU→CPU | No copying, instant access |
| Practical Impact | 20–40% slower for memory-bound workloads | 20–40% faster for many AI tasks |
| Inference Latency | Higher due to data movement | Lower, especially streaming |
| Best Use Case | High-throughput batch inference | Interactive/streaming inference |

In practice: A 13B model on a MacBook M3 Max (unified memory) can outperform an RTX 4070 for interactive inference despite lower raw TFLOPS, because there is no PCIe bottleneck.

For production decision-making: Consider unified memory as a +20% throughput advantage for inference workloads.


Production Deployment Checklist

Before Week 6 (Production Readiness)

From 11_testing_and_qa.md:

  • Baseline established (success rate, latency, cost measured)
  • Regression tests configured (comparative metrics vs baseline)
  • Smoke test suite passes (basic functionality verified)
  • Load tests pass (concurrent request handling verified)
  • Pre-deployment security review complete

From 09_operations_and_observability.md:

  • Structured JSON logging implemented
  • Key metrics configured (latency p50/p95/p99, throughput, cost)
  • Cost tracking active (real-time budget enforcement)
  • Health checks implemented (model, memory, tools)
  • Dashboard/alerting configured
  • Loop detection in place (iteration limits + escape strategies)

From 10_security_and_safety.md:

  • Input validation for all untrusted sources
  • Output filtering (no PII leaks, no dangerous commands)
  • Rate limiting configured (per-user/global)
  • Audit logging set up (immutable, compliant)
  • Secret scanning complete (no hardcoded API keys)
  • Compliance review done (GDPR, HIPAA, FTC AI guidance if applicable)
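One common way to implement the rate-limiting item above is a per-user token bucket; a minimal single-user sketch with illustrative parameters (production systems usually keep one bucket per user ID, often in Redis):

```python
import time

class TokenBucket:
    """Token bucket: `rate` requests replenished per second,
    bursting up to `capacity` requests."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=3)
results = [bucket.allow() for _ in range(5)]  # burst of 5 immediate requests
```

With a burst of five back-to-back requests, the first three pass and the last two are rejected until the bucket refills.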

From 12_deployment_patterns.md (optional but recommended):

  • Dockerfile created and tested locally
  • Docker image builds successfully
  • Kubernetes manifests written (if using K8s)
  • Health check probes configured (liveness, readiness, startup)
  • CI/CD pipeline automated (lint → test → build → deploy)
  • Secrets management configured (no env vars in git)
  • Canary or blue-green deployment strategy selected

Consolidated Performance Reference

Model Performance Metrics

| Metric | Typical Value | Context |
|---|---|---|
| SLM (7B–13B) Cost | 10–30× cheaper than LLM | Preferred for agent loops |
| Phi-4 7B Throughput | ~40 tokens/sec (RTX 4090) | Instruction-tuned, fast |
| Mistral 7B Throughput | ~35 tokens/sec (RTX 4090) | Good balance of speed/quality |
| LLM (70B+) Speed | 1–5 tokens/sec (RTX 4090) | Use for verification steps |
| First token latency | 50–200 ms | Initial computation time |
| Streaming latency | 1–5 ms per token | Subsequent tokens (with KV cache) |

Quantization Impact

| Technique | Speedup | Memory Reduction | Accuracy Loss |
|---|---|---|---|
| AWQ 4-bit | 3–4× | ~75% | <0.5% |
| GPTQ 4-bit | 3–4× | ~75% | <0.5% |
| 8-bit Quantization | 2–2.5× | ~50% | <0.1% |
| GQA (KV cache) | 2–4× (attention) | 2–4× (KV cache) | Minimal |
| TurboQuant (3-bit KV) | Up to 8× (attention) | 6× (KV cache) | Zero loss |
| INT4 KV cache | ~3× (attention) | 4× (KV cache) | Small |
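The KV cache reductions in the table follow directly from the cache-size formula (two tensors per layer × KV heads × head dim × context length × bytes per element). A quick check with illustrative model dimensions shows how GQA and quantization multiply:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bits: int) -> int:
    """KV cache size in bytes: 2 tensors (K and V) per layer."""
    return 2 * layers * kv_heads * head_dim * context_len * bits // 8

# Illustrative 32-layer model, 128-dim heads, 32K context:
full = kv_cache_bytes(32, 32, 128, 32768, 16)     # full MHA, FP16 cache
gqa_int4 = kv_cache_bytes(32, 8, 128, 32768, 4)   # GQA (4x fewer KV heads) + INT4

print(full // gqa_int4)  # → 16 (4x from GQA, 4x from quantization)
```

The same arithmetic explains the table: GQA shrinks the `kv_heads` factor, cache quantization shrinks the `bits` factor, and the two compose.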

Cost Comparison (April 2026 Pricing)

| Model | Cost per 1M tokens | Best For |
|---|---|---|
| Claude 3.5 Sonnet | $3 input, $15 output | Verification, complex reasoning |
| GPT-4o | $5 input, $15 output | Reasoning, multimodal |
| Llama 3.1 70B (API) | $0.75 input, $0.90 output | Fast reasoning, cost-effective |
| Local SLM (self-hosted) | ~$0 (hardware cost) | Cost optimization, privacy |
| Hybrid (80% local, 20% cloud) | Up to 80–90% cheaper than pure cloud (when most requests route locally) | Recommended harness pattern |

Note: Prices approximate as of April 2026. Check provider websites for current rates.


Key Insights & Takeaways

For Building Your Harness

  1. Model choice → 7B–13B SLM for loops, 70B+ LLM for verification, quantize to AWQ 4-bit
  2. Memory architecture → Four-layer (context/working/persistent/auto-consolidation), LLM Wiki for <400K words, <10K startup tokens
  3. Agentic loop → Start with ReAct, use Plan-and-Execute for long tasks, add Reflexion for quality
  4. Harness pattern → Single-agent for bounded tasks, Initializer-Executor for long-running, Multi-agent for complex
  5. Performance & cost → GQA + KV cache quantization for memory savings, SLMs 10-30x cheaper, AWQ quantization 3-4x speedup
| Trend | Impact | Action |
|---|---|---|
| SLMs dominate agentic AI | Speed critical | Build loop for speed; verify with LLM |
| KV cache quantization | Longer contexts on same hardware | Use GQA models + INT8/INT4 cache; TurboQuant for 3-bit/6× savings |
| LLM Wiki pattern | Alternative to RAG | Use Karpathy’s markdown wiki for ~100 sources |
| Quantization mainstream | 4-bit standard | Default to AWQ quantization |
| Multi-agent orchestration | Specialization via delegation | Consider hierarchical pattern |
| Auto-dream consolidation | Remember across sessions | Implement auto-consolidation |
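The SLM-for-loops, LLM-for-verification split above can be sketched as a tiny router. The model labels and the keyword heuristic in `classify()` are placeholders (a real harness would classify with the SLM itself or with task metadata):

```python
def classify(task: str) -> str:
    """Crude placeholder heuristic: route verification-style tasks to the LLM."""
    hard_markers = ("prove", "verify", "audit")
    return "llm" if any(w in task.lower() for w in hard_markers) else "slm"

def route(task: str) -> str:
    """Dispatch to a cheap local SLM by default, a cloud LLM when needed."""
    if classify(task) == "slm":
        return f"[local-7b] {task}"    # fast path, ~10-30x cheaper
    return f"[cloud-70b] {task}"       # slow path, used sparingly

print(route("summarize this ticket"))
print(route("Verify the refund calculation"))
```

The cost profile in the tables above comes from keeping the high-volume loop on the `[local-7b]` branch and reserving the `[cloud-70b]` branch for the small fraction of verification calls.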

Recommended Primary Path (Six-Week Plan)

Week 1: Foundation & Hardware

  1. Read 21_model_fundamentals.md (1 hour) — understand weights, parameters, transformers
  2. Read 24_hardware_landscape.md (1 hour) — understand hardware trade-offs
  3. Hardware Decision: Choose your development platform:
    • Local development: MacBook M3 (16GB) or RTX 4070 desktop ($600)
    • Cloud: Use appropriate GPU size for your model
    • Edge/Inference: Consider Apple M-series for edge deployment
  4. Read 01_foundation_models.md (1–2 hours)
  5. Choose a model from HF (15 min)
  6. Read 05_ai_agents.md — pick ReAct as framework (1–2 hours)

Week 2: Core Architecture

  1. Read 06_harness_architecture.md (1–2 hours)
  2. Implement minimum viable harness with 3–5 tools and ReAct loop
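A minimum viable ReAct loop is small: call the model, parse an action, run the tool, append the observation, repeat under a hard iteration cap. `fake_model` and the `search` tool are stubs; swap in your LLM client and tool registry:

```python
def fake_model(history: list[str]) -> str:
    """Stub LLM: acts once, then answers. Replace with a real model call."""
    if not any(h.startswith("Observation") for h in history):
        return "Action: search[handbook]"
    return "Final Answer: found it"

# Illustrative tool registry: name -> callable taking one string argument.
TOOLS = {"search": lambda arg: f"3 results for {arg}"}

def react(question: str, max_steps: int = 5) -> str:
    history = [f"Question: {question}"]
    for _ in range(max_steps):  # hard cap doubles as loop detection
        step = fake_model(history)
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        # Parse "Action: tool[argument]" and run the named tool.
        name, arg = step.removeprefix("Action: ").rstrip("]").split("[", 1)
        history.append(f"Observation: {TOOLS[name](arg)}")
    return "stopped: step limit reached"

print(react("where is the deployment guide?"))  # → found it
```

Everything else in the harness (memory, validation, logging) wraps around this loop; the `max_steps` cap is the first of the loop-detection strategies doc 09 expands on.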

Week 3: Memory & Optimization

  1. Read 04_memory_systems.md (1–2 hours)
  2. Implement memory layers (CLAUDE.md, MEMORY.md, topic files)
  3. Read 02_kv_cache_optimization.md (30 min)
  4. Enable quantization and KV cache optimization

Week 4: Long-Running Harness (If needed)

  1. Implement Initializer-Executor split
  2. Add feature list, progress file, self-verification loop
  3. Test with realistic long-running task

Week 5: Testing & Quality Assurance

  1. Read 11_testing_and_qa.md (2–3 hours)
  2. Build test infrastructure (unit, integration, load tests)
  3. Establish baselines and regression detection
  4. Run pre-deployment validation (success rate ≥90%?)

Week 6: Production Readiness (Before Deploy)

  1. Read 09_operations_and_observability.md (2–3 hours)
  2. Implement monitoring & logging (JSON logs, metrics, cost tracking)
  3. Set up cost controls and health checks
  4. Final deployment checks (tests, logs, alerts, security audit)

Next Steps After Week 6: Advanced Topics & Continuous Learning

Immediate Post-Deployment (First Month)

  1. Iterate on Observability → Review metrics vs projections, adjust alerting, optimize dashboards
  2. Security Hardening → Run adversarial testing, review attack patterns, refine rate limiting
  3. Cost Optimization → Measure actual cost/task, identify expensive patterns, experiment with routing
  4. Quality Baseline Refinement → Compare actual vs projected success rate, identify failure patterns, tune LLM parameters

Month 2-3: Feature Expansion & Optimization

Choose based on priorities:

  • For Cost & Performance: Experiment with KV cache quantization, profile tools, implement caching
  • For Reliability at Scale: Implement doc 12 deployment patterns, set up canary deployments, build rollback automation
  • For Advanced Reasoning: Try alternative frameworks (ToT, Reflexion), multi-agent patterns, confidence scoring
  • For Knowledge Management: Evaluate wiki pattern scaling, hybrid RAG, knowledge versioning

Month 3+: Production Maturity

  • Operational Excellence: Runbooks, decision trees, post-mortem process, automated remediation
  • Monitoring & Analytics: Continuous benchmarking, quality dashboards, long-term trend tracking
  • Advanced Security: Red-teaming, explainability, compliance automation
  • Team & Process: Knowledge transfer, playbooks, on-call runbooks

Quarterly Updates

  • Review CORPUS_AUDIT.md for improvements and new patterns
  • Check model landscape (new SLMs, quantization techniques)
  • Re-run baselines on latest models
  • Review spending trends and optimization opportunities

Recommended Default Stack

| Purpose        | Recommendation          | Why                                              |
| -------------- | ----------------------- | ------------------------------------------------ |
| LLM            | Claude (Anthropic)      | Best safety, reasoning, tool use                 |
| Open Model     | Llama 3 (7B–70B)        | Proven, widely deployed                          |
| SLM            | Phi-4 or Mistral 7B     | Optimized for instruction-following              |
| Quantization   | AWQ (4-bit)             | Best quality/speed trade-off                     |
| Memory         | Markdown files + git    | Human-readable, version-controlled               |
| Reasoning Loop | ReAct                   | Simplest, fastest, proven                        |
| Testing        | pytest + custom harness | Multiple-run tests for non-deterministic systems |
| Monitoring     | Prometheus/Datadog      | Metrics collection and alerting                  |
| Logging        | Structured JSON         | Cost, errors, performance analysis               |
| Deployment     | Docker + K8s/Serverless | Depends on scale and complexity                  |

Questions & Next Steps

For terminology help

  • Term unclear? → See glossary.md for 125+ definitions covering foundations, hardware, and applications with usage context

For implementation help

  • Implementation checklist? → 06_harness_architecture.md
  • Tool integration? → 05_ai_agents.md Tools section
  • Memory architecture? → 04_memory_systems.md

For understanding gaps

  • Concept unclear? → Links at end of each document
  • Model selection stuck? → Flow chart in 03_huggingface_ecosystem.md
  • Reasoning framework choice? → Comparison in 05_ai_agents.md

To validate your harness

  • Has all 7 components? → 06_harness_architecture.md
  • Memory properly layered? → 04_memory_systems.md
  • Using proven pattern? → 05_ai_agents.md
  • Optimal model size? → 01_foundation_models.md + 03_huggingface_ecosystem.md

Before production deployment

  • Testing complete? → 11_testing_and_qa.md checklist
  • Observability implemented? → 09_operations_and_observability.md
  • Cost tracking working? → Cost section in 09
  • Health checks ready? → Health checks section in 09
  • Security hardened? → 10_security_and_safety.md checklist
  • Deployment automated? → 12_deployment_patterns.md (Docker, K8s, CI/CD)

For specific production scenarios

  • Agent stuck in loop? → Debugging in 09_operations_and_observability.md
  • Cost exceeding budget? → Cost section in 09
  • Security concern? → Attack vectors in 10_security_and_safety.md
  • Test results inconsistent? → Non-deterministic testing in 11_testing_and_qa.md

Changelog & Source Attribution

  • April 2026: Expanded to AI/ML Engineering Handbook

    • Handbook restructure: Renamed from “Harness Corpus” to “AI/ML Engineering Handbook (With Harness Focus)”
    • New Parts structure: Foundations (21-22), Hardware (23-24, 26, 28), Harnesses (01-20), Applications (25, 27)
    • New documents: 21 (Model Fundamentals), 22 (Knowledge Transfer), 24 (Hardware Landscape)
    • Expanded glossary: 75+ → 125+ terms covering foundations, hardware, systems
    • New quick reference: Domain-specific learning paths (ML engineers, hardware engineers, roboticists, platform engineers)
    • Hardware decision guide: How to choose development/deployment hardware
    • Hardware economics section: Unified memory advantage explanation
    • New role-based learning paths: Hardware engineers, ML engineers, roboticists
    • Updated building sequence: Week 1 now includes hardware selection
  • Original release (April 2026): Corpus creation and initial improvements

    • Original 8 documents (01–08)
    • CORPUS_AUDIT.md: Comprehensive gap analysis
    • New documents: 09 (Operations), 10 (Security), 11 (Testing), 12 (Deployment)
    • New glossary: 75+ terms defined with context
    • Index improvements: Phase labeling, consolidated metrics, clearer paths, post-deployment guidance
    • KV cache optimization techniques (GQA, PagedAttention, INT8/INT4, TurboQuant — Google Research Blog, ICLR 2026)
    • LLM Wiki pattern (compiled markdown knowledge — Karpathy’s LLM Wiki Gist, April 2026)
    • Claude Code memory architecture (Anthropic)
    • Open-source agent framework patterns

For citations and detailed sources, see individual document footers.


See Also

  • Doc 01 (Foundation Models) — Understand what models are available and when to use each; essential context for building your harness
  • Doc 06 (Harness Architecture) — Learn the seven components of a complete system; start here after understanding models
  • Doc 09 (Operations & Observability) — Master monitoring and debugging before deploying anything to production; part of critical Phase 1
  • Doc 21 (Model Fundamentals) — Dive deeper into how neural networks and transformers actually work; for researchers and those wanting deeper understanding