I Taught Myself Agentic AI by Writing 45,000 Lines About It
A few weeks ago I wrote about how AI inverted my development process. Experience first, architecture after. Let Claude Code handle the implementation. Don’t overthink the design — you can always throw more compute at the problem later.
I believed that. Then I tried to build an AI agent, and learned that “throw more compute at it” is exactly wrong when the compute itself is the thing you’re building. You can’t experience-first your way through model selection, reasoning frameworks, and memory architecture. You need to understand what you’re building before you build it.
So I did something I haven’t done since my Sparx Enterprise Architect days. I studied first.
The Rabbit Hole
It started with a question I was embarrassed not to know the answer to: what actually is a model? Not “GPT is a chatbot” — what are weights and parameters? How does a transformer process text? What happens during training versus inference, and why does inference cost 99% of the operational budget?
That led to: how do you choose between a 7B and a 70B model? The cost difference is 10-30x. A 7B SLM runs on my MacBook. A 70B LLM needs cloud GPUs at three dollars an hour. The answer matters — and it depends on what you’re building. Agent loops need speed, so SLMs. Complex reasoning needs depth, so LLMs. Production systems need both, so you route between them.
Which led to: what turns a model into an agent? A model alone is just a text predictor. An agent needs seven components — the model, tools it can call, memory that persists across sessions, a planning loop that decides what to do next, sandboxing that prevents damage, orchestration that manages sessions, and state management that tracks progress. Remove any one and you have a chatbot.
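Those components can be sketched as a minimal loop. Everything below is illustrative — the names, the stub model, and the single whitelisted tool are mine, not from any particular framework:

```python
# A minimal sketch of the agent components wired together.
# call_model is a stand-in for a real model; it returns a structured decision.

def call_model(prompt: str) -> dict:
    return {"action": "finish", "result": "done"}

TOOLS = {  # tools the agent can call
    "search": lambda query: f"results for {query}",
}

def run_agent(goal: str, max_steps: int = 5) -> str:
    memory: list[str] = []             # memory: persists observations
    state = {"goal": goal, "step": 0}  # state management: tracks progress
    for _ in range(max_steps):         # planning loop: decide what to do next
        state["step"] += 1
        decision = call_model(f"Goal: {goal}\nMemory: {memory}")
        if decision["action"] == "finish":
            return decision["result"]
        tool = TOOLS.get(decision["action"])
        if tool is None:               # sandboxing: only whitelisted tools run
            memory.append(f"unknown tool: {decision['action']}")
            continue
        memory.append(str(tool(decision.get("input", ""))))
    return "gave up"                   # orchestration would resume the session
```

Strip out the loop, the tools, or the memory and `run_agent` collapses back into a single `call_model` — a chatbot.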
Which led to: how does reasoning work? Not all agents think the same way. Chain-of-Thought breaks problems into explicit steps. ReAct interleaves reasoning with tool use. Tree of Thoughts explores multiple solution paths. Reflexion critiques its own output and tries again. There are at least seven distinct frameworks, each suited to different problems. Choosing wrong means your agent either overthinks simple tasks or underthinks complex ones.
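ReAct is the easiest of these to show in miniature. In the toy below the "model" is a scripted function so the trace is deterministic; a real agent would substitute an actual model call, but the interleaving of Thought, Action, and Observation is the pattern itself:

```python
# A toy ReAct trace: the model alternates reasoning ("Thought") with
# tool use ("Action"), and each tool result is fed back as an "Observation".

def scripted_model(transcript: str) -> str:
    if "Observation" not in transcript:
        return "Thought: I need the capital first.\nAction: lookup[France]"
    return "Thought: I have the answer.\nFinal Answer: Paris"

def lookup(entity: str) -> str:
    return {"France": "capital is Paris"}.get(entity, "unknown")

def react(question: str, max_turns: int = 4) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_turns):
        reply = scripted_model(transcript)
        transcript += "\n" + reply
        if "Final Answer:" in reply:
            return reply.split("Final Answer:")[1].strip()
        if "Action: lookup[" in reply:
            arg = reply.split("lookup[")[1].rstrip("]")
            transcript += f"\nObservation: {lookup(arg)}"
    return "no answer"

print(react("What is the capital of France?"))  # prints "Paris"
```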
Each answer opened three more questions. I kept writing.
45,000 Lines Later
Eight documents became twenty. Twenty became twenty-eight. I added hardware guides because I wanted to run models locally on my M4 MacBook with 32GB of unified memory. Could I fit a 7B model? (Yes, about 5GB quantised, leaving 27GB free.) How fast? (40-50 tokens per second through Apple’s MLX framework.) What about a 14B? (Also fits, at about 9GB.) 70B? (Not a chance.)
I added deployment patterns — Docker, Kubernetes, rollback strategies, health check endpoints. I added security — prompt injection defence, input validation, PII handling. Testing non-deterministic systems — you can’t just assert equals when the output changes every run. Cost management — token counting, budget enforcement, the economics of local versus cloud.
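The testing point deserves a concrete shape. When the output text changes every run, you assert properties — parseability, schema, invariants — rather than exact strings. A minimal sketch (field names are illustrative):

```python
# Property-based check for non-deterministic agent output:
# assert structure and invariants, never an exact string.
import json

def check_agent_output(raw: str) -> None:
    data = json.loads(raw)                       # must be valid JSON at all
    assert set(data) >= {"name", "confidence"}   # required fields present
    assert isinstance(data["name"], str) and data["name"]
    assert 0.0 <= data["confidence"] <= 1.0      # an invariant, not a value

# Two different runs, both acceptable:
check_agent_output('{"name": "John Smith", "confidence": 0.9}')
check_agent_output('{"confidence": 0.42, "name": "J. Smith", "note": "x"}')
```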
Then I audited the whole thing for accuracy. Found GPU TFLOPS numbers that were wrong by a factor of 1,000 in one draft. Found citations to research papers that didn’t exist — the AI had hallucinated them. Found claims about specific tools that were outdated or inaccurate. Every fact needed verification before I’d publish under my name.
I verified what I could against primary sources. TurboQuant is real — Google Research, ICLR 2026, 3-bit KV cache compression with 6x memory reduction. Karpathy’s LLM Wiki pattern is real — a three-layer alternative to vector databases for small-to-medium knowledge bases. But some content that looked authoritative was fabricated. Trust but verify.
I called it The Harness Handbook. 29 chapters, 45,000 lines, 1,970 working code examples, an 84-term glossary. Validation checklists for every chapter, cross-references linking everything together. Published as a searchable wiki section on this site.
Was it overkill? Absolutely.
Choosing a Runtime (Three Times)
With the handbook written, I started making real decisions. First: how to run a model locally.
My M4 MacBook with 32GB unified memory can comfortably run 7B-14B models. Three options:
Ollama was the obvious choice. Everyone recommends it. `brew install ollama`, pull a model, done. But Ollama runs a separate HTTP server — every model call is an HTTP round-trip to localhost. For interactive chat, fine. For an agent making hundreds of automated calls, that’s unnecessary overhead.
llama-cpp-python was the technical choice. In-process Python calls, no HTTP. And it has GBNF grammar constraints — you can mechanically force the model to produce valid JSON at the token level. Not “please output JSON” but “you literally cannot generate an invalid token.” For an agent that needs reliable structured output over hundreds of iterations, that’s compelling.
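Here is roughly what that looks like. The grammar below is a deliberately simplified GBNF sketch that accepts only a flat `{"action": ..., "input": ...}` object, and the model path is a placeholder — treat this as a shape, not a drop-in:

```python
# Token-level JSON enforcement with llama-cpp-python (sketch).
# ACTION_GBNF is a simplified grammar; a production one would cover
# escapes, numbers, and nesting.

ACTION_GBNF = r'''
root   ::= "{" ws "\"action\":" ws string "," ws "\"input\":" ws string ws "}"
string ::= "\"" [^"]* "\""
ws     ::= [ \t\n]*
'''

def constrained_call(model_path: str, prompt: str) -> str:
    # imported here so the grammar can be inspected without llama-cpp installed
    from llama_cpp import Llama, LlamaGrammar
    llm = Llama(model_path=model_path, n_ctx=4096, verbose=False)
    grammar = LlamaGrammar.from_string(ACTION_GBNF)
    out = llm(prompt, grammar=grammar, max_tokens=128)
    return out["choices"][0]["text"]  # cannot fail to match the grammar
```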
MLX was the match-your-hardware choice. Apple’s own framework, purpose-built for Apple Silicon unified memory. Python-native, fastest on M4, `pip install mlx-lm`. No grammar constraints, but Qwen 2.5 is already excellent at JSON — validate and retry handles the rare failures.
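"Validate and retry" is a small wrapper. The version below takes any text generator as `generate_fn` — with mlx-lm that would be a closure over `mlx_lm.load` and `mlx_lm.generate`; the demo uses a fake generator so the retry path is visible:

```python
# Validate-and-retry for JSON output without grammar constraints.
# generate_fn: any callable prompt -> text (e.g. wrapping mlx_lm.generate).
import json

def json_with_retry(generate_fn, prompt: str, retries: int = 3) -> dict:
    for _ in range(retries):
        raw = generate_fn(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # feed the failure back and ask again
            prompt += "\nYour last reply was not valid JSON. Reply with JSON only."
    raise ValueError("no valid JSON after retries")

# Fake generator that fails once, then succeeds:
replies = iter(['not json', '{"status": "ok"}'])
result = json_with_retry(lambda p: next(replies), "Return status as JSON")
print(result)  # prints {'status': 'ok'}
```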
I went with Ollama first. Then reconsidered — HTTP overhead on every call, hundreds of calls. Switched to llama-cpp-python for the grammar constraints. Then reconsidered again — grammar constraints are a safety net; MLX’s speed advantage compounds over hundreds of iterations. Running Apple’s framework on Apple’s chip is the optimal pairing.
Three choices. The obvious answer was wrong, the technical answer was good but not best, and the match-your-hardware answer won.
The Hidden Economics
Building the handbook revealed something I hadn’t expected about cost — and it’s the most shareable insight from the whole project.
A purpose-built local harness uses 25x fewer tokens per call than a general-purpose LLM. Here’s why:
When you ask Claude Code to do something, every call carries overhead:
| Component | Claude Code | Dedicated Harness |
|---|---|---|
| System prompt | ~5,000 tokens (full assistant instructions) | ~110 tokens (domain rules only) |
| Tool definitions | ~3,000 tokens (20-40 tools described) | 0 (Python dispatches tools directly) |
| Conversation history | ~8,000 tokens (growing each turn) | 0 (rebuilt from state each call) |
| File contents | ~6,000 tokens (reading code/context) | 0 (pre-rendered compact text) |
| Response | ~2,000 tokens (explanatory text) | ~650 tokens (structured JSON) |
| Total | ~24,500 tokens/turn | ~997 tokens/call |
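The arithmetic behind the headline numbers, using only the figures above (the implied API price is derived from the article's own totals, not a quoted rate card):

```python
# Sanity checks on the token economics, from the table's totals.
per_call_ratio = 24_500 / 997            # ~24.6 -> the "25x fewer tokens" figure
total_ratio    = 39_000_000 / 1_000_000  # 39x at scale: history grows each turn
implied_price  = 258 / 39                # ~$6.6 per million API tokens, implied
print(round(per_call_ratio), total_ratio, round(implied_price, 2))
```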
At scale — researching 200 ancestors in my genealogy project — that’s the difference between 1 million tokens locally (4p in electricity on my M4) and 39 million tokens via a general-purpose API ($258). The per-call gap is 25x, but conversation history grows with every turn, so over a long research run the total gap widens to 39x.
More importantly, if you’re using Claude Code on a subscription, those 39 million tokens are tokens you can’t spend on actual software development. Running the genealogy work locally preserves roughly 1,273 typical Claude Code interactions for coding. Don’t use a Swiss Army knife when you need a scalpel.
The Use Case Was Already There
With the handbook done and the runtime chosen, I needed something to build. And it was sitting in the folder next door.
I had a genealogy research project — 208 people across six generations, tracked in `.corpus-state.json` with sources, relationships, and citations. Seven production Python tools: `freebmd.py` searches civil registration records (births, deaths, marriages from 1837). `familysearch.py` queries census data. `cwgc.py` searches the Commonwealth War Graves Commission. `findagrave.py`, `freecen.py`, `freereg_search.py` — each one a working search tool I’d built over months.
Seven tools, 200+ people, 100+ open research tasks. Multi-step, path-dependent, fuzzy reasoning, verifiable outputs. The textbook definition of an agent problem.
I was wrong about that, in ways I didn’t expect. But that’s Part 2.
This is Part 1 of a three-part series. Part 2: Seven Iterations of Failure covers building the agent, watching it fail, and discovering where AI does and doesn’t add value. Part 3: It’s Not About Size covers the model selection breakthrough.
The Harness Handbook is available on this site — 29 chapters of practical AI/ML engineering, from model fundamentals through production deployment.