Zero to One Needs One to N — Part 3
Part 1 covers building the knowledge base. Part 2 covers seven iterations of failure: discovering that most AI problems are deterministic, removing the LLM entirely, then realising I’d gone too far. This is where I put it back.
Three hundred leads. Each one a record that failed one deterministic gate but looked promising. A name match with the wrong district. A date two years off. A census entry without enough context to confirm. The deterministic system had nothing useful to say about any of them.
I needed the LLM back. But not everywhere. Not for scoring. Not for pattern matching. For one specific job: reading ambiguous evidence and reasoning about what to search next.
The Sandwich
The architecture that emerged has three layers:
Deterministic in. Python searches eight sources, scores every result through pass/fail gates, applies thirty coded patterns. Facts confirmed. Impossibilities rejected. Everything in between becomes a lead with evidence and failed gates attached.
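The pass/fail gate idea can be sketched in a few lines. This is a hypothetical illustration, not the actual code: the gate names, record fields, and two-year tolerance are assumptions, and the real system has thirty patterns rather than three gates.

```python
from dataclasses import dataclass

@dataclass
class Person:
    surname: str
    birth_year: int
    district: str

@dataclass
class Record:
    surname: str
    birth_year: int
    district: str

def gate_surname(person, record):
    return person.surname.lower() == record.surname.lower()

def gate_birth_year(person, record, tolerance=2):
    # Census ages are often a year or two off, so allow a small window.
    return abs(person.birth_year - record.birth_year) <= tolerance

def gate_district(person, record):
    return person.district == record.district

GATES = {
    "surname": gate_surname,
    "birth_year": gate_birth_year,
    "district": gate_district,
}

def score(person, record):
    """Confirm if every gate passes; otherwise the record becomes a
    lead, with the list of failed gates attached as context."""
    failed = [name for name, gate in GATES.items() if not gate(person, record)]
    return ("confirmed", failed) if not failed else ("lead", failed)
```

The point of returning the failed gates rather than a bare verdict is that they travel with the lead: the probabilistic layer later reads exactly which checks failed, not just that something did.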
Probabilistic middle. DeepSeek-R1 14B reads the lead — what evidence exists, what gates failed, what family context the local graph provides. It reasons step by step about what search would resolve the uncertainty. Not “check the census” but “search the 1901 census for John in Windley — if the household contains wife Sarah and son Robert, that confirms John born 1861.”
Python executes the suggested search. Scores the results through the same deterministic gates. New evidence feeds back to the LLM. It reads what came back and suggests what to try next. An iterative loop — search, score, reason, search again — until the lead is resolved or the model runs out of ideas.
Deterministic out. After the loop, every piece of accumulated evidence goes through the gates one final time. Anything that now passes all checks becomes a confirmed fact. The lead is promoted, dismissed, or left open with more evidence for next time. I review the results. No fact gets recorded without a human deciding it’s correct.
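The three layers reduce to a small loop. This is a minimal sketch under assumed interfaces: `suggest_search`, `run_search`, and `passes_all_gates` stand in for the LLM call, the source search, and the deterministic scorer, and none of these names come from the actual project.

```python
def investigate(lead, suggest_search, run_search, passes_all_gates,
                max_rounds=5):
    """Search, score, reason, search again, until the lead is resolved
    or the model runs out of ideas."""
    evidence = list(lead["evidence"])
    for _ in range(max_rounds):
        # Probabilistic middle: the LLM only ever proposes a search.
        query = suggest_search(lead, evidence)
        if query is None:  # the model is out of ideas
            break
        # Deterministic execution: run the search, accumulate results.
        evidence.extend(run_search(query))
        if passes_all_gates(lead, evidence):
            break
    # Deterministic out: the gates decide the outcome, never the LLM,
    # and a resolved lead is still human-reviewed before recording.
    if passes_all_gates(lead, evidence):
        return ("resolved", evidence)
    return ("open", evidence)
```

Note where the boundary sits: the LLM's output is a query, never a verdict. Everything it suggests is re-scored by the same gates that rejected the lead in the first place.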
The LLM never decides what’s a fact. It only decides what to search for. Wrong suggestion? Wastes one search. Right suggestion? Unlocks a family line. The asymmetry works in your favour.
What Actually Happened
I tested on three leads from different families. George Gladwin, born 1855. Martha Caldwell, born 1842. Frank Hodgkinson, born 1877. Each found in a single census with no other records.
The LLM suggested searches across FreeBMD, FamilySearch, parish registers, probate, census. Python executed them. The scorer classified results. FamilySearch was the star — its generic search endpoint crosses every collection simultaneously. Military records, burial registers, probate, church records. One call, everything they’ve indexed.
All three resolved. George Gladwin appeared in three different census households across three decades. Martha Caldwell appeared in two Alderwasley households. Frank Hodgkinson appeared in the 1911 census in Derby. Multiple corroborating sources, all passing deterministic gates. Four facts extracted.
The sandwich worked. The LLM asked the right questions. Python validated the answers.
Then I Grouped the Leads
Three hundred individual leads investigated one at a time would take fifteen to twenty hours. But most leads clustered naturally by household. One census produced twelve leads — all children from the same family. One search for the household head could resolve all twelve at once.
296 individual leads collapsed into 57 household clusters. The LLM now sees the whole family in one prompt. One call reasons about the entire household. “Search the 1851 census for the household head — that confirms everyone in one hit.” The LLM sees connections between siblings that individual investigations would miss.
57 clusters at ten to fifteen minutes each. Maybe ten hours instead of twenty. Same cost — zero, all local. Better quality, because the LLM has richer context.
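The clustering step itself is trivial, which is part of the point. A hypothetical sketch, assuming each lead carries a census household key; leads without one fall back to being investigated alone:

```python
from collections import defaultdict

def cluster_by_household(leads):
    """Collapse individual leads into household clusters, so one search
    for the household head can resolve every lead in the cluster."""
    clusters = defaultdict(list)
    for lead in leads:
        # A lead with no household gets its own per-person key, so it
        # still gets investigated, just individually.
        key = lead.get("household_id") or f"solo:{lead['person']}"
        clusters[key].append(lead)
    return dict(clusters)
```

A handful of lines of grouping is what turned 296 investigations into 57, with no change to the model at all.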
The Wall
I was pleased with this. Household clusters worked. The 14B model reasoned well about individual families, suggested sensible searches, adapted when results came back empty.
Then I thought: what if the cluster wasn’t one household but the whole tree?
296 leads, each as a one-liner. About 8,000 tokens — fits in the context window. The LLM could spot cross-family connections no household cluster would find. “Elizabeth Barker married John Cauldwell — the Barker leads and the Cauldwell leads are the same family.” “Three clusters have leads from the same 1861 parish — one search covers all of them.”
I didn’t even need to run it to know it wouldn’t work. The evidence was already in front of me.
This model couldn’t reliably suggest FamilySearch when I explicitly told it to in the system prompt. Five words: “Always include a FamilySearch search.” It ignored them half the time. I had to auto-inject the search in code because the model wouldn’t follow a direct instruction in a 5,000-token context.
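The fix was mechanical: stop trusting the instruction and enforce it in code. A sketch of the auto-inject, with illustrative names throughout; the real suggestion format and query construction are not shown here:

```python
def ensure_familysearch(suggestions, person):
    """Append a FamilySearch search if the model left it out.

    The model was told to always include one and ignored the
    instruction half the time, so code guarantees it instead.
    """
    if not any(s.get("source") == "familysearch" for s in suggestions):
        suggestions.append({
            "source": "familysearch",
            "query": f"{person['name']} b. {person['birth_year']}",
        })
    return suggestions
```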
A model that can’t follow an explicit instruction in a small context won’t make subtle cross-family inferences in a large one.
A 14B reasoning model is a specialist, not a generalist. Give it one household and it chains through five inference steps brilliantly. Give it fifty families and it produces generic advice. Deep reasoning about one thing — excellent. Broad reasoning about many things — not its strength.
Zero to One, One to N
Peter Thiel’s distinction between zero-to-one (creating something new) and one-to-N (scaling what exists) maps exactly to what I’d built.
The thirty Python rules are one-to-N. Take a known pattern — mother-in-law with different surname equals maiden name — and apply it instantly, reliably, at scale. This is Tim Cook territory. Operational excellence. Every record scored, every impossibility rejected, every pattern checked. It will never get a date calculation wrong. It will also never discover something new.
The LLM reasoning about a lead is zero-to-one. “Can’t disambiguate between two John Cauldwells — what if we search the 1901 census for the one in Windley?” That insight didn’t come from a rule. It came from reading uncertain evidence and making a creative leap. It might be wrong. It might waste a search. But when it’s right, it unlocks a family line that no checklist would have found.
The architecture is zero-to-one discovery layered on top of one-to-N execution. And the lesson from hitting the wall is that zero-to-one itself has tiers.
The Tiers
Tier 1: One-to-N execution. Python. Zero cost. Instant. Perfect reliability. Searches, scores, validates, applies known patterns. This is where 95% of the work happens. Every rule I codified from the LLM’s earlier discoveries runs here — instantly, reliably, at scale.
Tier 2: Focused zero-to-one. DeepSeek-R1 14B on my M4. Zero cost beyond electricity. About 9GB of RAM, 173 seconds per call. Reads one household’s evidence and reasons about what to search next. A specialist — brilliant at deep, step-by-step inference about a single focused problem. Give it one family and it’ll chain through five logical steps to suggest exactly the right search. Give it fifty families and it’ll tell you to check FreeBMD. The Steve Jobs who can revolutionise one product line but not see across the whole company.
Tier 3: Strategic zero-to-one. Claude, GPT-4, or whatever model has the breadth. Costs per token. Fast. Sees the whole tree, spots cross-family connections, prioritises clusters, makes the broad calls the specialist can’t. The Steve Jobs who sees how the iPod leads to the iPhone leads to the App Store. One or two calls to set the research agenda. Then the specialist and the executor take it from there.
Each tier handles what the one below can’t. Route up only when you hit the wall.
The Economics
This matters because it inverts the default assumption. Most AI architectures start with the biggest model and optimise down. “Use GPT-4 for everything, then find the cheap bits.”
I went the other direction. Start with code. Move to the cheapest model that handles the gap. Move to the API only for the gap the cheap model can’t cover.
Tier 1 handles the 300-lead backlog. Python confirms facts, rejects impossibilities, generates leads with evidence. Cost: zero.
Tier 2 investigates the 57 household clusters. The 14B model reads each family, suggests searches, iterates until resolved. Cost: zero. My electricity bill went up by maybe fifty pence.
Tier 3 would make one or two strategic calls — “which clusters connect, what order should I investigate them?” Perhaps a few hundred tokens. Cost: fractions of a penny.
The total API cost for a 400-person family tree: essentially nothing. Because the expensive model only touches the tiny fraction of work that genuinely needs broad intelligence. Everything else is code or a local model that’s already paid for.
What I Learned
Remove the LLM, then put it back. I can’t design the right architecture from first principles. I had to overshoot — give the LLM everything, watch it fail — then undershoot — remove it entirely, find the gap — then find the middle. The sandwich emerged from oscillation, not planning.
Zero-to-one and one-to-N are both essential. The Python rules are one-to-N — executing known patterns at scale. The LLM is zero-to-one — discovering what to search when the patterns run out. You need the LLM to find the rules. You need Python to run them. Trying to use either for the other’s job is how I spent two weeks iterating.
A 14B reasoning model is a focused specialist. Deep reasoning about one problem: excellent. Broad reasoning across many problems: mediocre. Know which problem you’re handing it. This alone would have saved me days of wrong approaches.
Route up, never down. Start with deterministic code. Move to local inference only for what code can’t handle. Move to API only for what local can’t handle. Each tier earns its place by demonstrating the tier below isn’t enough.
The LLM is a question-asker, not an answer-giver. For factual research, the AI’s job is suggesting what to search for — not confirming what’s true. Wrong question wastes one search. Wrong fact corrupts your data. Put the probabilistic layer where wrong answers are cheap.
Clustering is free intelligence. Grouping related leads gave the LLM richer context and the searches broader impact. 57 investigations instead of 300. No additional compute, better results. Sometimes the best optimisation is how you organise the input, not how you improve the model.
This started as a genealogy project. It turned into a practical education in where AI belongs — and where it doesn’t. The answer isn’t one tier. It’s knowing which tier each problem needs, and routing there only when the tier below hits its wall. Zero-to-one needs one-to-N. One-to-N needs zero-to-one. The architecture is both, in the right order.
Part 1: The Handbook · Part 2: Most AI Problems Are Deterministic · Part 3: Zero to One Needs One to N
The Harness Handbook — 29 chapters of practical AI/ML engineering, from model fundamentals through production deployment. Built from scratch, tested against reality.