Most AI Problems Are Deterministic — Part 2
Part 1 covers building the knowledge base. This is where I find out whether any of it was useful.
I have a genealogy project. 208 people across six generations of my Derbyshire family tree. Python scripts search FreeBMD for civil registration, FamilySearch for census data, and the Commonwealth War Graves Commission for WW1 casualties. Seven source libraries, all working.
What I didn’t have was the thinking layer. The bit that looks at three search results for “Ernest Cauldwell” and decides which one is actually my great-uncle.
Genealogy seemed perfect for an AI agent. Multi-step reasoning. Fuzzy matching. Verifiable outputs. Just add intelligence.
Every Fix Had the Same Shape
I built a four-phase pipeline: search sources, correlate results with an LLM, record confirmed facts, branch to new leads. Qwen 2.5 7B Instruct, running locally on my M4 through MLX. I tested against a person I already knew the answers for — so I’d see exactly where the agent got it wrong.
The first run matched a dead man to a 1957 marriage record. Ernest died in 1918. In France. In a war. The LLM didn’t do the date arithmetic. It labelled someone born in 1871 as a child of someone born in 1887. And it invented four family members — Sarah, John, Mary, Thomas — who appeared nowhere in the data. Pure hallucination, because a census household should have parents, and guessing felt more productive than admitting it didn’t know.
I stared at the output and thought: this is worse than useless. A wrong fact with a confident source citation looks exactly like a right one.
I fixed each failure. Every fix was the same move: take something away from the LLM, give it to Python.
Date arithmetic? Python pre-annotates results before the LLM sees them. Geographic matching? Python lookup table — Belper district covers Turnditch. Temporal impossibilities? Python validator, never wrong. Record format parsing? Python, search-type-aware. Match scoring? Python weighted algorithm — name similarity, date proximity, district, family context.
Seven iterations. Each one better. Each one with the same diagnostic: I’d given the LLM a deterministic task. Every time I replaced it with code, accuracy went up and cost went down.
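The weighted match scorer from the list above might be sketched like this; the weights and field names are illustrative, and the real tuning would come from testing against people with known answers:

```python
from difflib import SequenceMatcher

def match_score(record: dict, person: dict) -> float:
    """Weighted score over name similarity, date proximity, district, and family context."""
    name_sim = SequenceMatcher(None, record["name"].lower(),
                               person["name"].lower()).ratio()
    date_prox = max(0.0, 1.0 - abs(record["year"] - person["birth_year"]) / 10)
    district = 1.0 if record["district"] == person["district"] else 0.0
    known = person.get("known_relatives", [])
    family = (len(set(record.get("household", [])) & set(known)) / len(known)
              if known else 0.0)
    # Illustrative weights: name similarity dominates, family context next.
    return 0.4 * name_sim + 0.25 * date_prox + 0.15 * district + 0.2 * family

john = {"name": "John Cauldwell", "birth_year": 1861, "district": "Belper",
        "known_relatives": ["Elizabeth Cauldwell"]}
exact = {"name": "John Cauldwell", "year": 1861, "district": "Belper",
         "household": ["Elizabeth Cauldwell"]}
print(round(match_score(exact, john), 2))  # a perfect match scores 1.0
```

Deterministic, repeatable, and free to run on every candidate record, which is exactly what the LLM-based correlation was not.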
By run 7, the architecture was: Python searches, Python scores, Python validates. The LLM’s remaining job was reading family context and suggesting what to research next. About 5% of the cognitive work — but the 5% Python couldn’t do.
The Martha Barker Test
I had a perfect test for that remaining 5%. In Ernest Cauldwell’s 1891 census household, Martha Barker is listed as “Mother-in-Law.” In census terminology, relationships are to the Head — John. So Mother-in-Law means: mother of John’s wife Elizabeth. If Elizabeth’s mother is Martha Barker, then Elizabeth’s maiden name is Barker.
Five steps of reasoning. A human genealogist spots it in seconds. None of my models could make even the first inference unprompted.
Qwen 7B Instruct: only spotted it when directly asked. Qwen 14B Instruct: said Martha’s background “might provide clues about Elizabeth’s maiden name.” Almost. But the explicit conclusion — “Barker is the maiden name” — never appeared. The five-step chain was too long for an instruction model to complete unprompted, even at 14B.
I was about to conclude that local models simply couldn’t do multi-step inference without hand-holding. Build Python pattern detectors instead — “if mother-in-law has different surname, that’s a maiden name clue” — and ask the LLM the specific question. Viable, but it defeats the purpose. I’d be programming the insights rather than letting the model find them.
Then I realised I’d been testing the wrong type of model entirely.
It Wasn’t the Size. It Was the Type.
Every model I’d tested was an instruction-tuned text generator — trained to follow directions and predict the most likely next token. Good at fluent text. Not trained for multi-step logical reasoning.
Reasoning models are a different category. Trained to chain through logic steps before answering. They think, then respond. The distinction sounds academic. It isn’t. Instruction models predict the next word. Reasoning models solve the problem first, then write the answer. DeepSeek-R1, QwQ, OpenAI’s o1/o3 — these are reasoning models. Qwen Instruct, Llama Instruct, Mistral — instruction models.
DeepSeek-R1-Distill-Qwen-14B. Same parameter count as the Qwen 14B Instruct I’d already tested. Same hardware. Same prompt. Same family.
“the mother-in-law is Martha Barker… Martha is the mother-in-law, so she would be Elizabeth’s mother. So, perhaps Elizabeth’s full name was Elizabeth Barker, and she married John Cauldwell. That could be a starting point for searching.”
All five inference steps. No prompting. No hand-holding. Same size model, same hardware, completely different result. It was slower — 173 seconds versus 50 for Qwen 14B Instruct. It genuinely takes time to think. But being right is everything.
Model selection isn’t about size. It’s about what the model was trained to do.
Then I Removed It Entirely
DeepSeek-R1 made the maiden name inference. It also spotted child gaps suggesting infant deaths, military-age sons worth checking in war records, and missing family members between censuses.
I was impressed. Then I looked at what it had actually done.
Every one of these insights was a pattern. Mother-in-law with different surname equals maiden name. Gap of 4+ years between children suggests infant deaths. Men born 1880-1900 might appear in CWGC records. A genealogist doesn’t reason through these conclusions fresh each time — they recognise patterns they’ve seen a thousand times.
So I coded them. Thirty rules in a Python class. Every insight DeepSeek-R1 had generated, turned into deterministic checks.
The result was humbling. The Python analyser produced the same maiden name insight — instantly, reliably, every time. DeepSeek-R1 took 173 seconds and sometimes missed it.
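The maiden-name rule, sketched as one such deterministic check. The surname handling here is deliberately naive (last word of the name); the real analyser would presumably handle middle names and spelling variants:

```python
def maiden_name_clues(household: list[dict]) -> list[str]:
    """Rule: a mother-in-law whose surname differs from the head's
    is usually the maiden name of the head's wife."""
    head = next((p for p in household if p["relation"] == "Head"), None)
    wife = next((p for p in household if p["relation"] == "Wife"), None)
    if head is None or wife is None:
        return []
    head_surname = head["name"].split()[-1]
    clues = []
    for p in household:
        if p["relation"] == "Mother-in-Law":
            surname = p["name"].split()[-1]
            if surname != head_surname:
                clues.append(
                    f"{wife['name']}'s maiden name is probably {surname} "
                    f"(mother-in-law {p['name']} has a different surname)"
                )
    return clues

# The 1891 Cauldwell household:
household = [
    {"name": "John Cauldwell", "relation": "Head"},
    {"name": "Elizabeth Cauldwell", "relation": "Wife"},
    {"name": "Martha Barker", "relation": "Mother-in-Law"},
]
print(maiden_name_clues(household))
```

The five-step inference chain that defeated two instruction models collapses into one conditional.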
I replaced the scoring with pass/fail gates. A record either passes all checks — name, date, geography — or it doesn’t. No probabilistic thresholds. No “0.88 confidence.” Fact or lead or impossible.
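In sketch form, the gate logic might look like this; the name gate is exact comparison purely for illustration, and the field names are assumptions:

```python
from enum import Enum

class Verdict(Enum):
    FACT = "fact"              # passes every gate
    LEAD = "lead"              # fails a soft gate but isn't ruled out
    IMPOSSIBLE = "impossible"  # contradicts a known fact

def classify(record: dict, person: dict) -> Verdict:
    """Pass/fail gates instead of a probabilistic score."""
    # Hard gate: an event after death is impossible, full stop.
    if person.get("death_year") and record["year"] > person["death_year"]:
        return Verdict.IMPOSSIBLE
    gates = [
        record["name"] == person["name"],                     # name gate
        abs(record["year"] - person["birth_year"]) <= 2,      # date gate
        record["district"] in person["plausible_districts"],  # geography gate
    ]
    return Verdict.FACT if all(gates) else Verdict.LEAD

john = {"name": "John Cauldwell", "birth_year": 1861, "death_year": None,
        "plausible_districts": {"Belper"}}
print(classify({"name": "John Cauldwell", "year": 1862, "district": "Belper"}, john))
# → Verdict.FACT
print(classify({"name": "John Cauldwell", "year": 1862, "district": "Nottingham"}, john))
# → Verdict.LEAD
```

Every record lands in exactly one bucket, and no threshold like 0.88 ever needs defending.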
The final system: Python searches eight sources, scores through deterministic gates, analyses with thirty coded patterns. No LLM in the runtime. Zero wrong facts. Instant.
Deterministic Problems in Reasoning Costumes
Here’s the deeper lesson that goes beyond genealogy.
Every decision point in this project was the same question: is this problem deterministic or probabilistic? I kept getting the answer wrong.
“Is 1871 within 2 years of 1887?” I gave it to the LLM. It’s arithmetic. “Is this the same Ernest Cauldwell?” I gave it to the LLM. It’s a weighted score. “Martha Barker is Mother-in-Law — what does that mean?” I gave it to the LLM. It’s a pattern match.
I kept assuming these problems were probabilistic because they felt like reasoning. I spent two days and tested five models trying to get an AI to spot something that took three lines of Python. A genealogist looking at a census and deducing a maiden name looks like intelligence. It isn't. It's pattern recognition with domain knowledge: mother-in-law with a different surname equals maiden name. A genealogist doesn't reason their way to that conclusion every time; they recognise it instantly.
The question I wish I’d asked from day one: is this actually probabilistic, or is it deterministic and I just don’t know the rules yet?
I needed the LLM to discover the rules. That was its real contribution — not running the patterns in production, but demonstrating them so I could codify them. The AI was a teacher, not a worker. It showed me the patterns. I turned them into Python.
The Gap I Didn’t See Coming
I had a system that confirmed facts and rejected impossibilities. Brilliant at both. I was pleased with it.
Then I looked at the backlog.
Three hundred records that failed one gate but looked promising. A name match with the wrong district. A date two years off. A census entry without enough context to confirm. Not facts. Not impossible. Leads. And the deterministic analyser’s response to every lead was the same generic checklist: “search FreeBMD births,” “check the census,” “try Find a Grave.” The same suggestions regardless of what made each lead uncertain.
Three hundred leads. Each with an identical checklist. None investigated. The system was brilliant at the easy cases and had nothing to say about the hard ones.
What I needed was something that could read the specific uncertainty — “can’t disambiguate between two John Cauldwells born a decade apart in the same district” — and reason about what specific search would resolve it. Not “check the census” but “search the 1901 census for John in Windley — if the household contains wife Sarah and son Robert, that confirms John born 1861, not John born 1841.”
That’s not pattern matching. That’s reasoning about ambiguous evidence. The one thing I’d proved the reasoning model was good at.
The one thing I’d just removed.
I’d been right to strip out the LLM. And I’d gone too far.
Part 3 puts it back — sandwiched between two layers that never get things wrong.
Part 1: The Handbook · Part 2: Most AI Problems Are Deterministic · Part 3: The Sandwich
The Harness Handbook — 29 chapters of practical AI/ML engineering, from model fundamentals through production deployment. Built from scratch, tested against reality.