
AI Agents: Reasoning Frameworks & Architecture

Nine reasoning frameworks and patterns compared — Chain-of-Thought, ReAct, Tree of Thoughts, Plan-and-Execute, Reflexion, Graph of Thoughts, multi-agent hierarchies, constrained decoding, and self-correction.

What Is an AI Agent?

An AI agent is a system that:

  1. Perceives its environment (user input, current state, observations)
  2. Reasons about options and goals using an LLM
  3. Plans a sequence of actions or decisions
  4. Acts by calling tools, APIs, or writing files
  5. Observes the results
  6. Learns by updating memory and repeating until goal achieved

Distinction from chatbots: Agents operate autonomously. They can make decisions, take actions, fail, recover, and continue work across multiple sessions without human intervention between steps.

The Agentic Loop (The Five-Stage Cycle)

Every agent system repeats this fundamental cycle:

┌─────────────────────────────────────────┐
│  1. PERCEIVE                            │
│  Gather: user intent, current state,    │
│  environment signals, observations      │
└────────────┬────────────────────────────┘

┌─────────────────────────────────────────┐
│  2. REASON                              │
│  Model generates thoughts about what    │
│  to do next (using tools, files, etc.)  │
└────────────┬────────────────────────────┘

┌─────────────────────────────────────────┐
│  3. PLAN                                │
│  Decide on next action(s):              │
│  - Call tool with parameters            │
│  - Write file                           │
│  - Run code                             │
│  - Make decision                        │
└────────────┬────────────────────────────┘

┌─────────────────────────────────────────┐
│  4. ACT                                 │
│  Execute the plan:                      │
│  - Tool returns result                  │
│  - File written/read                    │
│  - Code executed                        │
└────────────┬────────────────────────────┘

┌─────────────────────────────────────────┐
│  5. OBSERVE                             │
│  Check results:                         │
│  - Did tool work?                       │
│  - What does output mean?               │
│  - Has goal been achieved?              │
└────────────┬────────────────────────────┘

             [LOOP back to PERCEIVE]
             OR [STOP if goal achieved]

This cycle repeats until:

  • Goal is achieved
  • Maximum iterations reached
  • User stops the agent
  • Unrecoverable error occurs
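The five stages and stopping conditions above can be sketched as a single loop. This is a minimal illustration, not a production harness: `llm`, the tool registry, and the decision format are hypothetical placeholders.

```python
# Minimal sketch of the five-stage agentic loop. `llm` is a hypothetical
# callable returning a decision dict; `tools` maps names to callables.
from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str
    observations: list = field(default_factory=list)
    iterations: int = 0

def run_agent(state, llm, tools, max_iterations=15):
    while state.iterations < max_iterations:
        # 1. PERCEIVE: gather the goal plus everything observed so far
        context = {"goal": state.goal, "observations": state.observations}
        # 2-3. REASON + PLAN: ask the model what to do next
        decision = llm(context)   # e.g. {"tool": "search", "args": {...}, "done": False}
        if decision.get("done"):
            return state          # STOP: goal achieved
        # 4. ACT: execute the chosen tool with its parameters
        result = tools[decision["tool"]](**decision["args"])
        # 5. OBSERVE: record the result for the next iteration
        state.observations.append(result)
        state.iterations += 1
    return state                  # STOP: iteration limit reached
```

The iteration cap is the simplest guard against the infinite-loop failure mode discussed later in this document.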

Agentic Reasoning Frameworks

Different frameworks determine how the agent approaches problem-solving. Choose based on task complexity.

Chain-of-Thought (CoT)

Pattern: Break down reasoning into explicit steps before generating output

Question: If a book costs $15 and you buy 3 books, how much do you spend?

CoT Response:
Step 1: Identify what we're calculating (total cost of multiple books)
Step 2: Note the unit cost ($15 per book)
Step 3: Identify the quantity (3 books)
Step 4: Calculate: 15 × 3 = $45
Step 5: Verify the answer makes sense

Answer: $45

Characteristics:

  • Simplicity: No tool use, just explicit reasoning steps
  • Interpretability: Output shows the reasoning process
  • Reliability: Better accuracy on math, logic, multi-step problems
  • Minimal overhead: Pure text, no external tools needed
  • Use case: Problems requiring step-by-step logic, verification, transparency

Limitations:

  • No external tool access (can’t search, fetch data, run code)
  • Doesn’t adapt based on feedback
  • Token cost increases with reasoning steps

Best for: Harnesses where interpretability matters, math/logic problems, compliance scenarios where you need to show work

Relationship to ReAct: ReAct extends CoT by adding tool use. ReAct = CoT thinking + actions + observations.
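Because CoT is purely prompt-driven, it can be implemented with nothing but a prompt prefix and a small answer extractor. A minimal sketch (the prefix wording and the `Answer:` convention are illustrative choices, not a standard):

```python
# Sketch of eliciting chain-of-thought with a plain instruction prefix.
COT_PREFIX = (
    "Think through this step by step. Number each step, then state "
    "the final answer on its own line prefixed with 'Answer:'."
)

def cot_prompt(question: str) -> str:
    return f"{COT_PREFIX}\n\nQuestion: {question}"

def extract_answer(response: str) -> str:
    # Take the last 'Answer:' line so intermediate steps are ignored
    lines = [l for l in response.splitlines() if l.startswith("Answer:")]
    return lines[-1].removeprefix("Answer:").strip() if lines else response.strip()
```

The extractor is what makes the reasoning steps safe to keep in the output: downstream code reads only the final line, while humans can still audit the steps.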

ReAct (Reasoning + Acting)

Pattern: Thought → Action → Observation → Thought → Action → …

Thought: I need to search for information about coffee cultivation
Action: Use web search tool
Observation: Found 3 relevant articles
Thought: Now I'll read the most relevant one
Action: Fetch article content
Observation: Article explains terroir affects quality
Thought: I have enough info to answer
Action: Generate response

Characteristics:

  • Simplicity: No complex planning, just think-act-observe
  • Prompt-driven: No special model training needed
  • Depth: Single reasoning path (can’t backtrack efficiently)
  • Speed: Fast, minimal overhead
  • Use case: Tool-use agents, information retrieval, straightforward problems

Limitations:

  • Commits to first action (can’t explore alternatives)
  • Can get stuck in loops if action doesn’t work
  • Not ideal for open-ended reasoning

Well-proven for: Tool-use agents, information retrieval (fast, widely adopted in production)
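The Thought → Action → Observation cycle above reduces to a parse-and-dispatch loop. This sketch assumes a hypothetical `llm` callable and an `Action: tool[input]` / `Final: answer` output convention; real frameworks use structured tool calls instead of regex parsing.

```python
import re

# Minimal ReAct-style loop: the model emits "Thought: ..." followed by
# either "Action: tool_name[input]" or "Final: answer".
ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*)\]")

def react_loop(question, llm, tools, max_steps=8):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)                 # next Thought/Action text
        transcript += step + "\n"
        if "Final:" in step:
            return step.split("Final:", 1)[1].strip()
        match = ACTION_RE.search(step)
        if match:
            tool, arg = match.groups()
            observation = tools[tool](arg)     # ACT
            transcript += f"Observation: {observation}\n"  # feed result back
    return None                                # gave up: step limit reached
```

Note how the observation is appended to the transcript before the next model call — that feedback edge is the part most often missed (see "Observations not feeding back" under Common Failure Modes).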

Tree of Thoughts (ToT)

Pattern: At each step, generate multiple possible next steps, then search through the tree

Initial State: "Write a Python function"
     ├─ Approach A: Start with types
     │  ├─ Strategy A1: Type hints first
     │  └─ Strategy A2: Runtime validation
     ├─ Approach B: Start with examples
     │  ├─ Strategy B1: Test-driven
     │  └─ Strategy B2: Documentation first
     └─ Approach C: Start with structure
        ├─ Strategy C1: Class-based
        └─ Strategy C2: Functional
        
[Search algorithms (BFS/DFS) evaluate which path seems most promising]

Characteristics:

  • Exploration: Generates multiple possibilities at each step
  • Backtracking: Can undo decisions if path fails
  • Comprehensive: Explores multiple solutions
  • Cost: Higher compute (explores multiple paths)
  • Use case: Complex reasoning, math problems, strategic planning

When to use:

  • Complex multi-step problems with multiple valid approaches
  • When you need to explore trade-offs (e.g., architectural decisions)
  • Math problems, puzzle solving
  • Creative tasks where multiple solutions exist
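The tree search above can be sketched as a beam-limited BFS. `propose` (generate candidate next thoughts) and `score` (rate a partial path) are hypothetical model-backed callables; real ToT implementations vary in how they prune and when they backtrack.

```python
# Breadth-first search over a thought tree with a beam-width cap.
def tree_of_thoughts(root, propose, score, beam_width=2, depth=3):
    frontier = [[root]]                       # each entry is a path of thoughts
    for _ in range(depth):
        candidates = []
        for path in frontier:
            for thought in propose(path):     # branch: several possible next steps
                candidates.append(path + [thought])
        # keep only the most promising paths (the "search" in the diagram)
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=score)           # best complete path
```

Dropping a path from the frontier is the cheap form of backtracking: a branch that scores poorly simply stops being extended.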

Plan-and-Execute (Opposite of ReAct)

Pattern: Plan entire strategy first, then execute sequentially

PLAN PHASE:
1. Break down problem into steps
2. Estimate resources, dependencies
3. Identify risks
4. Commit to sequence

EXECUTE PHASE:
1. Follow plan step-by-step
2. Report progress
3. Collect results

Characteristics:

  • Predictability: Deterministic (same inputs = same sequence)
  • Verification: Can validate plan before executing
  • Rigidity: Can’t adapt if plan becomes invalid
  • Use case: Well-defined tasks with stable requirements

When to use:

  • Software release process (pre-planned steps)
  • Data migration (fixed sequence)
  • Infrastructure provisioning (stable requirements)
  • Tasks where ability to inspect plan upfront is valuable
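The two phases above separate cleanly in code, which is exactly what makes the plan inspectable. A minimal sketch, where `planner`, `executor`, and the optional `validate` gate are hypothetical callables:

```python
# Plan-and-Execute sketch: the model produces the whole plan once, a human
# or validator can inspect it, then the steps run in a fixed order.
def plan_and_execute(goal, planner, executor, validate=None):
    plan = planner(goal)                  # PLAN PHASE: full ordered step list
    if validate and not validate(plan):   # optional upfront inspection gate
        raise ValueError("plan rejected before execution")
    results = []
    for step in plan:                     # EXECUTE PHASE: fixed sequence,
        results.append(executor(step))    # no re-planning mid-run
    return results
```

The rigidity is visible in the loop: nothing re-enters the planner, which is the trade-off to accept in exchange for predictability.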

Reflexion (Self-Critique)

Pattern: Generate response → Critique quality → Revise → Repeat

Generation: "The capital of France is London"

Critique: "This is wrong. London is UK's capital."

Revision: "The capital of France is Paris"

Final Check: "Correct"

Characteristics:

  • Self-improving: Learns from mistakes within session
  • Quality focus: Explicit quality gate
  • Overhead: Multiple passes (slower)
  • Use case: Writing, code generation, any creative task

When to use:

  • Code generation (generate → test → fix)
  • Creative writing (generate → critique → revise)
  • Any task where quality matters more than speed
  • Building high-quality documentation
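The generate → critique → revise cycle can be sketched as below. `generate`, `critique`, and `revise` are hypothetical model-backed callables; the convention that `critique` returns `None` on a pass is an illustrative choice.

```python
# Reflexion-style loop: draft, run a quality gate, revise using the critique.
def reflexion(task, generate, critique, revise, max_rounds=3):
    draft = generate(task)
    for _ in range(max_rounds):
        feedback = critique(task, draft)       # explicit quality gate
        if feedback is None:
            break                              # passed: accept the draft
        draft = revise(task, draft, feedback)  # revise using the critique text
    return draft
```

The round cap bounds the "multiple passes" overhead noted above; without it, a critic that never accepts would loop forever.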

Graph of Thoughts (GoT)

Pattern: Arbitrary graph where thoughts connect, aggregate, refine

            ┌───────┐
            │ Query │
            └───┬───┘
        ┌───────┼───────┐
     ┌──▼─┐  ┌──▼─┐  ┌──▼─┐
     │ T1 │  │ T2 │  │ T3 │
     └──┬─┘  └──┬─┘  └──┬─┘
        └───┬───┘       │
        ┌───▼───┐       │
        │ T1+T2 │◄──────┘
        └───┬───┘
        ┌───▼───┐
        │ Final │
        └───────┘

Characteristics:

  • Flexible: Thoughts can connect in any pattern
  • Aggregation: Results can combine multiple paths
  • Iterative refinement: Thoughts refine each other
  • Newer research: introduced in 2023 (Besta et al.), still maturing in production
  • Complexity: More compute in exchange for potentially better results

When to use:

  • Complex multi-faceted problems
  • Situations requiring synthesis of multiple viewpoints
  • Advanced reasoning tasks
  • Emergent research (stable patterns not yet established)
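The diagram above (branch, aggregate, refine) can be sketched as a tiny pipeline. This is only an illustration of the graph operations, not a faithful GoT implementation: `think`, `aggregate`, and `refine` are hypothetical model-backed callables, and which thoughts get merged is fixed here rather than searched.

```python
# Graph-of-Thoughts sketch matching the diagram: generate several thoughts,
# aggregate a subset (T1+T2), then refine with the remainder into a final node.
def graph_of_thoughts(query, think, aggregate, refine, branches=3):
    thoughts = [think(query, i) for i in range(branches)]  # T1..T3
    merged = aggregate(thoughts[:2])      # combine T1 and T2, as in the diagram
    return refine(merged, thoughts[2])    # fold in T3 to produce the final node
```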

Multi-Agent Hierarchical (Distributed)

Pattern: Coordinator agent delegates to specialists, results bubble up

Coordinator Agent
    ├─ Research Specialist
    │   └─ Web search, read docs
    ├─ Code Specialist
    │   └─ Generate code, test
    ├─ Writing Specialist
    │   └─ Documentation, guides
    └─ Verification Specialist
        └─ Review, validate

Characteristics:

  • Specialization: Each agent optimized for its role
  • Parallelization: Agents can work simultaneously
  • Coordination overhead: Need message passing, state sync
  • Complexity: Harder to debug, manage state
  • Use case: Large organizations, complex business processes

When to use:

  • When tasks naturally decompose into specialized subtasks
  • Building agency-like team structures
  • Parallel processing (agents work simultaneously)
  • Scaling to complex workflows
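The coordinator pattern above amounts to routing subtasks to specialists and merging their results. A minimal sketch using threads for the parallelism; the role names and specialist callables are hypothetical placeholders.

```python
# Coordinator sketch: delegate subtasks to specialist agents, results bubble up.
from concurrent.futures import ThreadPoolExecutor

def coordinate(subtasks, specialists):
    # subtasks: list of (role, payload); specialists: role -> callable
    with ThreadPoolExecutor() as pool:         # specialists can run in parallel
        futures = [pool.submit(specialists[role], payload)
                   for role, payload in subtasks]
        return [f.result() for f in futures]   # results bubble up in task order
```

In a real system each specialist would itself be an agentic loop, and the coordinator would also handle message passing and shared state — the overhead noted above.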

Constrained Decoding & Structured Output

Pattern: Force model outputs into specific formats (JSON, XML, function calls) to ensure parseable responses

Without constraint:
- "Here's a json: { name: John, age: 30 }" (unparseable, extra text)
- "The user said xyz" (wrong format entirely)

With constraint:
- {"name": "John", "age": 30} (guaranteed valid JSON)
- Constrained to exactly one of: [option_A, option_B, option_C]

Methods:

  • JSON schema validation: Specify exact format upfront, model is guided to produce valid JSON
  • Grammar constraints: Use formal grammars (GBNF) to restrict output tokens
  • Token masking: Disable invalid tokens during generation
  • Post-generation validation: Check output, retry if invalid (less efficient)

Characteristics:

  • Reliability: Eliminates parsing errors, JSON hallucinations
  • Cost: Slight per-token overhead from masking or re-scoring invalid tokens
  • Latency: Minimal impact (constraints are applied during generation, so no post-processing pass is needed)
  • Use case: Tool calling, structured APIs, requirement validation

When to use:

  • Tool/function calling (must have parseable parameters)
  • Classification into finite options
  • Generating structured data (forms, tables, databases)
  • APIs expecting exact format
  • Production systems where parsing must never fail

Implementation: Use stop_sequences, JSON-mode or grammar-constrained sampling where the API supports it, or Claude’s native tool_use blocks, which return structured tool parameters.
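When you have no access to the decoder, the post-generation validation method listed above is the fallback. A minimal sketch with stdlib only; `llm` is a hypothetical callable, and `required_keys` stands in for a full JSON schema.

```python
import json

# Post-generation validation with retry: the least efficient method above,
# but the easiest to reproduce without decoder-level constraints.
def structured_call(prompt, llm, required_keys, max_retries=2):
    for attempt in range(max_retries + 1):
        raw = llm(prompt, attempt)
        try:
            data = json.loads(raw)                 # must be valid JSON
            if all(k in data for k in required_keys):
                return data                        # parseable and complete
        except json.JSONDecodeError:
            pass
        # tighten the instruction before retrying
        prompt += "\nReturn ONLY valid JSON with keys: " + ", ".join(required_keys)
    raise ValueError("no valid structured output after retries")
```

True constrained decoding avoids the retry loop entirely, which is why it is preferred in production when the API offers it.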

Self-Correction (Distinct from Reflexion)

Pattern: Model generates output, validates it, then corrects mistakes iteratively

Iteration 1: Generate solution
             → Check: "Does this solution work?"
             
Iteration 2: If validation failed → Correct and retry
             → Check: "Is this better?"
             
Iteration 3: Accept when valid OR max iterations reached

Difference from Reflexion:

  • Reflexion: Involves external critic or separate validation model (slower, higher cost, higher quality)
  • Self-Correction: Same model corrects itself (faster, lower cost, good for known mistake patterns)

Examples:

  • Code: Generate code → Run tests → Fix failing tests
  • Math: Generate answer → Check arithmetic → Recalculate if wrong
  • Extraction: Parse data → Validate against schema → Retry parsing if invalid

Characteristics:

  • Simplicity: Single model, no external dependencies
  • Iteration count: Typically 2-3 rounds until convergence
  • Cost: Multiple generations, but usually cheaper than Reflexion
  • Quality: Good for predictable mistakes, not for deep conceptual errors

When to use:

  • When you can validate outputs programmatically (tests, schemas, rules)
  • For iterative refinement without external review
  • Time-critical scenarios (faster than full Reflexion)
  • Pattern-specific corrections (known failure modes)
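The iterate-until-valid pattern above is a short loop once the validator is programmatic. `generate` and `validator` are hypothetical callables; the `(ok, feedback)` return convention is an illustrative choice.

```python
# Self-correction sketch: the same model fixes its own output against a
# programmatic validator (tests, schema checks, rules).
def self_correct(task, generate, validator, max_iterations=3):
    output = generate(task, feedback=None)
    for _ in range(max_iterations):
        ok, feedback = validator(output)            # e.g. run tests, check schema
        if ok:
            return output                           # accept when valid
        output = generate(task, feedback=feedback)  # correct and retry
    return output                                   # max iterations reached
```

Feeding the validator's message back into `generate` is what distinguishes this from blind retrying: the model sees exactly which check failed.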

Agent Types by Function

Autonomous Agents

  • Operate independently with minimal human oversight
  • Make decisions within learned constraints
  • Example: Autonomous trading agent, recommendation engine

Assistants

  • Support human decision-making
  • Provide information, suggestions, alternatives
  • Human retains final decision authority
  • Example: Claude Code, GitHub Copilot, ChatGPT

Co-Workers

  • Humans and agents work side-by-side
  • Shared responsibility for outcomes
  • Dynamic role switching
  • Example: Pair programming, design critique partner

Tool-Using Agents

  • Specialized in calling external tools/APIs
  • Each tool invocation is a decision point
  • Good for information retrieval, data manipulation
  • Example: Open-source agent frameworks with skills, ReAct agents, web search agents

Reasoning Frameworks Comparison Reference

This table is the authoritative reference for all reasoning frameworks. Use it to choose, compare, or understand frameworks used elsewhere in the handbook.

| Framework | Simplicity | Speed | Cost | Best For | Examples |
|---|---|---|---|---|---|
| Chain-of-Thought (CoT) | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | $ | Math, logic, step-by-step reasoning; interpretability needed | "Step 1: calculate…, Step 2: verify…" |
| ReAct | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | $$ | Tool-using agents, information retrieval, web tasks | Claude Code, web search agents, API callers |
| Tree of Thoughts (ToT) | ⭐⭐⭐ | ⭐⭐⭐ | $$$ | Complex multi-path problems, optimization, strategy | Chess moves, architectural decisions, puzzles |
| Plan-and-Execute | ⭐⭐⭐⭐ | ⭐⭐⭐ | $$ | Well-defined sequential tasks, release processes | Software releases, data migrations, provisioning |
| Reflexion | ⭐⭐⭐ | ⭐⭐ | $$$$ | Quality-critical work, perfectionism needed | Code generation with testing, creative writing review |
| Graph of Thoughts (GoT) | ⭐⭐ | ⭐⭐ | $$$$ | Multi-perspective synthesis, emergent reasoning | Research synthesis, novel problem-solving |
| Hierarchical (Multi-Agent) | ⭐⭐ | ⭐⭐ | $$$ | Task decomposition, team-like coordination | Complex projects, specialized subtasks, organizations |

Legend:

  • Simplicity: Easier to implement and understand (5 stars = simplest)
  • Speed: Token efficiency and latency (5 stars = fastest)
  • Cost: Relative computational cost ($ = cheapest, $$$$ = most expensive)
  • Best For: Primary use cases and real-world applications

Cross-references:

  • Detailed explanations of each framework in sections above
  • Production patterns in doc 06 (Harness Architecture)
  • Implementation patterns in doc 14 (Advanced Patterns)
  • Troubleshooting stuck patterns in doc 18 (Troubleshooting & FAQ)
  • Prompt design in doc 15 (Prompt Engineering)

| Harness Type | Recommended Framework | Reasoning |
|---|---|---|
| Assistant (tool-use) | ReAct | Speed, simplicity, proven in production (Claude Code) |
| Long-running feature building | Plan-and-Execute | Predictability, can inspect plan before executing |
| Code generation | Reflexion | Quality gate, test before accepting |
| Complex reasoning | Tree of Thoughts | Can explore alternatives, backtrack |
| Multi-step specialized tasks | Hierarchical | Delegate to specialists |
| Research/discovery | Graph of Thoughts | Multiple reasoning paths converge |

Implementation Checklist

  • Choose primary framework (ReAct recommended for starting)
  • Implement perception layer (gather current state, observations)
  • Define tool set (what can agent call/do)
  • Implement action execution (tool calls, file writes, etc.)
  • Set up observation capture (what to learn from results)
  • Define stopping conditions (goal achieved? max iterations? error?)
  • Add memory integration (learn from observations)
  • Test in simple scenario before scaling
  • Monitor loop iterations (are agents getting stuck?)
  • Implement backoff/retry logic (fail gracefully)

When NOT to Use an Agent

Not every AI task needs an agent. The agentic loop is powerful, but it introduces probabilistic decision-making at every step. Some tasks need certainty, not probability.

The Core Framing: Probabilistic vs Deterministic

LLMs produce probability. Every output is a best guess based on training data and context. This is a feature for creative reasoning, strategy, and natural language understanding. It is a liability for tasks that have exact answers.

Do Not Use an Agent When…

The task is actually computation, not reasoning. Date arithmetic, geographic distance calculation, string matching, record deduplication with clean fields — these have exact answers. Python computes them perfectly. An LLM introduces errors.

# An LLM might get this wrong:
#   "Is 1871 within 2 years of 1887?"  →  "Yes" (wrong)

# Python always gets it right:
abs(1871 - 1887) <= 2  # → False

Structured data matching has clean fields. If you have two database records with name, date, and location fields, and you need to decide if they match — probability introduces errors. Write matching logic in Python.

Errors compound. When an agent records a wrong fact, all subsequent decisions build on that error. If your task is a chain where each step depends on the previous one being correct, keep a human in the loop for verification. One wrong genealogical connection, one misidentified legal precedent, one incorrect medical correlation — and the entire downstream analysis is corrupted.

The Research Companion Alternative

Instead of a full agent (LLM decides everything), consider the Research Companion pattern (Doc 14): the LLM advises what to investigate, Python executes the search, and a human decides what is true.

  • A wrong question wastes one search (low cost)
  • A wrong answer corrupts data (high cost)

Apply the LLM to questions, not answers, when accuracy matters.

Rule of Thumb

If you can write the logic as an if-statement or formula, do not use an LLM. Reserve the LLM for tasks that genuinely require language understanding, creative reasoning, or judgement under ambiguity.

| Task | Use LLM? | Why |
|---|---|---|
| "Is this email spam or not?" | Yes | Requires language understanding |
| "Is 1871 within 2 years of 1887?" | No | Simple arithmetic |
| "What research strategy should I try next?" | Yes | Requires creative reasoning |
| "Do these two records refer to the same person?" (clean fields) | No | Deterministic field comparison |
| "Do these two records refer to the same person?" (messy, incomplete data) | Maybe | Judgement under ambiguity, but validate carefully |
| "Summarise this document" | Yes | Language understanding |
| "Calculate the distance between two coordinates" | No | Formula |

Validation Checklist

How do you know you got this right?

Performance Checks

  • Agent completes simple task (3-5 steps) in <30 seconds
  • Loop iterations stable: most tasks <10 iterations, max 15
  • No infinite loops: iteration limit enforced and working
  • Token efficiency: task completion <50K tokens (not bloated prompts)

Implementation Checks

  • Chose framework based on task type (ReAct for tools, ToT for multi-path)
  • Perception working: agent can read current state and user input
  • Actions execute: at least 3 tools called successfully in a session
  • Observations captured: agent learns from tool results, adjusts next action
  • Stopping conditions defined: goal detection or max iterations works
  • Tested framework on 5+ different task types
  • Know which framework strengths/weaknesses apply to your domain

Integration Checks

  • Agentic loop integrates with memory system (working memory accumulates)
  • Tool calls validate against tool registry (tool_name + input schema)
  • Error handling in place: failed tools don’t crash loop, agent continues
  • Observation loop closes: agent sees tool results before deciding next action

Common Failure Modes

  • Agent stuck repeating same action: Error message unclear; improve tool error text
  • Framework overkill for simple task: Using ToT for binary decision; switch to ReAct
  • Observations not feeding back: Tool results not included in next LLM prompt
  • No stopping logic: Agent doesn’t know when to stop; add termination condition check
  • Framework mismatch: Picked Plan-and-Execute for dynamic discovery task; use ReAct

Sign-Off Criteria

  • Chosen framework tested and validated for your use case
  • Agent completes 5 representative tasks successfully
  • Loop iterations and token usage within budgets
  • Framework comparison done: why is this better than alternatives?
  • Edge cases tested: what does agent do with ambiguous input? missing tool?

See Also

  • Doc 06 (Harness Architecture): Agent framework is component 4 (Planning Loop)
  • Doc 08 (Claw-Code Python): Working examples of ReAct implementation
  • Doc 15 (Prompt Engineering): System prompts guide agent behavior within chosen framework

2025–2026 Outlook

  • ReAct and Plan-Execute form the foundational patterns (proven, stable)
  • Research focus: Overcoming limitations with more sophisticated architectures
  • Production systems: Multi-agent teams working as specialized units
  • Trend: SLMs (small language models) over LLMs for agentic loops where speed is critical and latency budgets are tight
  • Emerging: Graph of Thoughts and its variants for complex reasoning