
AI Agents: Reasoning Frameworks & Architecture

Nine reasoning frameworks and patterns compared — Chain-of-Thought, ReAct, Tree of Thoughts, Plan-and-Execute, Reflexion, Graph of Thoughts, multi-agent hierarchies, constrained decoding, and self-correction.

What Is an AI Agent?

An AI agent is a system that:

  1. Perceives its environment (user input, current state, observations)
  2. Reasons about options and goals using an LLM
  3. Plans a sequence of actions or decisions
  4. Acts by calling tools, APIs, or writing files
  5. Observes the results
  6. Learns by updating memory and repeating until goal achieved

Distinction from chatbots: Agents operate autonomously. They can make decisions, take actions, fail, recover, and continue work across multiple sessions without human intervention between steps.

The Agentic Loop (The Five-Stage Cycle)

Every agent system repeats this fundamental cycle:

┌─────────────────────────────────────────┐
│  1. PERCEIVE                            │
│  Gather: user intent, current state,    │
│  environment signals, observations      │
└────────────┬────────────────────────────┘

┌─────────────────────────────────────────┐
│  2. REASON                              │
│  Model generates thoughts about what    │
│  to do next (using tools, files, etc.)  │
└────────────┬────────────────────────────┘

┌─────────────────────────────────────────┐
│  3. PLAN                                │
│  Decide on next action(s):              │
│  - Call tool with parameters            │
│  - Write file                           │
│  - Run code                             │
│  - Make decision                        │
└────────────┬────────────────────────────┘

┌─────────────────────────────────────────┐
│  4. ACT                                 │
│  Execute the plan:                      │
│  - Tool returns result                  │
│  - File written/read                    │
│  - Code executed                        │
└────────────┬────────────────────────────┘

┌─────────────────────────────────────────┐
│  5. OBSERVE                             │
│  Check results:                         │
│  - Did tool work?                       │
│  - What does output mean?               │
│  - Has goal been achieved?              │
└────────────┬────────────────────────────┘

             [LOOP back to PERCEIVE]
             OR [STOP if goal achieved]

This cycle repeats until:

  • Goal is achieved
  • Maximum iterations reached
  • User stops the agent
  • Unrecoverable error occurs
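The five stages and stopping conditions above can be sketched as a single loop. This is a minimal illustration, not a production harness: `llm`, the tool registry, and the decision format are hypothetical placeholders.

```python
# Minimal sketch of the five-stage agentic loop. `llm` is a hypothetical
# callable returning a decision dict; `tools` maps names to callables.
from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str
    observations: list = field(default_factory=list)
    iterations: int = 0

def run_agent(state, llm, tools, max_iterations=15):
    while state.iterations < max_iterations:
        # 1. PERCEIVE: gather the goal plus everything observed so far
        context = {"goal": state.goal, "observations": state.observations}
        # 2-3. REASON + PLAN: ask the model what to do next
        decision = llm(context)   # e.g. {"tool": "search", "args": {...}, "done": False}
        if decision.get("done"):
            return state          # STOP: goal achieved
        # 4. ACT: execute the chosen tool with its parameters
        result = tools[decision["tool"]](**decision["args"])
        # 5. OBSERVE: record the result for the next iteration
        state.observations.append(result)
        state.iterations += 1
    return state                  # STOP: iteration limit reached
```

The iteration cap is the simplest guard against the infinite-loop failure mode discussed later in this document.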

Agentic Reasoning Frameworks

Different frameworks determine how the agent approaches problem-solving. Choose based on task complexity.

Chain-of-Thought (CoT)

Pattern: Break down reasoning into explicit steps before generating output

Question: If a book costs $15 and you buy 3 books, how much do you spend?

CoT Response:
Step 1: Identify what we're calculating (total cost of multiple books)
Step 2: Note the unit cost ($15 per book)
Step 3: Identify the quantity (3 books)
Step 4: Calculate: 15 × 3 = $45
Step 5: Verify the answer makes sense

Answer: $45

Characteristics:

  • Simplicity: No tool use, just explicit reasoning steps
  • Interpretability: Output shows the reasoning process
  • Reliability: Better accuracy on math, logic, multi-step problems
  • Minimal overhead: Pure text, no external tools needed
  • Use case: Problems requiring step-by-step logic, verification, transparency

Limitations:

  • No external tool access (can’t search, fetch data, run code)
  • Doesn’t adapt based on feedback
  • Token cost increases with reasoning steps

Best for: Harnesses where interpretability matters, math/logic problems, compliance scenarios where you need to show work

Relationship to ReAct: ReAct extends CoT by adding tool use. ReAct = CoT thinking + actions + observations.
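Because CoT is purely prompt-driven, it can be implemented with nothing but a prompt prefix and a small answer extractor. A minimal sketch (the prefix wording and the `Answer:` convention are illustrative choices, not a standard):

```python
# Sketch of eliciting chain-of-thought with a plain instruction prefix.
COT_PREFIX = (
    "Think through this step by step. Number each step, then state "
    "the final answer on its own line prefixed with 'Answer:'."
)

def cot_prompt(question: str) -> str:
    return f"{COT_PREFIX}\n\nQuestion: {question}"

def extract_answer(response: str) -> str:
    # Take the last 'Answer:' line so intermediate steps are ignored
    lines = [l for l in response.splitlines() if l.startswith("Answer:")]
    return lines[-1].removeprefix("Answer:").strip() if lines else response.strip()
```

The extractor is what makes the reasoning steps safe to keep in the output: downstream code reads only the final line, while humans can still audit the steps.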

ReAct (Reasoning + Acting)

Pattern: Thought → Action → Observation → Thought → Action → …

Thought: I need to search for information about coffee cultivation
Action: Use web search tool
Observation: Found 3 relevant articles
Thought: Now I'll read the most relevant one
Action: Fetch article content
Observation: Article explains terroir affects quality
Thought: I have enough info to answer
Action: Generate response

Characteristics:

  • Simplicity: No complex planning, just think-act-observe
  • Prompt-driven: No special model training needed
  • Depth: Single reasoning path (can’t backtrack efficiently)
  • Speed: Fast, minimal overhead
  • Use case: Tool-use agents, information retrieval, straightforward problems

Limitations:

  • Commits to first action (can’t explore alternatives)
  • Can get stuck in loops if action doesn’t work
  • Not ideal for open-ended reasoning

Well-proven for: Tool-use agents, information retrieval (fast, widely adopted in production)
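The Thought → Action → Observation cycle above reduces to a parse-and-dispatch loop. This sketch assumes a hypothetical `llm` callable and an `Action: tool[input]` / `Final: answer` output convention; real frameworks use structured tool calls instead of regex parsing.

```python
import re

# Minimal ReAct-style loop: the model emits "Thought: ..." followed by
# either "Action: tool_name[input]" or "Final: answer".
ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*)\]")

def react_loop(question, llm, tools, max_steps=8):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)                 # next Thought/Action text
        transcript += step + "\n"
        if "Final:" in step:
            return step.split("Final:", 1)[1].strip()
        match = ACTION_RE.search(step)
        if match:
            tool, arg = match.groups()
            observation = tools[tool](arg)     # ACT
            transcript += f"Observation: {observation}\n"  # feed result back
    return None                                # gave up: step limit reached
```

Note how the observation is appended to the transcript before the next model call — that feedback edge is the part most often missed (see "Observations not feeding back" under Common Failure Modes).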

Tree of Thoughts (ToT)

Pattern: At each step, generate multiple possible next steps, then search through the tree

Initial State: "Write a Python function"
     ├─ Approach A: Start with types
     │  ├─ Strategy A1: Type hints first
     │  └─ Strategy A2: Runtime validation
     ├─ Approach B: Start with examples
     │  ├─ Strategy B1: Test-driven
     │  └─ Strategy B2: Documentation first
     └─ Approach C: Start with structure
        ├─ Strategy C1: Class-based
        └─ Strategy C2: Functional
        
[Search algorithms (BFS/DFS) evaluate which path seems most promising]

Characteristics:

  • Exploration: Generates multiple possibilities at each step
  • Backtracking: Can undo decisions if path fails
  • Comprehensive: Explores multiple solutions
  • Cost: Higher compute (explores multiple paths)
  • Use case: Complex reasoning, math problems, strategic planning

When to use:

  • Complex multi-step problems with multiple valid approaches
  • When you need to explore trade-offs (e.g., architectural decisions)
  • Math problems, puzzle solving
  • Creative tasks where multiple solutions exist
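The tree search above can be sketched as a beam-limited BFS. `propose` (generate candidate next thoughts) and `score` (rate a partial path) are hypothetical model-backed callables; real ToT implementations vary in how they prune and when they backtrack.

```python
# Breadth-first search over a thought tree with a beam-width cap.
def tree_of_thoughts(root, propose, score, beam_width=2, depth=3):
    frontier = [[root]]                       # each entry is a path of thoughts
    for _ in range(depth):
        candidates = []
        for path in frontier:
            for thought in propose(path):     # branch: several possible next steps
                candidates.append(path + [thought])
        # keep only the most promising paths (the "search" in the diagram)
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=score)           # best complete path
```

Dropping a path from the frontier is the cheap form of backtracking: a branch that scores poorly simply stops being extended.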

Plan-and-Execute (Opposite of ReAct)

Pattern: Plan entire strategy first, then execute sequentially

PLAN PHASE:
1. Break down problem into steps
2. Estimate resources, dependencies
3. Identify risks
4. Commit to sequence

EXECUTE PHASE:
1. Follow plan step-by-step
2. Report progress
3. Collect results

Characteristics:

  • Predictability: Deterministic (same inputs = same sequence)
  • Verification: Can validate plan before executing
  • Rigidity: Can’t adapt if plan becomes invalid
  • Use case: Well-defined tasks with stable requirements

When to use:

  • Software release process (pre-planned steps)
  • Data migration (fixed sequence)
  • Infrastructure provisioning (stable requirements)
  • Tasks where ability to inspect plan upfront is valuable
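The two phases above separate cleanly in code, which is exactly what makes the plan inspectable. A minimal sketch, where `planner`, `executor`, and the optional `validate` gate are hypothetical callables:

```python
# Plan-and-Execute sketch: the model produces the whole plan once, a human
# or validator can inspect it, then the steps run in a fixed order.
def plan_and_execute(goal, planner, executor, validate=None):
    plan = planner(goal)                  # PLAN PHASE: full ordered step list
    if validate and not validate(plan):   # optional upfront inspection gate
        raise ValueError("plan rejected before execution")
    results = []
    for step in plan:                     # EXECUTE PHASE: fixed sequence,
        results.append(executor(step))    # no re-planning mid-run
    return results
```

The rigidity is visible in the loop: nothing re-enters the planner, which is the trade-off to accept in exchange for predictability.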

Reflexion (Self-Critique)

Pattern: Generate response → Critique quality → Revise → Repeat

Generation: "The capital of France is London"

Critique: "This is wrong. London is UK's capital."

Revision: "The capital of France is Paris"

Final Check: "Correct"

Characteristics:

  • Self-improving: Learns from mistakes within session
  • Quality focus: Explicit quality gate
  • Overhead: Multiple passes (slower)
  • Use case: Writing, code generation, any creative task

When to use:

  • Code generation (generate → test → fix)
  • Creative writing (generate → critique → revise)
  • Any task where quality matters more than speed
  • Building high-quality documentation
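The generate → critique → revise cycle can be sketched as below. `generate`, `critique`, and `revise` are hypothetical model-backed callables; the convention that `critique` returns `None` on a pass is an illustrative choice.

```python
# Reflexion-style loop: draft, run a quality gate, revise using the critique.
def reflexion(task, generate, critique, revise, max_rounds=3):
    draft = generate(task)
    for _ in range(max_rounds):
        feedback = critique(task, draft)       # explicit quality gate
        if feedback is None:
            break                              # passed: accept the draft
        draft = revise(task, draft, feedback)  # revise using the critique text
    return draft
```

The round cap bounds the "multiple passes" overhead noted above; without it, a critic that never accepts would loop forever.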

Graph of Thoughts (GoT)

Pattern: Arbitrary graph where thoughts connect, aggregate, refine

            ┌───────┐
            │ Query │
            └───┬───┘
        ┌───────┼───────┐
     ┌──▼─┐  ┌──▼─┐  ┌──▼─┐
     │ T1 │  │ T2 │  │ T3 │
     └──┬─┘  └──┬─┘  └──┬─┘
        └───┬───┘       │
        ┌───▼───┐       │
        │ T1+T2 │◄──────┘
        └───┬───┘
        ┌───▼───┐
        │ Final │
        └───────┘

Characteristics:

  • Flexible: Thoughts can connect in any pattern
  • Aggregation: Results can combine multiple paths
  • Iterative refinement: Thoughts refine each other
  • Newer research: introduced in 2023 (Besta et al.), still maturing in production
  • Complexity: More compute in exchange for potentially better results

When to use:

  • Complex multi-faceted problems
  • Situations requiring synthesis of multiple viewpoints
  • Advanced reasoning tasks
  • Emergent research (stable patterns not yet established)
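The diagram above (branch, aggregate, refine) can be sketched as a tiny pipeline. This is only an illustration of the graph operations, not a faithful GoT implementation: `think`, `aggregate`, and `refine` are hypothetical model-backed callables, and which thoughts get merged is fixed here rather than searched.

```python
# Graph-of-Thoughts sketch matching the diagram: generate several thoughts,
# aggregate a subset (T1+T2), then refine with the remainder into a final node.
def graph_of_thoughts(query, think, aggregate, refine, branches=3):
    thoughts = [think(query, i) for i in range(branches)]  # T1..T3
    merged = aggregate(thoughts[:2])      # combine T1 and T2, as in the diagram
    return refine(merged, thoughts[2])    # fold in T3 to produce the final node
```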

Multi-Agent Hierarchical (Distributed)

Pattern: Coordinator agent delegates to specialists, results bubble up

Coordinator Agent
    ├─ Research Specialist
    │   └─ Web search, read docs
    ├─ Code Specialist
    │   └─ Generate code, test
    ├─ Writing Specialist
    │   └─ Documentation, guides
    └─ Verification Specialist
        └─ Review, validate

Characteristics:

  • Specialization: Each agent optimized for its role
  • Parallelization: Agents can work simultaneously
  • Coordination overhead: Need message passing, state sync
  • Complexity: Harder to debug, manage state
  • Use case: Large organizations, complex business processes

When to use:

  • When tasks naturally decompose into specialized subtasks
  • Building agency-like team structures
  • Parallel processing (agents work simultaneously)
  • Scaling to complex workflows
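The coordinator pattern above amounts to routing subtasks to specialists and merging their results. A minimal sketch using threads for the parallelism; the role names and specialist callables are hypothetical placeholders.

```python
# Coordinator sketch: delegate subtasks to specialist agents, results bubble up.
from concurrent.futures import ThreadPoolExecutor

def coordinate(subtasks, specialists):
    # subtasks: list of (role, payload); specialists: role -> callable
    with ThreadPoolExecutor() as pool:         # specialists can run in parallel
        futures = [pool.submit(specialists[role], payload)
                   for role, payload in subtasks]
        return [f.result() for f in futures]   # results bubble up in task order
```

In a real system each specialist would itself be an agentic loop, and the coordinator would also handle message passing and shared state — the overhead noted above.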

Constrained Decoding & Structured Output

Pattern: Force model outputs into specific formats (JSON, XML, function calls) to ensure parseable responses

Without constraint:
- "Here's a json: { name: John, age: 30 }" (unparseable, extra text)
- "The user said xyz" (wrong format entirely)

With constraint:
- {"name": "John", "age": 30} (guaranteed valid JSON)
- Constrained to exactly one of: [option_A, option_B, option_C]

Methods:

  • JSON schema validation: Specify exact format upfront, model is guided to produce valid JSON
  • Grammar constraints: Use formal grammars (GBNF) to restrict output tokens
  • Token masking: Disable invalid tokens during generation
  • Post-generation validation: Check output, retry if invalid (less efficient)

Characteristics:

  • Reliability: Eliminates parsing errors, JSON hallucinations
  • Cost: Slight per-token overhead from masking or re-scoring invalid tokens
  • Latency: Minimal impact (constraints are applied during generation, so no post-processing pass is needed)
  • Use case: Tool calling, structured APIs, requirement validation

When to use:

  • Tool/function calling (must have parseable parameters)
  • Classification into finite options
  • Generating structured data (forms, tables, databases)
  • APIs expecting exact format
  • Production systems where parsing must never fail

Implementation: Use stop_sequences, JSON-mode or grammar-constrained sampling where the API supports it, or Claude’s native tool_use blocks, which return structured tool parameters.
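When you have no access to the decoder, the post-generation validation method listed above is the fallback. A minimal sketch with stdlib only; `llm` is a hypothetical callable, and `required_keys` stands in for a full JSON schema.

```python
import json

# Post-generation validation with retry: the least efficient method above,
# but the easiest to reproduce without decoder-level constraints.
def structured_call(prompt, llm, required_keys, max_retries=2):
    for attempt in range(max_retries + 1):
        raw = llm(prompt, attempt)
        try:
            data = json.loads(raw)                 # must be valid JSON
            if all(k in data for k in required_keys):
                return data                        # parseable and complete
        except json.JSONDecodeError:
            pass
        # tighten the instruction before retrying
        prompt += "\nReturn ONLY valid JSON with keys: " + ", ".join(required_keys)
    raise ValueError("no valid structured output after retries")
```

True constrained decoding avoids the retry loop entirely, which is why it is preferred in production when the API offers it.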

Self-Correction (Distinct from Reflexion)

Pattern: Model generates output, validates it, then corrects mistakes iteratively

Iteration 1: Generate solution
             → Check: "Does this solution work?"
             
Iteration 2: If validation failed → Correct and retry
             → Check: "Is this better?"
             
Iteration 3: Accept when valid OR max iterations reached

Difference from Reflexion:

  • Reflexion: Involves external critic or separate validation model (slower, higher cost, higher quality)
  • Self-Correction: Same model corrects itself (faster, lower cost, good for known mistake patterns)

Examples:

  • Code: Generate code → Run tests → Fix failing tests
  • Math: Generate answer → Check arithmetic → Recalculate if wrong
  • Extraction: Parse data → Validate against schema → Retry parsing if invalid

Characteristics:

  • Simplicity: Single model, no external dependencies
  • Iteration count: Typically 2-3 rounds until convergence
  • Cost: Multiple generations, but usually cheaper than Reflexion
  • Quality: Good for predictable mistakes, not for deep conceptual errors

When to use:

  • When you can validate outputs programmatically (tests, schemas, rules)
  • For iterative refinement without external review
  • Time-critical scenarios (faster than full Reflexion)
  • Pattern-specific corrections (known failure modes)
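The iterate-until-valid pattern above is a short loop once the validator is programmatic. `generate` and `validator` are hypothetical callables; the `(ok, feedback)` return convention is an illustrative choice.

```python
# Self-correction sketch: the same model fixes its own output against a
# programmatic validator (tests, schema checks, rules).
def self_correct(task, generate, validator, max_iterations=3):
    output = generate(task, feedback=None)
    for _ in range(max_iterations):
        ok, feedback = validator(output)            # e.g. run tests, check schema
        if ok:
            return output                           # accept when valid
        output = generate(task, feedback=feedback)  # correct and retry
    return output                                   # max iterations reached
```

Feeding the validator's message back into `generate` is what distinguishes this from blind retrying: the model sees exactly which check failed.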

Agent Types by Function

Autonomous Agents

  • Operate independently with minimal human oversight
  • Make decisions within learned constraints
  • Example: Autonomous trading agent, recommendation engine

Assistants

  • Support human decision-making
  • Provide information, suggestions, alternatives
  • Human retains final decision authority
  • Example: Claude Code, GitHub Copilot, ChatGPT

Co-Workers

  • Humans and agents work side-by-side
  • Shared responsibility for outcomes
  • Dynamic role switching
  • Example: Pair programming, design critique partner

Tool-Using Agents

  • Specialized in calling external tools/APIs
  • Each tool invocation is a decision point
  • Good for information retrieval, data manipulation
  • Example: Open-source agent frameworks with skills, ReAct agents, web search agents

Reasoning Frameworks Comparison Reference

This table is the authoritative reference for all reasoning frameworks. Use it to choose, compare, or understand frameworks used elsewhere in the handbook.

| Framework | Simplicity | Speed | Cost | Best For | Examples |
|---|---|---|---|---|---|
| Chain-of-Thought (CoT) | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | $ | Math, logic, step-by-step reasoning; interpretability needed | "Step 1: calculate…, Step 2: verify…" |
| ReAct | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | $$ | Tool-using agents, information retrieval, web tasks | Claude Code, web search agents, API callers |
| Tree of Thoughts (ToT) | ⭐⭐⭐ | ⭐⭐⭐ | $$$ | Complex multi-path problems, optimization, strategy | Chess moves, architectural decisions, puzzles |
| Plan-and-Execute | ⭐⭐⭐⭐ | ⭐⭐⭐ | $$ | Well-defined sequential tasks, release processes | Software releases, data migrations, provisioning |
| Reflexion | ⭐⭐⭐ | ⭐⭐ | $$$$ | Quality-critical work, perfectionism needed | Code generation with testing, creative writing review |
| Graph of Thoughts (GoT) | ⭐⭐ | ⭐⭐ | $$$$ | Multi-perspective synthesis, emergent reasoning | Research synthesis, novel problem-solving |
| Hierarchical (Multi-Agent) | ⭐⭐ | ⭐⭐ | $$$ | Task decomposition, team-like coordination | Complex projects, specialized subtasks, organizations |

Legend:

  • Simplicity: Easier to implement and understand (5 stars = simplest)
  • Speed: Token efficiency and latency (5 stars = fastest)
  • Cost: Relative computational cost ($ = cheapest, $$$$ = most expensive)
  • Best For: Primary use cases and real-world applications

Cross-references:

  • Detailed explanations of each framework in sections above
  • Production patterns in doc 06 (Harness Architecture)
  • Implementation patterns in doc 14 (Advanced Patterns)
  • Troubleshooting stuck patterns in doc 18 (Troubleshooting & FAQ)
  • Prompt design in doc 15 (Prompt Engineering)

| Harness Type | Recommended Framework | Reasoning |
|---|---|---|
| Assistant (tool-use) | ReAct | Speed, simplicity, proven in production (Claude Code) |
| Long-running feature building | Plan-and-Execute | Predictability, can inspect plan before executing |
| Code generation | Reflexion | Quality gate, test before accepting |
| Complex reasoning | Tree of Thoughts | Can explore alternatives, backtrack |
| Multi-step specialized tasks | Hierarchical | Delegate to specialists |
| Research/discovery | Graph of Thoughts | Multiple reasoning paths converge |

Implementation Checklist

  • Choose primary framework (ReAct recommended for starting)
  • Implement perception layer (gather current state, observations)
  • Define tool set (what can agent call/do)
  • Implement action execution (tool calls, file writes, etc.)
  • Set up observation capture (what to learn from results)
  • Define stopping conditions (goal achieved? max iterations? error?)
  • Add memory integration (learn from observations)
  • Test in simple scenario before scaling
  • Monitor loop iterations (are agents getting stuck?)
  • Implement backoff/retry logic (fail gracefully)

When NOT to Use an Agent

Not every AI task needs an agent. The agentic loop is powerful, but it introduces probabilistic decision-making at every step. Some tasks need certainty, not probability.

The Core Framing: Probabilistic vs Deterministic

LLMs produce probability. Every output is a best guess based on training data and context. This is a feature for creative reasoning, strategy, and natural language understanding. It is a liability for tasks that have exact answers.

Do Not Use an Agent When…

The task is actually computation, not reasoning. Date arithmetic, geographic distance calculation, string matching, record deduplication with clean fields — these have exact answers. Python computes them perfectly. An LLM introduces errors.

# An LLM might get this wrong:
#   "Is 1871 within 2 years of 1887?"  →  "Yes" (wrong)

# Python always gets it right:
abs(1871 - 1887) <= 2  # → False

Structured data matching has clean fields. If you have two database records with name, date, and location fields, and you need to decide if they match — probability introduces errors. Write matching logic in Python.

Errors compound. When an agent records a wrong fact, all subsequent decisions build on that error. If your task is a chain where each step depends on the previous one being correct, keep a human in the loop for verification. One wrong genealogical connection, one misidentified legal precedent, one incorrect medical correlation — and the entire downstream analysis is corrupted.

The Research Companion Alternative

Instead of a full agent (LLM decides everything), consider the Research Companion pattern (Doc 14): the LLM advises what to investigate, Python executes the search, and a human decides what is true.

  • A wrong question wastes one search (low cost)
  • A wrong answer corrupts data (high cost)

Apply the LLM to questions, not answers, when accuracy matters.

Rule of Thumb

If you can write the logic as an if-statement or formula, do not use an LLM. Reserve the LLM for tasks that genuinely require language understanding, creative reasoning, or judgement under ambiguity.

| Task | Use LLM? | Why |
|---|---|---|
| "Is this email spam or not?" | Yes | Requires language understanding |
| "Is 1871 within 2 years of 1887?" | No | Simple arithmetic |
| "What research strategy should I try next?" | Yes | Requires creative reasoning |
| "Do these two records refer to the same person?" (clean fields) | No | Deterministic field comparison |
| "Do these two records refer to the same person?" (messy, incomplete data) | Maybe | Judgement under ambiguity, but validate carefully |
| "Summarise this document" | Yes | Language understanding |
| "Calculate the distance between two coordinates" | No | Formula |

Validation Checklist

How do you know you got this right?

Performance Checks

  • Agent completes simple task (3-5 steps) in <30 seconds
  • Loop iterations stable: most tasks <10 iterations, max 15
  • No infinite loops: iteration limit enforced and working
  • Token efficiency: task completion <50K tokens (not bloated prompts)

Implementation Checks

  • Chose framework based on task type (ReAct for tools, ToT for multi-path)
  • Perception working: agent can read current state and user input
  • Actions execute: at least 3 tools called successfully in a session
  • Observations captured: agent learns from tool results, adjusts next action
  • Stopping conditions defined: goal detection or max iterations works
  • Tested framework on 5+ different task types
  • Know which framework strengths/weaknesses apply to your domain

Integration Checks

  • Agentic loop integrates with memory system (working memory accumulates)
  • Tool calls validate against tool registry (tool_name + input schema)
  • Error handling in place: failed tools don’t crash loop, agent continues
  • Observation loop closes: agent sees tool results before deciding next action

Common Failure Modes

  • Agent stuck repeating same action: Error message unclear; improve tool error text
  • Framework overkill for simple task: Using ToT for binary decision; switch to ReAct
  • Observations not feeding back: Tool results not included in next LLM prompt
  • No stopping logic: Agent doesn’t know when to stop; add termination condition check
  • Framework mismatch: Picked Plan-and-Execute for dynamic discovery task; use ReAct

Sign-Off Criteria

  • Chosen framework tested and validated for your use case
  • Agent completes 5 representative tasks successfully
  • Loop iterations and token usage within budgets
  • Framework comparison done: why is this better than alternatives?
  • Edge cases tested: what does agent do with ambiguous input? missing tool?

See Also

  • Doc 06 (Harness Architecture): Agent framework is component 4 (Planning Loop)
  • Doc 08 (Claw-Code Python): Working examples of ReAct implementation
  • Doc 15 (Prompt Engineering): System prompts guide agent behavior within chosen framework

2025–2026 Outlook

  • ReAct and Plan-Execute form the foundational patterns (proven, stable)
  • Research focus: Overcoming limitations with more sophisticated architectures
  • Production systems: Multi-agent teams working as specialized units
  • Trend: SLMs (small language models) over LLMs for agentic loops where speed is critical and latency budgets are tight
  • Emerging: Graph of Thoughts and its variants for complex reasoning