
Memory in AI Systems: Layered Architecture

Four-layer memory architecture (context, working, persistent, auto-dream), RAG, the LLM Wiki pattern (compiled markdown knowledge), and unified reference diagram.

Memory in AI systems operates at multiple levels, mirroring biological cognition. A well-designed harness uses all four layers effectively.

Four Types of Memory

Layer 1: Context / Session Memory

What: The current conversation or execution context window

  • Duration: Single session only
  • Capacity: Limited (context window size: 128K–200K tokens typical)
  • Access: Immediate, included in every prompt
  • Cost: Counted per token (expensive for long contexts)
  • Use: Current task state, immediate history, working variables

In your harness:

  • Load CLAUDE.md instructions (5K–10K tokens)
  • Load compact MEMORY.md index (≤200 lines / 25K tokens)
  • Load current feature/progress file (2K–5K tokens)
  • Reserve remainder for session history and reasoning
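The startup load above can be sketched as a small helper. This is a minimal sketch, assuming the file layout used in this doc (`CLAUDE.md`, `.claude/memory/MEMORY.md`, `progress.md`); only the first 200 lines of the index are loaded, with topic files left for on-demand loading.

```python
from pathlib import Path

def load_startup_context(root: Path, index_lines: int = 200) -> str:
    """Assemble the fixed startup portion of the context window."""
    parts = []
    # Instructions are always loaded in full.
    parts.append((root / "CLAUDE.md").read_text())
    # Only the first `index_lines` lines of the memory index are loaded;
    # topic files it points to are fetched on demand later.
    index = (root / ".claude" / "memory" / "MEMORY.md").read_text()
    parts.append("\n".join(index.splitlines()[:index_lines]))
    # Current project state (progress file).
    parts.append((root / "progress.md").read_text())
    return "\n\n".join(parts)
```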

Layer 2: Working Memory

What: Intermediate state during agentic loops

  • Duration: One task/feature (typically 1–5 iterations)
  • Capacity: Fits in context window
  • Access: Built up during session from observations
  • Typical size: 10K–50K tokens

In your harness:

  • Tool results accumulate here as agent reasons
  • Feature list (which items complete, which not)
  • Current debug information, error traces
  • Observation history (what happened when tool was called)

Layer 3: Persistent / Long-term Memory

What: Knowledge that survives across sessions

  • Duration: Project lifetime or longer
  • Capacity: Unbounded (stored in files or databases)
  • Access: Loaded on demand or at session start (first 200 lines)
  • Typical size: Can grow indefinitely

In your harness:

  • MEMORY.md index (pointers to topic files)
  • Topic files (debugging.md, api-conventions.md, architecture-decisions.md)
  • Project state (progress file, feature list)
  • Session transcripts (for auto-dream consolidation)

Structure:

harness-project/
├── CLAUDE.md              # Instructions (loaded at startup)
├── .claude/
│   └── memory/
│       ├── MEMORY.md      # Index file (loaded at startup)
│       ├── debugging.md   # Debugging patterns (load on-demand)
│       ├── decisions.md   # Architecture decisions
│       └── tools-api.md   # Tool/API conventions
└── progress.md            # Current project state

Layer 4: Auto-Dream (Consolidation)

What: Automatic memory cleanup and consolidation between sessions

  • When: Runs automatically between sessions
  • Trigger: 24+ hours since last cleanup AND 5+ new sessions accumulated
  • Duration: Run once, write once (consolidate then resume)

What it does:

  1. Gather signal: Scan session transcripts for user corrections, recurring themes, decisions
  2. Consolidate: Merge overlapping entries, convert relative→absolute dates
  3. Delete contradictions: Remove facts that were later corrected
  4. Prune: Keep MEMORY.md under 200 lines
  5. Update index: Add/remove pointers to topic files
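Step 2's relative-to-absolute date conversion can be sketched as follows. The `RELATIVE` phrase table is a hypothetical example; a real consolidation pass would cover more phrasings.

```python
import re
from datetime import date, timedelta

# Hypothetical relative-date phrases a consolidation pass might normalise.
RELATIVE = {
    "yesterday": timedelta(days=1),
    "last week": timedelta(weeks=1),
}

def absolutise_dates(text: str, today: date) -> str:
    """Replace relative date phrases with ISO dates so memory entries
    remain meaningful when read in a later session."""
    for phrase, delta in RELATIVE.items():
        stamp = (today - delta).isoformat()
        text = re.sub(phrase, stamp, text, flags=re.IGNORECASE)
    return text
```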

Performance (from Anthropic testing):

  • Consolidated 913 sessions in 9 minutes
  • Reduced hallucination by 12% through accurate memory
  • Enabled agents to remember decisions from 50+ sessions back

Unified Memory Architecture Reference

This is the authoritative reference for the four-layer memory architecture. It’s used in doc 06 (Harness Architecture), doc 08 (Python implementation), and doc 09 (Operations), so it’s centralized here to ensure consistency.

Complete Four-Layer Stack

┌─────────────────────────────────────────────────────────────┐
│ Layer 1: Context / Session Memory (128K–200K tokens)        │
│ ├─ Instructions (CLAUDE.md, 5-10K tokens)                   │
│ ├─ Memory index (MEMORY.md, 200 lines / 25K tokens)         │
│ ├─ Current task state (2-5K tokens)                         │
│ └─ Session history (remaining context, ~50-100K tokens)     │
│ [COST: Per token in every prompt] [LATENCY: Immediate]      │
└─────────────────────────────────────────────────────────────┘

                          │ (writes to)
┌─────────────────────────────────────────────────────────────┐
│ Layer 2: Working Memory (accumulates during session)         │
│ ├─ Tool results and observations                            │
│ ├─ Feature completion state                                 │
│ ├─ Error traces and debug info                              │
│ └─ Reasoning steps (10K-50K tokens per task)                │
│ [COST: Counted as context] [LATENCY: Built up real-time]    │
└─────────────────────────────────────────────────────────────┘

                          │ (consolidates into)
┌─────────────────────────────────────────────────────────────┐
│ Layer 3: Persistent / Long-term Memory (files/databases)    │
│ ├─ MEMORY.md index (topics)                                 │
│ ├─ Topic files (debugging, decisions, APIs)                 │
│ ├─ Project state (progress, features)                       │
│ └─ Session transcripts (for auto-dream, unbounded)          │
│ [COST: One-time load] [LATENCY: Load on-demand]             │
└─────────────────────────────────────────────────────────────┘

                          │ (periodically)
┌─────────────────────────────────────────────────────────────┐
│ Layer 4: Auto-Dream / Consolidation (background)            │
│ ├─ Scan transcripts for patterns                            │
│ ├─ Merge duplicates, remove contradictions                  │
│ ├─ Convert relative dates to absolute                       │
│ └─ Prune MEMORY.md to <200 lines                            │
│ [COST: One-time background] [LATENCY: Between sessions]     │
└─────────────────────────────────────────────────────────────┘

Comparison Table

| Layer | Purpose | Duration | Access Pattern | Token Cost | Example |
|---|---|---|---|---|---|
| 1: Context | Current task state | Single session | Always included | Per token | Current bug details + error stack |
| 2: Working | Intermediate reasoning | 1-5 iterations | Built during session | Per token (context) | Tool results, feature checklist |
| 3: Persistent | Cross-session knowledge | Project lifetime | Load on-demand | One-time per session | Architecture decisions, past bugs |
| 4: Auto-Dream | Memory consolidation | Between sessions | Background process | One-time consolidation | Merge redundant entries, learn patterns |

When to Use Each Layer

  • Layer 1 (Context): Instructions, memory index, current task — must be <25% of context window
  • Layer 2 (Working): Accumulates during reasoning loop — grows as agent runs
  • Layer 3 (Persistent): Reference knowledge — loaded first 200 lines (MEMORY.md) + on-demand loads
  • Layer 4 (Auto-Dream): Run after 24+ hours + 5+ sessions — automatic optimization

Token Math Example

For a model with a 128K-token context window:

Context window:      128,000 tokens
Overhead:
  - Instructions:      -10,000 (CLAUDE.md)
  - Memory index:      -25,000 (MEMORY.md)
  - Initialization:     -5,000 (system setup)
  - Buffer (10%):      -12,800 (safety margin)
───────────────────────────────
Available for work:  ~75,200 tokens

Split during session:
  - Agent reasoning:    50,000 (working memory, layer 2)
  - Tool results:       15,000 (observations)
  - User input:         10,200 (remaining buffer)
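The budget arithmetic above can be expressed as a one-line function, using the same figures as the example:

```python
def context_budget(window: int = 128_000,
                   instructions: int = 10_000,
                   memory_index: int = 25_000,
                   init: int = 5_000,
                   buffer_frac: float = 0.10) -> int:
    """Return tokens left for working memory after fixed overhead
    and a proportional safety buffer."""
    buffer = int(window * buffer_frac)  # 10% safety margin
    return window - instructions - memory_index - init - buffer
```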

Cross-document note: Docs 06, 08, and 09 reference this architecture when discussing harness components, implementation, and monitoring respectively. Keep this section as the canonical reference.


RAG: Retrieval-Augmented Generation

Definition: Augmenting LLM responses by retrieving relevant information from external sources before generation, enabling models to access and utilize data beyond their training set.

How It Works

  1. User asks a question
  2. System retrieves relevant documents from knowledge base
  3. Documents are embedded as vectors (numerical representations)
  4. Similarity search finds top-K most relevant documents
  5. Documents are injected into model context
  6. Model generates answer grounded in retrieved information
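Steps 2-4 can be sketched with a toy retriever. Note the `embed` function here is a bag-of-words stand-in for illustration only; a real pipeline would call an embedding model such as nomic-embed-text.

```python
import math

def embed(text: str, dims: int = 64) -> list[float]:
    """Toy stand-in for an embedding model: hash each word
    into a fixed-size vector. Illustration only."""
    v = [0.0] * dims
    for word in text.lower().split():
        v[hash(word) % dims] += 1.0
    return v

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Embed documents, run similarity search, return top-K."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]
```

The retrieved documents would then be injected into the prompt (step 5) before generation.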

Why RAG Matters

  • Real-time knowledge: Current news, real-time data
  • Grounded responses: The model can cite retrieved sources, reducing hallucination
  • Cost-effective: Alternative to fine-tuning (saves 60-80% of cost)
  • Privacy: Keep proprietary data local, never send to model provider
  • Flexibility: Update knowledge base without retraining

Example ROI

Instead of fine-tuning GPT on your 10,000 document repository:

  • Fine-tuning cost: $50K–$200K
  • RAG cost: ~$0.01 per query + one-time embedding ($100–$500)
  • Payback: ROI turns positive within the first ~1,000 queries (~$10 in query cost plus the one-time embedding, versus five figures for fine-tuning)

Vector Stores

Definition: Databases optimized for storing and retrieving high-dimensional vectors (embeddings).

Architecture

Text Input
    ↓
Embedding Model (e.g., nomic-embed-text)
    ↓
Vector (768–1536 dimensions)
    ↓
Vector Database (Pinecone, Weaviate, Qdrant)
    ↓
Similarity Search (cosine distance, L2 distance)
    ↓
Top-K Results + Metadata
Popular vector stores:

| Store | Hosting | Pricing | Best For |
|---|---|---|---|
| Pinecone | Cloud | Pay-per-query | Managed, production-grade |
| Weaviate | Self-hosted | Open-source | Full control, GraphQL API |
| Qdrant | Self-hosted | Open-source | High throughput, scalable |
| Milvus | Self-hosted | Open-source | Distributed, large scale |
| ChromaDB | Local | Open-source | Local development, lightweight |

KV Cache Optimization Impact on RAG

Modern KV cache techniques (GQA, INT8/INT4 quantization, PagedAttention, TurboQuant) also benefit RAG workloads:

  • Longer context windows: More retrieved documents fit in the prompt
  • Lower memory usage: Frees VRAM for larger batch sizes
  • Higher throughput: More RAG queries processed per second
  • Vector search acceleration: TurboQuant’s QJL algorithm also applies to vector similarity search, achieving superior 1@k recall ratios compared to PQ and RaBitQ baselines (tested on GloVe, d=200)
  • See Doc 02 for details on specific KV cache techniques

Memory Patterns in Harnesses

Memory Pattern Used in Production Agent Systems

Session start:

1. Load CLAUDE.md files (config + instructions) — 5K tokens
2. Load MEMORY.md index (pointers) — <1K tokens
3. Load project state file — 2K tokens
4. Ready for work — ~8K tokens used

During session:

1. Agent reasons and acts
2. Tools execute, return observations
3. Session history accumulates
4. On-demand: Load topic file when specific knowledge needed

Between sessions (Auto-Dream):

if time_since_consolidation > 24 hours AND new_sessions >= 5:
  1. Scan session transcripts
  2. Extract signals: corrections, patterns, decisions
  3. Merge with MEMORY.md
  4. Remove contradictions, old references
  5. Keep index ≤ 200 lines
  6. Update pointers to topic files

Implementation Checklist

  • Create CLAUDE.md with core rules and instructions
  • Set up MEMORY.md as compact index (≤200 lines)
  • Create topic files: debugging.md, decisions.md, conventions.md
  • Establish progress.md for tracking feature completion
  • Implement session startup: load instructions + memory index
  • Implement auto-dream: consolidate memory every 24h or 5 sessions
  • Monitor context usage: ensure startup load <10% of context window
  • Set up RAG if working with large proprietary knowledge bases
  • For long contexts: Enable INT8/INT4 KV cache quantization (see Doc 02)

Context Budgeting for Your Harness

Assuming 128K context window:

Instructions (CLAUDE.md)       ~5K tokens (4%)
Memory index (MEMORY.md)       ~1K tokens (1%)
Project state (progress.md)    ~3K tokens (2%)
                        ─────────────────
Loaded at startup            ~9K tokens (7%)

Topic files (on-demand)        ~10K tokens (8%) [when needed]
Current session/loop          ~30K tokens (23%) [working area]
                        ─────────────────
Available for tool results    ~79K tokens (62%) [buffer]

Rule of thumb: Keep startup load <10% so agent has 90% for actual work.


Stateless Agent Pattern: Rebuilding Context From Ground Truth

For agents that run iteratively (search, reason, search again), there is an alternative to maintaining a growing conversation: make the LLM stateless and rebuild context from ground truth on every call.

How It Works

The LLM has no memory between calls. All state lives in Python and on disk. Each call, Python renders a compact narrative from the current state, and the LLM receives it as a fresh prompt.

Call 1:
  Python renders: "Subject: John Smith. No records found yet. Search FreeBMD next."
  LLM responds:   {"action": "search", "source": "freebmd", "query": "Smith 1842"}
  Python executes the search, saves results to disk.

Call 2:
  Python renders: "Subject: John Smith. FreeBMD returned 3 matches. [match details].
                   Evaluate which match is correct."
  LLM responds:   {"action": "select", "match_id": 2, "confidence": 0.85, "reason": "..."}
  Python saves the selection, moves to next step.

Call 3:
  Python renders: "Subject: John Smith. Selected match: born 1842, Lambeth.
                   Search for marriage record next."
  LLM responds:   {"action": "search", "source": "ancestry", "query": "Smith marriage 1860-1870"}

The context window is just a viewport into the latest state. No conversation history accumulates. No compression is needed. No context overflow occurs.
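One stateless iteration can be sketched as below. This is an illustrative skeleton: `call_llm` stands in for your model client, the narrative fields and the `state.json` filename are assumptions, not a prescribed schema.

```python
import json
from pathlib import Path

STATE_FILE = Path("state.json")  # the single source of truth on disk

def render_narrative(state: dict) -> str:
    """Compress current state into a short fresh prompt (~200-500 tokens)."""
    found = state.get("records", [])
    return (f"Subject: {state['subject']}. "
            f"{len(found)} records found so far. "
            f"Next step: {state.get('next_step', 'search')}.")

def run_step(state: dict, call_llm) -> dict:
    """One stateless iteration: render, ask, apply, persist."""
    prompt = render_narrative(state)
    action = call_llm(prompt)                   # fresh prompt, no history
    state.setdefault("actions", []).append(action)
    STATE_FILE.write_text(json.dumps(state))    # crash-safe checkpoint
    return state
```

Because every call rebuilds the prompt from `state`, the 100th iteration costs the same tokens as the 1st, and a crashed run resumes by reloading the JSON file.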

Mapping to the Four-Layer Memory Model

This pattern uses the same four layers, but allocates them differently:

| Layer | Traditional Agent | Stateless Agent |
|---|---|---|
| 1: Context | System prompt + conversation history (grows) | System prompt + rendered narrative + new results (rebuilt each call, fixed size) |
| 2: Working | Accumulates in context window | Python in-process variables (discarded after each call) |
| 3: Persistent | Files on disk, loaded on demand | JSON files on disk (the single source of truth, updated after every call) |
| 4: Consolidation | Auto-dream merges sessions | Not needed — state is already compact by design |

Why This Works

The key insight is that Layer 3 (persistent state on disk) becomes the authoritative record, not the conversation history. Python reads the current state, renders it as a compact narrative (typically 200-500 tokens), appends the latest results, and sends this as a fresh prompt. The LLM never sees the history of how the state was built — only the current state and the next decision to make.

Advantages Over Conversation-Based Agents

  • No growing context: The 100th call uses the same number of tokens as the 1st
  • No degradation over long runs: No summarisation artifacts, no forgotten details, no attention decay
  • No context overflow: State size is controlled by Python, not by conversation length
  • Deterministic replay: The same state file always produces the same prompt, making debugging trivial
  • Crash recovery for free: If the process crashes, restart it — Python reads state from disk and continues from where it left off

When to Use This Pattern

Use stateless agents when:

  • The task is iterative (many calls with incremental progress)
  • State can be represented compactly (structured data, not open-ended conversation)
  • You need to run hundreds or thousands of iterations without degradation
  • Crash recovery matters (the agent must resume cleanly)

Use conversation-based agents when:

  • The interaction is genuinely conversational (human in the loop, back-and-forth)
  • Context from earlier turns is semantically important and hard to render from state
  • The task completes in fewer than 10 turns (context growth is not an issue)

Validation Checklist

How do you know you got this right?

Performance Checks

  • Startup context load <10% of context window (measured with tokenizer)
  • Memory index (MEMORY.md) loads in <1 second at session start
  • MEMORY.md stays under 200 lines consistently
  • Auto-dream consolidation completes in <5 minutes for 5+ sessions
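The first check above can be automated with a rough estimate. The `len // 4` characters-per-token heuristic is an approximation; a real check would run the model's own tokenizer over the startup files.

```python
from pathlib import Path

def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English prose.
    Replace with the model's tokenizer for an exact count."""
    return len(text) // 4

def startup_load_ok(files: list[Path], window: int = 128_000,
                    limit: float = 0.10) -> bool:
    """True if the combined startup files fit under `limit` of the window."""
    total = sum(estimate_tokens(f.read_text()) for f in files)
    return total < window * limit
```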

Implementation Checks

  • Created CLAUDE.md with core instructions (5K-10K tokens)
  • Created MEMORY.md as compact index with backlinks (<200 lines)
  • Set up topic files: debugging.md, decisions.md, conventions.md
  • Progress file tracks feature completion (JSON or markdown)
  • Session startup loads CLAUDE.md + MEMORY.md, NOT all topic files
  • Auto-dream runs automatically after 24h or 5+ sessions

Integration Checks

  • Layer 1 (context) feeds layer 2 (working memory during loops)
  • Layer 2 results consolidate into Layer 3 during auto-dream
  • Layer 3 (persistent files) reload cleanly at next session start
  • Memory index links are accurate (no broken pointers to topic files)

Common Failure Modes

  • Context window exhaustion: Startup load >15%; trim MEMORY.md or move to topic files
  • Sessions forgetting previous work: MEMORY.md not being auto-updated; check auto-dream runs
  • MEMORY.md grows unbounded: No consolidation happening; implement auto-dream trigger
  • Topic files loaded at startup: Should be on-demand only; move out of session initialization

Sign-Off Criteria

  • Memory layers implemented and tested across 3+ sessions
  • Startup latency <2 seconds even with large MEMORY.md
  • Agent can reference past decisions from previous sessions
  • Auto-dream merges overlapping entries and prunes stale info
  • Context budget respected: work area has 75%+ free space

See Also

  • Doc 05 (AI Agents): ReAct loop builds working memory during perceive-reason-act cycles
  • Doc 06 (Harness Architecture): Memory management is component 3 of seven-component system
  • Doc 08 (Claw-Code Python): Reference implementation of file-based memory patterns

Alternative to RAG: The LLM Wiki Pattern (Compiled Markdown Knowledge)

A pattern described by Andrej Karpathy in his LLM Wiki gist (April 4, 2026) challenges the vector database approach for small-to-medium knowledge bases. The core insight: instead of re-deriving answers from raw sources on every query, an LLM incrementally maintains a persistent wiki of compiled knowledge.

The LLM Wiki Pattern

Karpathy describes three layers:

  1. Raw Sources — Immutable curated documents (articles, papers, images, data files)
  2. The Wiki — LLM-generated markdown files with summaries, entity pages, cross-references
  3. The Schema — Configuration document (e.g., CLAUDE.md) defining wiki structure and conventions

Special files:

  • index.md — Content-oriented catalogue of all pages organised by category with one-line summaries
  • log.md — Append-only chronological record with parseable prefixes (e.g., ## [2026-04-02] ingest | Article Title)

A typical layout:

knowledge-base/
├── raw/              # Source material (unaltered)
│   ├── papers/
│   ├── docs/
│   └── web-clips/
└── wiki/             # LLM-compiled knowledge (structured markdown)
    ├── index.md      # Content catalogue, entry points
    ├── log.md        # Append-only chronological record
    ├── concepts/
    │   ├── topic_a.md
    │   ├── topic_b.md
    │   └── backlinks/
    └── summaries/

Three core operations (from the gist):

  • Ingest: Drop a new source; LLM reads it, discusses takeaways, writes a summary, updates 10-15 wiki pages, appends a log entry
  • Query: Ask questions against wiki pages; LLM synthesises answers with citations, files valuable results back as new pages
  • Lint: Periodically health-check for contradictions, stale claims, orphan pages, missing cross-references
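One lint check, finding orphan pages, can be sketched mechanically (the rest of linting needs an LLM pass). This assumes `[[wikilink]]` and `[text](page.md)` link syntax; adjust the regex to your wiki's conventions.

```python
import re
from pathlib import Path

# Matches [[wikilinks]] and markdown links ending in .md
LINK = re.compile(r"\[\[([^\]]+)\]\]|\]\(([^)]+\.md)\)")

def find_orphans(wiki: Path) -> list[str]:
    """Lint check: wiki pages that no other page links to.
    index.md and log.md are entry points, so they are exempt."""
    pages = {p.name for p in wiki.rglob("*.md")}
    linked = set()
    for p in wiki.rglob("*.md"):
        for wikilink, mdlink in LINK.findall(p.read_text()):
            target = (wikilink or mdlink).split("/")[-1]
            if not target.endswith(".md"):
                target += ".md"
            linked.add(target)
    return sorted(pages - linked - {"index.md", "log.md"})
```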

LLM role shifts: From retriever to librarian.

  • Reads raw source files
  • Compiles structured wiki pages (summaries, key concepts, encyclopedia-style articles)
  • Maintains backlinks between related ideas
  • Periodically lints the wiki (health checks, finds inconsistencies, updates stale info)

How queries work:

  1. Query comes in
  2. Look up relevant wiki page(s) using index + natural language matching
  3. Inject wiki content (already summarised, structured, interconnected)
  4. Model generates response from compiled knowledge
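Step 2's index lookup can be approximated without any vector machinery. The sketch below assumes index entries of the form `page.md - one-line summary`, which is an illustrative convention, not a requirement of the pattern.

```python
def route_query(query: str, index_lines: list[str], k: int = 3) -> list[str]:
    """Score each index entry by word overlap with the query and
    return the top-k page names (naive natural-language matching)."""
    q = set(query.lower().split())

    def score(line: str) -> int:
        return len(q & set(line.lower().split()))

    ranked = sorted(index_lines, key=score, reverse=True)
    return [line.split(" - ")[0] for line in ranked[:k] if score(line) > 0]
```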

Why This Matters: Compiled vs. Raw

Traditional RAG:

Raw Papers → Vector Index → Similarity Search → Chunk Retrieval → Generation
(Every query re-reads papers, re-chunks, re-synthesizes)

LLM Wiki Pattern:

Raw Papers → LLM Compiles → Structured Wiki → Query Wiki → Generation
(Papers compiled once; queries run against compiled artifact)

Analogy: Source code vs. compiled binary. The key insight: compile your knowledge first.

Practical Implementation

Setup (from the gist’s recommended tooling):

  • Use Obsidian Web Clipper to convert web articles to markdown
  • Use Obsidian Graph View to visualise wiki connections
  • Use Dataview to query page frontmatter
  • Use qmd (local markdown search engine with CLI and MCP support) when the wiki outgrows simple index lookup
  • Store images locally (so LLM vision can reference them)
  • LLM reads raw markdown, writes structured wiki pages
  • Run periodic linting passes
  • Use git for version control of the markdown repository

Karpathy emphasises this is an “idea file” designed to be adapted: “everything mentioned above is optional and modular.”

Scale Limits & Trade-offs

Best for (Karpathy notes the index approach works at “small scale (~100 sources, ~hundreds of pages)”):

  • Personal knowledge bases
  • Small-team wikis
  • Internal company wikis
  • Project-specific knowledge

Why it works at this scale:

  • Well-organised markdown with summaries
  • Index files act as routing
  • LLM can reason over entire structure
  • More useful context than vector search

At larger scales: Karpathy recommends adding qmd (a local search engine with “hybrid BM25/vector search and LLM re-ranking”) for collections that outgrow the index approach.

At enterprise scale (millions of documents, strict latency):

  • Traditional retrieval infrastructure (vector DBs) still necessary
  • But principle still applies: compile/summarise first, then retrieve
  • Hybrid: compile into chunks, then vector index the chunks

Hybrid Approach: RAG + LLM Wiki

For medium-large harnesses, combine both:

Raw Sources

[LLM Compile Phase]
    ├─ Create wiki summaries
    ├─ Extract key concepts
    └─ Generate backlinks

Structured Wiki Markdown

[Vector Index the Wiki] (not raw sources)
    ├─ Embed wiki pages (not raw PDFs)
    ├─ Index structured content
    └─ Enable semantic search

Query Time:
  1. Try direct wiki lookup (fast)
  2. If needed, semantic search in wiki index
  3. Inject compiled knowledge

Implications for Your Harness

Apply LLM Wiki pattern to long-term memory:

harness-project/.claude/memory/
├── raw/
│   ├── session-transcripts/  # Auto-saved session outputs
│   ├── error-logs/           # Failures, debugging info
│   └── decisions-log/        # Raw notes on choices
└── wiki/
    ├── MEMORY.md             # Curated index + routing
    ├── debugging.md          # Compiled debugging patterns
    ├── architecture.md       # Architecture decisions (compiled)
    ├── api-conventions.md    # API/tool patterns
    └── backlinks/            # Cross-references

During auto-dream consolidation:

1. Scan raw/ (session transcripts, errors, decisions)
2. Extract signals (patterns, corrections, insights)
3. Compile into wiki/ (structured markdown)
4. Update MEMORY.md index with new backlinks
5. Delete old raw sessions (keep last 10)
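Step 5's pruning is a small file operation. The `session-*.md` naming is an assumption; keep whatever glob matches your transcript files.

```python
from pathlib import Path

def prune_raw_sessions(raw_dir: Path, keep: int = 10) -> list[Path]:
    """Delete all but the newest `keep` session transcripts,
    returning the paths that were removed."""
    sessions = sorted(raw_dir.glob("session-*.md"),
                      key=lambda p: p.stat().st_mtime, reverse=True)
    stale = sessions[keep:]
    for p in stale:
        p.unlink()
    return stale
```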

Benefits:

  • Wiki pages are human-readable (you can edit them)
  • Backlinks create mental model (connections surface)
  • Auto-linting finds stale/contradictory entries
  • More useful than vector search for structured reasoning

2026 Outlook

The debate: Is RAG obsolete for small-to-medium cases?

Measured view:

  • Principle (compiled, structured, LLM-maintained) is correct
  • Whether plain markdown is the substrate depends on scale
  • Hybrid (compile + vector index) likely sweet spot
  • For personal projects + small teams: pure markdown wiki wins
  • For enterprise (millions docs, strict latency): retrieval infrastructure necessary

Community impact: Karpathy’s gist has sparked significant discussion. The industry is actively questioning whether traditional RAG is overengineered for small-to-medium knowledge bases.

Market Context: RAG in 2025–2026

  • RAG market: Growing rapidly, with some estimates placing it at over $1B in 2024
  • Trend shift: From pure RAG → compiled + structured knowledge
  • Hybrid approaches: Compile first, then optionally vector index
  • Cost driver: Inference (queries) > training
  • 2026 outlook:
    • KV cache quantization (GQA, INT8/INT4, TurboQuant 3-bit) makes inference more efficient — see Doc 02
    • LLM Wiki pattern gaining traction for small-to-medium teams
    • Hybrid (markdown wiki + optional vector index) emerging as standard

Citations