Memory in AI Systems: Layered Architecture
Four-layer memory architecture (context, working, persistent, auto-dream), RAG, the LLM Wiki pattern (compiled markdown knowledge), and unified reference diagram.
Memory in AI systems operates at multiple levels, mirroring biological cognition. A well-designed harness uses all four layers effectively.
Four Types of Memory
Layer 1: Context / Session Memory
What: The current conversation or execution context window
- Duration: Single session only
- Capacity: Limited (context window size: 128K–200K tokens typical)
- Access: Immediate, included in every prompt
- Cost: Counted per token (expensive for long contexts)
- Use: Current task state, immediate history, working variables
In your harness:
- Load CLAUDE.md instructions (5K–10K tokens)
- Load compact MEMORY.md index (≤200 lines / 25K tokens)
- Load current feature/progress file (2K–5K tokens)
- Reserve remainder for session history and reasoning
Layer 2: Working Memory
What: Intermediate state during agentic loops
- Duration: One task/feature (typically 1–5 iterations)
- Capacity: Fits in context window
- Access: Built up during session from observations
- Typical size: 10K–50K tokens
In your harness:
- Tool results accumulate here as agent reasons
- Feature list (which items complete, which not)
- Current debug information, error traces
- Observation history (what happened when tool was called)
Layer 3: Persistent / Long-term Memory
What: Knowledge that survives across sessions
- Duration: Project lifetime or longer
- Capacity: Unbounded (stored in files or databases)
- Access: Loaded on demand or at session start (first 200 lines)
- Typical size: Can grow indefinitely
In your harness:
- MEMORY.md index (pointers to topic files)
- Topic files (debugging.md, api-conventions.md, architecture-decisions.md)
- Project state (progress file, feature list)
- Session transcripts (for auto-dream consolidation)
Structure:
harness-project/
├── CLAUDE.md # Instructions (loaded at startup)
├── .claude/
│ └── memory/
│ ├── MEMORY.md # Index file (loaded at startup)
│ ├── debugging.md # Debugging patterns (load on-demand)
│ ├── decisions.md # Architecture decisions
│ └── tools-api.md # Tool/API conventions
└── progress.md # Current project state
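The split between the startup index and on-demand topic files can be sketched in a few lines of Python (paths follow the tree above; function names are illustrative, not from any harness API):

```python
from pathlib import Path

def load_startup_context(memory_dir: Path, max_index_lines: int = 200) -> str:
    """Session start: load only the compact MEMORY.md index, capped at 200 lines."""
    lines = (memory_dir / "MEMORY.md").read_text().splitlines()
    return "\n".join(lines[:max_index_lines])

def load_topic(memory_dir: Path, topic: str) -> str:
    """On-demand: load a single topic file (e.g. 'debugging') only when needed."""
    return (memory_dir / f"{topic}.md").read_text()
```

The point of the split: topic files never touch the context window until the agent asks for them.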
Layer 4: Auto-Dream (Consolidation)
What: Automatic memory cleanup and consolidation between sessions
- When: Runs automatically between sessions
- Trigger: 24+ hours since last cleanup AND 5+ new sessions accumulated
- Duration: Run once, write once (consolidate then resume)
What it does:
- Gather signal: Scan session transcripts for user corrections, recurring themes, decisions
- Consolidate: Merge overlapping entries, convert relative→absolute dates
- Delete contradictions: Remove facts that were later corrected
- Prune: Keep MEMORY.md under 200 lines
- Update index: Add/remove pointers to topic files
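The trigger condition and the line-cap pruning rule can be sketched as follows (a minimal illustration; real consolidation also merges overlapping entries rather than just dropping lines):

```python
import time

def should_consolidate(last_cleanup_ts, new_sessions, now=None):
    """Auto-dream trigger: 24+ hours since the last cleanup AND 5+ new sessions."""
    now = time.time() if now is None else now
    return (now - last_cleanup_ts) >= 24 * 3600 and new_sessions >= 5

def prune_index(lines, max_lines=200):
    """Keep MEMORY.md under the line cap; here, a simple drop-oldest-first policy."""
    return lines[-max_lines:]
```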
Performance (from Anthropic testing):
- Consolidated 913 sessions in 9 minutes
- Reduced hallucination by 12% through accurate memory
- Enabled agents to remember decisions from 50+ sessions back
Unified Memory Architecture Reference
This is the authoritative reference for the four-layer memory architecture. It’s used in doc 06 (Harness Architecture), doc 08 (Python implementation), and doc 09 (Operations), so it’s centralized here to ensure consistency.
Complete Four-Layer Stack
┌─────────────────────────────────────────────────────────────┐
│ Layer 1: Context / Session Memory (128K–200K tokens) │
│ ├─ Instructions (CLAUDE.md, 5-10K tokens) │
│ ├─ Memory index (MEMORY.md, 200 lines / 25K tokens) │
│ ├─ Current task state (2-5K tokens) │
│ └─ Session history (remaining context, ~50-100K tokens) │
│ [COST: Per token in every prompt] [LATENCY: Immediate] │
└─────────────────────────────────────────────────────────────┘
▲
│ (writes to)
┌─────────────────────────────────────────────────────────────┐
│ Layer 2: Working Memory (accumulates during session) │
│ ├─ Tool results and observations │
│ ├─ Feature completion state │
│ ├─ Error traces and debug info │
│ └─ Reasoning steps (10K-50K tokens per task) │
│ [COST: Counted as context] [LATENCY: Built up real-time] │
└─────────────────────────────────────────────────────────────┘
▲
│ (consolidates into)
┌─────────────────────────────────────────────────────────────┐
│ Layer 3: Persistent / Long-term Memory (files/databases) │
│ ├─ MEMORY.md index (topics) │
│ ├─ Topic files (debugging, decisions, APIs) │
│ ├─ Project state (progress, features) │
│ └─ Session transcripts (for auto-dream, unbounded) │
│ [COST: One-time load] [LATENCY: Load on-demand] │
└─────────────────────────────────────────────────────────────┘
▲
│ (periodically)
┌─────────────────────────────────────────────────────────────┐
│ Layer 4: Auto-Dream / Consolidation (background) │
│ ├─ Scan transcripts for patterns │
│ ├─ Merge duplicates, remove contradictions │
│ ├─ Convert relative dates to absolute │
│ └─ Prune MEMORY.md to <200 lines │
│ [COST: One-time background] [LATENCY: Between sessions] │
└─────────────────────────────────────────────────────────────┘
Comparison Table
| Layer | Purpose | Duration | Access Pattern | Token Cost | Example |
|---|---|---|---|---|---|
| 1: Context | Current task state | Single session | Always included | Per token | Current bug details + error stack |
| 2: Working | Intermediate reasoning | 1-5 iterations | Built during session | Per token (context) | Tool results, feature checklist |
| 3: Persistent | Cross-session knowledge | Project lifetime | Load on-demand | One-time per session | Architecture decisions, past bugs |
| 4: Auto-Dream | Memory consolidation | Between sessions | Background process | One-time consolidation | Merge redundant entries, learn patterns |
When to Use Each Layer
- Layer 1 (Context): Instructions, memory index, current task — keep startup load well under 25% of the context window (ideally below 10%)
- Layer 2 (Working): Accumulates during reasoning loop — grows as agent runs
- Layer 3 (Persistent): Reference knowledge — loaded first 200 lines (MEMORY.md) + on-demand loads
- Layer 4 (Auto-Dream): Run after 24+ hours + 5+ sessions — automatic optimization
Token Math Example
For a model with a 128K context window:
Context window: 128,000 tokens
Overhead:
- Instructions: -10,000 (CLAUDE.md)
- Memory index: -25,000 (MEMORY.md)
- Initialization: -5,000 (system setup)
- Buffer (10%): -12,800 (safety margin)
───────────────────────────────
Available for work: ~75,200 tokens
Split during session:
- Agent reasoning: 50,000 (working memory, layer 2)
- Tool results: 15,000 (observations)
- User input: 10,200 (remaining buffer)
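The arithmetic above can be captured in a small helper (defaults mirror the worked example; the function is illustrative):

```python
def available_for_work(window=128_000, instructions=10_000, memory_index=25_000,
                       init=5_000, buffer_pct=0.10):
    """Worked example: window minus fixed overhead minus a 10% safety buffer."""
    return window - instructions - memory_index - init - int(window * buffer_pct)

# 128,000 - 10,000 - 25,000 - 5,000 - 12,800 = 75,200 tokens available
```

The same formula generalizes to other window sizes, e.g. a 200K window leaves 140,000 tokens with identical overheads.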
Cross-document note: Docs 06, 08, and 09 reference this architecture when discussing harness components, implementation, and monitoring respectively. Keep this section as the canonical reference.
RAG: Retrieval-Augmented Generation
Definition: Augmenting LLM responses by retrieving relevant information from external sources before generation, enabling models to access and utilize data beyond their training set.
How It Works
- Documents are embedded as vectors (numerical representations) and stored in an index ahead of time
- User asks a question; the query is embedded with the same model
- Similarity search finds the top-K most relevant documents
- Retrieved documents are injected into the model context
- Model generates an answer grounded in the retrieved information
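The retrieve step can be sketched end to end; here a toy bag-of-words counter stands in for a real embedding model, so only the top-K ranking logic is representative:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system uses a dense embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=3):
    """Top-K similarity search: rank documents against the embedded query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]
```

In production the embeddings are precomputed and the ranking is delegated to a vector store; the query-time shape (embed query, rank, take top-K, inject) is the same.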
Why RAG Matters
- Real-time knowledge: Current news, real-time data
- Grounded responses: the model can cite retrieved sources, which reduces hallucination
- Cost-effective: Alternative to fine-tuning (saves 60-80% of cost)
- Privacy: Keep proprietary data local, never send to model provider
- Flexibility: Update knowledge base without retraining
Example ROI
Instead of fine-tuning GPT on your 10,000 document repository:
- Fine-tuning cost: $50K–$200K
- RAG cost: ~$0.01 per query + one-time embedding ($100–$500)
- Payback: ROI turns positive after roughly 1,000 queries
Vector Stores
Definition: Databases optimized for storing and retrieving high-dimensional vectors (embeddings).
Architecture
Text Input
↓
Embedding Model (e.g., nomic-embed-text)
↓
Vector (768–1536 dimensions)
↓
Vector Database (Pinecone, Weaviate, Qdrant)
↓
Similarity Search (cosine distance, L2 distance)
↓
Top-K Results + Metadata
Popular Options
| Store | Hosting | Pricing | Best For |
|---|---|---|---|
| Pinecone | Cloud | Pay-per-query | Managed, production-grade |
| Weaviate | Self-hosted | Open-source | Full control, GraphQL API |
| Qdrant | Self-hosted | Open-source | High throughput, scalable |
| Milvus | Self-hosted | Open-source | Distributed, large scale |
| ChromaDB | Local | Open-source | Local development, lightweight |
KV Cache Optimization Impact on RAG
Modern KV cache techniques (GQA, INT8/INT4 quantization, PagedAttention, TurboQuant) also benefit RAG workloads:
- Longer context windows: More retrieved documents fit in the prompt
- Lower memory usage: Frees VRAM for larger batch sizes
- Higher throughput: More RAG queries processed per second
- Vector search acceleration: TurboQuant’s QJL algorithm also applies to vector similarity search, achieving superior 1@k recall compared to PQ and RaBitQ baselines (tested on GloVe, d=200)
- See Doc 02 for details on specific KV cache techniques
Memory Patterns in Harnesses
Memory Pattern Used in Production Agent Systems
Session start:
1. Load CLAUDE.md files (config + instructions) — 5K tokens
2. Load MEMORY.md index (pointers) — <1K tokens
3. Load project state file — 2K tokens
4. Ready for work — ~8K tokens used
During session:
1. Agent reasons and acts
2. Tools execute, return observations
3. Session history accumulates
4. On-demand: Load topic file when specific knowledge needed
Between sessions (Auto-Dream):
if time_since_consolidation >= 24 hours and new_sessions >= 5:
    1. Scan session transcripts
    2. Extract signals: corrections, patterns, decisions
    3. Merge with MEMORY.md
    4. Remove contradictions, old references
    5. Keep index ≤ 200 lines
    6. Update pointers to topic files
Implementation Checklist
- Create CLAUDE.md with core rules and instructions
- Set up MEMORY.md as compact index (≤200 lines)
- Create topic files: debugging.md, decisions.md, conventions.md
- Establish progress.md for tracking feature completion
- Implement session startup: load instructions + memory index
- Implement auto-dream: consolidate memory after 24+ hours and 5+ new sessions
- Monitor context usage: ensure startup load <10% of context window
- Set up RAG if working with large proprietary knowledge bases
- For long contexts: Enable INT8/INT4 KV cache quantization (see Doc 02)
Context Budgeting for Your Harness
Assuming 128K context window:
Instructions (CLAUDE.md) ~5K tokens (4%)
Memory index (MEMORY.md) ~1K tokens (1%)
Project state (progress.md) ~3K tokens (2%)
─────────────────
Loaded at startup ~9K tokens (7%)
Topic files (on-demand) ~10K tokens (8%) [when needed]
Current session/loop ~30K tokens (23%) [working area]
─────────────────
Available for tool results ~79K tokens (62%) [buffer]
Rule of thumb: Keep startup load <10% so agent has 90% for actual work.
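A one-line guard for that rule of thumb (the token count would come from your tokenizer; the function name is illustrative):

```python
def startup_load_ok(startup_tokens, window=128_000, limit=0.10):
    """Rule of thumb: startup load should stay below 10% of the context window."""
    return startup_tokens / window < limit
```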
Stateless Agent Pattern: Rebuilding Context From Ground Truth
For agents that run iteratively (search, reason, search again), there is an alternative to maintaining a growing conversation: make the LLM stateless and rebuild context from ground truth on every call.
How It Works
The LLM has no memory between calls. All state lives in Python and on disk. Each call, Python renders a compact narrative from the current state, and the LLM receives it as a fresh prompt.
Call 1:
Python renders: "Subject: John Smith. No records found yet. Search FreeBMD next."
LLM responds: {"action": "search", "source": "freebmd", "query": "Smith 1842"}
Python executes the search, saves results to disk.
Call 2:
Python renders: "Subject: John Smith. FreeBMD returned 3 matches. [match details].
Evaluate which match is correct."
LLM responds: {"action": "select", "match_id": 2, "confidence": 0.85, "reason": "..."}
Python saves the selection, moves to next step.
Call 3:
Python renders: "Subject: John Smith. Selected match: born 1842, Lambeth.
Search for marriage record next."
LLM responds: {"action": "search", "source": "ancestry", "query": "Smith marriage 1860-1870"}
The context window is just a viewport into the latest state. No conversation history accumulates. No compression is needed. No context overflow occurs.
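One stateless iteration can be sketched as follows, with the LLM call injected as a stub; the state schema and field names here are hypothetical, not from any specific harness:

```python
import json
from pathlib import Path

def render_narrative(state):
    """Rebuild a compact prompt from ground truth on every call; no history carried over."""
    return (f"Subject: {state['subject']}. "
            f"{len(state['records'])} record(s) found so far. "
            f"Next step: {state['next_step']}.")

def step(state, llm_call, state_file: Path):
    """One stateless iteration: render state -> call LLM -> persist to disk."""
    action = llm_call(render_narrative(state))  # llm_call is injected (stub or real API)
    state["records"].append(action)             # update the ground truth in Python
    state_file.write_text(json.dumps(state))    # crash recovery: reload and continue
    return state
```

Because the prompt is a pure function of the state file, any iteration can be replayed by re-rendering the saved state.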
Mapping to the Four-Layer Memory Model
This pattern uses the same four layers, but allocates them differently:
| Layer | Traditional Agent | Stateless Agent |
|---|---|---|
| 1: Context | System prompt + conversation history (grows) | System prompt + rendered narrative + new results (rebuilt each call, fixed size) |
| 2: Working | Accumulates in context window | Python in-process variables (discarded after each call) |
| 3: Persistent | Files on disk, loaded on demand | JSON files on disk (the single source of truth, updated after every call) |
| 4: Consolidation | Auto-dream merges sessions | Not needed — state is already compact by design |
Why This Works
The key insight is that Layer 3 (persistent state on disk) becomes the authoritative record, not the conversation history. Python reads the current state, renders it as a compact narrative (typically 200-500 tokens), appends the latest results, and sends this as a fresh prompt. The LLM never sees the history of how the state was built — only the current state and the next decision to make.
Advantages Over Conversation-Based Agents
- No growing context: The 100th call uses the same number of tokens as the 1st
- No degradation over long runs: No summarisation artifacts, no forgotten details, no attention decay
- No context overflow: State size is controlled by Python, not by conversation length
- Deterministic replay: The same state file always produces the same prompt, making debugging trivial
- Crash recovery for free: If the process crashes, restart it — Python reads state from disk and continues from where it left off
When to Use This Pattern
Use stateless agents when:
- The task is iterative (many calls with incremental progress)
- State can be represented compactly (structured data, not open-ended conversation)
- You need to run hundreds or thousands of iterations without degradation
- Crash recovery matters (the agent must resume cleanly)
Use conversation-based agents when:
- The interaction is genuinely conversational (human in the loop, back-and-forth)
- Context from earlier turns is semantically important and hard to render from state
- The task completes in fewer than 10 turns (context growth is not an issue)
Validation Checklist
How do you know you got this right?
Performance Checks
- Startup context load <10% of context window (measured with tokenizer)
- Memory index (MEMORY.md) loads in <1 second at session start
- MEMORY.md stays under 200 lines consistently
- Auto-dream consolidation completes in <5 minutes for 5+ sessions
Implementation Checks
- Created CLAUDE.md with core instructions (5K-10K tokens)
- Created MEMORY.md as compact index with backlinks (<200 lines)
- Set up topic files: debugging.md, decisions.md, conventions.md
- Progress file tracks feature completion (JSON or markdown)
- Session startup loads CLAUDE.md + MEMORY.md, NOT all topic files
- Auto-dream runs automatically after 24+ hours and 5+ new sessions
Integration Checks
- Layer 1 (context) feeds layer 2 (working memory during loops)
- Layer 2 results consolidate into Layer 3 during auto-dream
- Layer 3 (persistent files) reload cleanly at next session start
- Memory index links are accurate (no broken pointers to topic files)
Common Failure Modes
- Context window exhaustion: Startup load >15%; trim MEMORY.md or move to topic files
- Sessions forgetting previous work: MEMORY.md not being auto-updated; check auto-dream runs
- MEMORY.md grows unbounded: No consolidation happening; implement auto-dream trigger
- Topic files loaded at startup: Should be on-demand only; move out of session initialization
Sign-Off Criteria
- Memory layers implemented and tested across 3+ sessions
- Startup latency <2 seconds even with large MEMORY.md
- Agent can reference past decisions from previous sessions
- Auto-dream merges overlapping entries and prunes stale info
- Context budget respected: work area has 75%+ free space
See Also
- Doc 05 (AI Agents): ReAct loop builds working memory during perceive-reason-act cycles
- Doc 06 (Harness Architecture): Memory management is component 3 of seven-component system
- Doc 08 (Claw-Code Python): Reference implementation of file-based memory patterns
Alternative to RAG: The LLM Wiki Pattern (Compiled Markdown Knowledge)
A pattern described by Andrej Karpathy in his LLM Wiki gist (April 4, 2026) challenges the vector database approach for small-to-medium knowledge bases. The core insight: instead of re-deriving answers from raw sources on every query, an LLM incrementally maintains a persistent wiki of compiled knowledge.
The LLM Wiki Pattern
Karpathy describes three layers:
- Raw Sources — Immutable curated documents (articles, papers, images, data files)
- The Wiki — LLM-generated markdown files with summaries, entity pages, cross-references
- The Schema — Configuration document (e.g., CLAUDE.md) defining wiki structure and conventions
Special files:
- index.md — Content-oriented catalogue of all pages organised by category with one-line summaries
- log.md — Append-only chronological record with parseable prefixes (e.g., ## [2026-04-02] ingest | Article Title)
knowledge-base/
├── raw/ # Source material (unaltered)
│ ├── papers/
│ ├── docs/
│ └── web-clips/
└── wiki/ # LLM-compiled knowledge (structured markdown)
├── index.md # Content catalogue, entry points
├── log.md # Append-only chronological record
├── concepts/
│ ├── topic_a.md
│ ├── topic_b.md
│ └── backlinks/
└── summaries/
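The parseable log.md prefix shown above lends itself to trivial tooling; a sketch of a log-entry parser (the prefix format is from the example, the code is illustrative):

```python
import re

LOG_LINE = re.compile(r"^## \[(\d{4}-\d{2}-\d{2})\] (\w+) \| (.+)$")

def parse_log_entry(line):
    """Parse a log.md heading into (date, operation, title); None if not a log entry."""
    m = LOG_LINE.match(line)
    return m.groups() if m else None
```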
Three core operations (from the gist):
- Ingest: Drop a new source; LLM reads it, discusses takeaways, writes a summary, updates 10-15 wiki pages, appends a log entry
- Query: Ask questions against wiki pages; LLM synthesises answers with citations, files valuable results back as new pages
- Lint: Periodically health-check for contradictions, stale claims, orphan pages, missing cross-references
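The Lint operation can start as simple scripted checks; for example, a sketch that flags orphan pages never referenced from index.md (layout and check are illustrative):

```python
from pathlib import Path

def find_orphan_pages(wiki_dir: Path):
    """One lint check: wiki pages whose name never appears in index.md."""
    index_text = (wiki_dir / "index.md").read_text()
    pages = [p for p in wiki_dir.rglob("*.md") if p.name not in ("index.md", "log.md")]
    return sorted(p.name for p in pages if p.stem not in index_text)
```

Contradiction and staleness checks need the LLM itself; structural checks like this one stay deterministic.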
LLM role shifts: From retriever to librarian.
- Reads raw source files
- Compiles structured wiki pages (summaries, key concepts, encyclopedia-style articles)
- Maintains backlinks between related ideas
- Periodically lints the wiki (health checks, finds inconsistencies, updates stale info)
How queries work:
- Query comes in
- Look up relevant wiki page(s) using index + natural language matching
- Inject wiki content (already summarised, structured, interconnected)
- Model generates response from compiled knowledge
Why This Matters: Compiled vs. Raw
Traditional RAG:
Raw Papers → Vector Index → Similarity Search → Chunk Retrieval → Generation
(Every query re-reads papers, re-chunks, re-synthesizes)
LLM Wiki Pattern:
Raw Papers → LLM Compiles → Structured Wiki → Query Wiki → Generation
(Papers compiled once; queries run against compiled artifact)
Analogy: Source code vs. compiled binary. The key insight: compile your knowledge first.
Practical Implementation
Setup (from the gist’s recommended tooling):
- Use Obsidian Web Clipper to convert web articles to markdown
- Use Obsidian Graph View to visualise wiki connections
- Use Dataview to query page frontmatter
- Use qmd (local markdown search engine with CLI and MCP support) when the wiki outgrows simple index lookup
- Store images locally (so LLM vision can reference them)
- LLM reads raw markdown, writes structured wiki pages
- Run periodic linting passes
- Use git for version control of the markdown repository
Karpathy emphasises this is an “idea file” designed to be adapted: “everything mentioned above is optional and modular.”
Scale Limits & Trade-offs
Best for (Karpathy notes the index approach works at “small scale (~100 sources, ~hundreds of pages)”):
- Personal knowledge bases
- Small-team wikis
- Internal company wikis
- Project-specific knowledge
Why it works at this scale:
- Well-organised markdown with summaries
- Index files act as routing
- LLM can reason over entire structure
- More useful context than vector search
At larger scales: Karpathy recommends adding qmd (a local search engine with “hybrid BM25/vector search and LLM re-ranking”) for collections that outgrow the index approach.
At enterprise scale (millions of documents, strict latency):
- Traditional retrieval infrastructure (vector DBs) still necessary
- But principle still applies: compile/summarise first, then retrieve
- Hybrid: compile into chunks, then vector index the chunks
Hybrid Approach: RAG + LLM Wiki
For medium-large harnesses, combine both:
Raw Sources
↓
[LLM Compile Phase]
├─ Create wiki summaries
├─ Extract key concepts
└─ Generate backlinks
↓
Structured Wiki Markdown
↓
[Vector Index the Wiki] (not raw sources)
├─ Embed wiki pages (not raw PDFs)
├─ Index structured content
└─ Enable semantic search
↓
Query Time:
1. Try direct wiki lookup (fast)
2. If needed, semantic search in wiki index
3. Inject compiled knowledge
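The two-step query flow can be sketched as follows, with semantic_search injected as a fallback (e.g. a vector search over the wiki pages; all names here are illustrative):

```python
def hybrid_lookup(query, index, semantic_search):
    """Query time: try a direct index hit first; fall back to semantic search."""
    hits = [page for title, page in index.items() if title.lower() in query.lower()]
    return hits if hits else semantic_search(query)
```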
Implications for Your Harness
Apply LLM Wiki pattern to long-term memory:
harness-project/.claude/memory/
├── raw/
│ ├── session-transcripts/ # Auto-saved session outputs
│ ├── error-logs/ # Failures, debugging info
│ └── decisions-log/ # Raw notes on choices
└── wiki/
├── MEMORY.md # Curated index + routing
├── debugging.md # Compiled debugging patterns
├── architecture.md # Architecture decisions (compiled)
├── api-conventions.md # API/tool patterns
└── backlinks/ # Cross-references
During auto-dream consolidation:
1. Scan raw/ (session transcripts, errors, decisions)
2. Extract signals (patterns, corrections, insights)
3. Compile into wiki/ (structured markdown)
4. Update MEMORY.md index with new backlinks
5. Delete old raw sessions (keep last 10)
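Step 5 above is straightforward to script; a sketch assuming transcripts use zero-padded session-NN.md filenames (a naming convention chosen here for illustration):

```python
from pathlib import Path

def prune_raw_sessions(raw_dir: Path, keep: int = 10):
    """Delete old raw session transcripts, keeping only the newest N."""
    sessions = sorted(raw_dir.glob("session-*.md"))  # zero-padded names sort chronologically
    for old in sessions[:-keep]:
        old.unlink()
    return sorted(p.name for p in raw_dir.glob("session-*.md"))
```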
Benefits:
- Wiki pages are human-readable (you can edit them)
- Backlinks create mental model (connections surface)
- Auto-linting finds stale/contradictory entries
- More useful than vector search for structured reasoning
2026 Outlook
The debate: Is RAG obsolete for small-to-medium cases?
Measured view:
- Principle (compiled, structured, LLM-maintained) is correct
- Whether plain markdown is the substrate depends on scale
- Hybrid (compile + vector index) likely sweet spot
- For personal projects + small teams: pure markdown wiki wins
- For enterprise (millions docs, strict latency): retrieval infrastructure necessary
Community impact: Karpathy’s gist has sparked significant discussion. The industry is actively questioning whether traditional RAG is overengineered for small-to-medium knowledge bases.
Market Context: RAG in 2025–2026
- RAG market: Growing rapidly, with some estimates placing it at over $1B in 2024
- Trend shift: From pure RAG → compiled + structured knowledge
- Hybrid approaches: Compile first, then optionally vector index
- Cost driver: Inference (queries) > training
- 2026 outlook:
- KV cache quantization (GQA, INT8/INT4, TurboQuant 3-bit) makes inference more efficient — see Doc 02
- LLM Wiki pattern gaining traction for small-to-medium teams
- Hybrid (markdown wiki + optional vector index) emerging as standard
Citations
- Andrej Karpathy’s LLM Wiki Gist — April 4, 2026. Describes the three-layer LLM Wiki pattern (Raw Sources, Wiki, Schema) with Ingest/Query/Lint operations.
- TurboQuant: Redefining AI Efficiency with Extreme Compression — Google Research Blog, March 24, 2026. Referenced for vector search acceleration benefits relevant to RAG.