Memory in AI Systems: Layered Architecture
Four-layer memory architecture (context, working, persistent, auto-dream), RAG, the LLM Wiki pattern (compiled markdown knowledge), and unified reference diagram.
Memory in AI systems operates at multiple levels, mirroring biological cognition. A well-designed harness uses all four layers effectively.
Four Types of Memory
Layer 1: Context / Session Memory
What: The current conversation or execution context window
- Duration: Single session only
- Capacity: Limited (context window size: 128K–200K tokens typical)
- Access: Immediate, included in every prompt
- Cost: Counted per token (expensive for long contexts)
- Use: Current task state, immediate history, working variables
In your harness:
- Load CLAUDE.md instructions (5K–10K tokens)
- Load compact MEMORY.md index (≤200 lines / 25K tokens)
- Load current feature/progress file (2K–5K tokens)
- Reserve remainder for session history and reasoning
Layer 2: Working Memory
What: Intermediate state during agentic loops
- Duration: One task/feature (typically 1–5 iterations)
- Capacity: Fits in context window
- Access: Built up during session from observations
- Typical size: 10K–50K tokens
In your harness:
- Tool results accumulate here as agent reasons
- Feature list (which items complete, which not)
- Current debug information, error traces
- Observation history (what happened when tool was called)
Layer 3: Persistent / Long-term Memory
What: Knowledge that survives across sessions
- Duration: Project lifetime or longer
- Capacity: Unbounded (stored in files or databases)
- Access: Loaded on demand or at session start (first 200 lines)
- Typical size: Can grow indefinitely
In your harness:
- MEMORY.md index (pointers to topic files)
- Topic files (debugging.md, api-conventions.md, architecture-decisions.md)
- Project state (progress file, feature list)
- Session transcripts (for auto-dream consolidation)
Structure:
harness-project/
├── CLAUDE.md # Instructions (loaded at startup)
├── .claude/
│ └── memory/
│ ├── MEMORY.md # Index file (loaded at startup)
│ ├── debugging.md # Debugging patterns (load on-demand)
│ ├── decisions.md # Architecture decisions
│ └── tools-api.md # Tool/API conventions
└── progress.md # Current project state
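The split between the startup index and on-demand topic files can be sketched in a few lines of Python (paths follow the tree above; function names are illustrative, not from any harness API):

```python
from pathlib import Path

def load_startup_context(memory_dir: Path, max_index_lines: int = 200) -> str:
    """Session start: load only the compact MEMORY.md index, capped at 200 lines."""
    lines = (memory_dir / "MEMORY.md").read_text().splitlines()
    return "\n".join(lines[:max_index_lines])

def load_topic(memory_dir: Path, topic: str) -> str:
    """On-demand: load a single topic file (e.g. 'debugging') only when needed."""
    return (memory_dir / f"{topic}.md").read_text()
```

The point of the split: topic files never touch the context window until the agent asks for them.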
Layer 4: Auto-Dream (Consolidation)
What: Automatic memory cleanup and consolidation between sessions
- When: Runs automatically between sessions
- Trigger: 24+ hours since last cleanup AND 5+ new sessions accumulated
- Duration: Run once, write once (consolidate then resume)
What it does:
- Gather signal: Scan session transcripts for user corrections, recurring themes, decisions
- Consolidate: Merge overlapping entries, convert relative→absolute dates
- Delete contradictions: Remove facts that were later corrected
- Prune: Keep MEMORY.md under 200 lines
- Update index: Add/remove pointers to topic files
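The trigger condition and the line-cap pruning rule can be sketched as follows (a minimal illustration; real consolidation also merges overlapping entries rather than just dropping lines):

```python
import time

def should_consolidate(last_cleanup_ts, new_sessions, now=None):
    """Auto-dream trigger: 24+ hours since the last cleanup AND 5+ new sessions."""
    now = time.time() if now is None else now
    return (now - last_cleanup_ts) >= 24 * 3600 and new_sessions >= 5

def prune_index(lines, max_lines=200):
    """Keep MEMORY.md under the line cap; here, a simple drop-oldest-first policy."""
    return lines[-max_lines:]
```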
Performance (from Anthropic testing):
- Consolidated 913 sessions in 9 minutes
- Reduced hallucination by 12% through accurate memory
- Enabled agents to remember decisions from 50+ sessions back
Unified Memory Architecture Reference
This is the authoritative reference for the four-layer memory architecture. It’s used in doc 06 (Harness Architecture), doc 08 (Python implementation), and doc 09 (Operations), so it’s centralized here to ensure consistency.
Complete Four-Layer Stack
┌─────────────────────────────────────────────────────────────┐
│ Layer 1: Context / Session Memory (128K–200K tokens) │
│ ├─ Instructions (CLAUDE.md, 5-10K tokens) │
│ ├─ Memory index (MEMORY.md, 200 lines / 25K tokens) │
│ ├─ Current task state (2-5K tokens) │
│ └─ Session history (remaining context, ~50-100K tokens) │
│ [COST: Per token in every prompt] [LATENCY: Immediate] │
└─────────────────────────────────────────────────────────────┘
▲
│ (writes to)
┌─────────────────────────────────────────────────────────────┐
│ Layer 2: Working Memory (accumulates during session) │
│ ├─ Tool results and observations │
│ ├─ Feature completion state │
│ ├─ Error traces and debug info │
│ └─ Reasoning steps (10K-50K tokens per task) │
│ [COST: Counted as context] [LATENCY: Built up real-time] │
└─────────────────────────────────────────────────────────────┘
▲
│ (consolidates into)
┌─────────────────────────────────────────────────────────────┐
│ Layer 3: Persistent / Long-term Memory (files/databases) │
│ ├─ MEMORY.md index (topics) │
│ ├─ Topic files (debugging, decisions, APIs) │
│ ├─ Project state (progress, features) │
│ └─ Session transcripts (for auto-dream, unbounded) │
│ [COST: One-time load] [LATENCY: Load on-demand] │
└─────────────────────────────────────────────────────────────┘
▲
│ (periodically)
┌─────────────────────────────────────────────────────────────┐
│ Layer 4: Auto-Dream / Consolidation (background) │
│ ├─ Scan transcripts for patterns │
│ ├─ Merge duplicates, remove contradictions │
│ ├─ Convert relative dates to absolute │
│ └─ Prune MEMORY.md to <200 lines │
│ [COST: One-time background] [LATENCY: Between sessions] │
└─────────────────────────────────────────────────────────────┘
Comparison Table
| Layer | Purpose | Duration | Access Pattern | Token Cost | Example |
|---|---|---|---|---|---|
| 1: Context | Current task state | Single session | Always included | Per token | Current bug details + error stack |
| 2: Working | Intermediate reasoning | 1-5 iterations | Built during session | Per token (context) | Tool results, feature checklist |
| 3: Persistent | Cross-session knowledge | Project lifetime | Load on-demand | One-time per session | Architecture decisions, past bugs |
| 4: Auto-Dream | Memory consolidation | Between sessions | Background process | One-time consolidation | Merge redundant entries, learn patterns |
When to Use Each Layer
- Layer 1 (Context): Instructions, memory index, current task — keep startup load well under 25% of the context window (ideally below 10%)
- Layer 2 (Working): Accumulates during reasoning loop — grows as agent runs
- Layer 3 (Persistent): Reference knowledge — loaded first 200 lines (MEMORY.md) + on-demand loads
- Layer 4 (Auto-Dream): Run after 24+ hours + 5+ sessions — automatic optimization
Token Math Example
For a model with a 128K context window:
Context window: 128,000 tokens
Overhead:
- Instructions: -10,000 (CLAUDE.md)
- Memory index: -25,000 (MEMORY.md)
- Initialization: -5,000 (system setup)
- Buffer (10%): -12,800 (safety margin)
───────────────────────────────
Available for work: ~75,200 tokens
Split during session:
- Agent reasoning: 50,000 (working memory, layer 2)
- Tool results: 15,000 (observations)
- User input: 10,200 (remaining buffer)
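The arithmetic above can be captured in a small helper (defaults mirror the worked example; the function is illustrative):

```python
def available_for_work(window=128_000, instructions=10_000, memory_index=25_000,
                       init=5_000, buffer_pct=0.10):
    """Worked example: window minus fixed overhead minus a 10% safety buffer."""
    return window - instructions - memory_index - init - int(window * buffer_pct)

# 128,000 - 10,000 - 25,000 - 5,000 - 12,800 = 75,200 tokens available
```

The same formula generalizes to other window sizes, e.g. a 200K window leaves 140,000 tokens with identical overheads.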
Cross-document note: Docs 06, 08, and 09 reference this architecture when discussing harness components, implementation, and monitoring respectively. Keep this section as the canonical reference.
RAG: Retrieval-Augmented Generation
Definition: Augmenting LLM responses by retrieving relevant information from external sources before generation, enabling models to access and utilize data beyond their training set.
How It Works
- Documents are embedded as vectors (numerical representations) and stored in an index ahead of time
- User asks a question; the query is embedded with the same model
- Similarity search finds the top-K most relevant documents
- Retrieved documents are injected into the model context
- Model generates an answer grounded in the retrieved information
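The retrieve step can be sketched end to end; here a toy bag-of-words counter stands in for a real embedding model, so only the top-K ranking logic is representative:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system uses a dense embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=3):
    """Top-K similarity search: rank documents against the embedded query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]
```

In production the embeddings are precomputed and the ranking is delegated to a vector store; the query-time shape (embed query, rank, take top-K, inject) is the same.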
Why RAG Matters
- Real-time knowledge: Current news, real-time data
- Grounded responses: the model can cite retrieved sources, which reduces hallucination
- Cost-effective: Alternative to fine-tuning (saves 60-80% of cost)
- Privacy: Keep proprietary data local, never send to model provider
- Flexibility: Update knowledge base without retraining
Example ROI
Instead of fine-tuning GPT on your 10,000 document repository:
- Fine-tuning cost: $50K–$200K
- RAG cost: ~$0.01 per query + one-time embedding ($100–$500)
- Payback: ROI turns positive after roughly 1,000 queries
Vector Stores
Definition: Databases optimized for storing and retrieving high-dimensional vectors (embeddings).
Architecture
Text Input
↓
Embedding Model (e.g., nomic-embed-text)
↓
Vector (768–1536 dimensions)
↓
Vector Database (Pinecone, Weaviate, Qdrant)
↓
Similarity Search (cosine distance, L2 distance)
↓
Top-K Results + Metadata
Popular Options
| Store | Hosting | Pricing | Best For |
|---|---|---|---|
| Pinecone | Cloud | Pay-per-query | Managed, production-grade |
| Weaviate | Self-hosted | Open-source | Full control, GraphQL API |
| Qdrant | Self-hosted | Open-source | High throughput, scalable |
| Milvus | Self-hosted | Open-source | Distributed, large scale |
| ChromaDB | Local | Open-source | Local development, lightweight |
KV Cache Optimization Impact on RAG
Modern KV cache techniques (GQA, INT8/INT4 quantization, PagedAttention, TurboQuant) also benefit RAG workloads:
- Longer context windows: More retrieved documents fit in the prompt
- Lower memory usage: Frees VRAM for larger batch sizes
- Higher throughput: More RAG queries processed per second
- Vector search acceleration: TurboQuant’s QJL algorithm also applies to vector similarity search, achieving superior 1@k recall compared to PQ and RaBitQ baselines (tested on GloVe, d=200)
- See Doc 02 for details on specific KV cache techniques
Memory Patterns in Harnesses
Memory Pattern Used in Production Agent Systems
Session start:
1. Load CLAUDE.md files (config + instructions) — 5K tokens
2. Load MEMORY.md index (pointers) — <1K tokens
3. Load project state file — 2K tokens
4. Ready for work — ~8K tokens used
During session:
1. Agent reasons and acts
2. Tools execute, return observations
3. Session history accumulates
4. On-demand: Load topic file when specific knowledge needed
Between sessions (Auto-Dream):
if time_since_consolidation >= 24 hours and new_sessions >= 5:
    1. Scan session transcripts
    2. Extract signals: corrections, patterns, decisions
    3. Merge with MEMORY.md
    4. Remove contradictions, old references
    5. Keep index ≤ 200 lines
    6. Update pointers to topic files
Implementation Checklist
- Create CLAUDE.md with core rules and instructions
- Set up MEMORY.md as compact index (≤200 lines)
- Create topic files: debugging.md, decisions.md, conventions.md
- Establish progress.md for tracking feature completion
- Implement session startup: load instructions + memory index
- Implement auto-dream: consolidate memory after 24+ hours and 5+ new sessions
- Monitor context usage: ensure startup load <10% of context window
- Set up RAG if working with large proprietary knowledge bases
- For long contexts: Enable INT8/INT4 KV cache quantization (see Doc 02)
Context Budgeting for Your Harness
Assuming 128K context window:
Instructions (CLAUDE.md) ~5K tokens (4%)
Memory index (MEMORY.md) ~1K tokens (1%)
Project state (progress.md) ~3K tokens (2%)
─────────────────
Loaded at startup ~9K tokens (7%)
Topic files (on-demand) ~10K tokens (8%) [when needed]
Current session/loop ~30K tokens (23%) [working area]
─────────────────
Available for tool results ~79K tokens (62%) [buffer]
Rule of thumb: Keep startup load <10% so agent has 90% for actual work.
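A one-line guard for that rule of thumb (the token count would come from your tokenizer; the function name is illustrative):

```python
def startup_load_ok(startup_tokens, window=128_000, limit=0.10):
    """Rule of thumb: startup load should stay below 10% of the context window."""
    return startup_tokens / window < limit
```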
Stateless Agent Pattern: Rebuilding Context From Ground Truth
For agents that run iteratively (search, reason, search again), there is an alternative to maintaining a growing conversation: make the LLM stateless and rebuild context from ground truth on every call.
How It Works
The LLM has no memory between calls. All state lives in Python and on disk. Each call, Python renders a compact narrative from the current state, and the LLM receives it as a fresh prompt.
Call 1:
Python renders: "Subject: John Smith. No records found yet. Search FreeBMD next."
LLM responds: {"action": "search", "source": "freebmd", "query": "Smith 1842"}
Python executes the search, saves results to disk.
Call 2:
Python renders: "Subject: John Smith. FreeBMD returned 3 matches. [match details].
Evaluate which match is correct."
LLM responds: {"action": "select", "match_id": 2, "confidence": 0.85, "reason": "..."}
Python saves the selection, moves to next step.
Call 3:
Python renders: "Subject: John Smith. Selected match: born 1842, Lambeth.
Search for marriage record next."
LLM responds: {"action": "search", "source": "ancestry", "query": "Smith marriage 1860-1870"}
The context window is just a viewport into the latest state. No conversation history accumulates. No compression is needed. No context overflow occurs.
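One stateless iteration can be sketched as follows, with the LLM call injected as a stub; the state schema and field names here are hypothetical, not from any specific harness:

```python
import json
from pathlib import Path

def render_narrative(state):
    """Rebuild a compact prompt from ground truth on every call; no history carried over."""
    return (f"Subject: {state['subject']}. "
            f"{len(state['records'])} record(s) found so far. "
            f"Next step: {state['next_step']}.")

def step(state, llm_call, state_file: Path):
    """One stateless iteration: render state -> call LLM -> persist to disk."""
    action = llm_call(render_narrative(state))  # llm_call is injected (stub or real API)
    state["records"].append(action)             # update the ground truth in Python
    state_file.write_text(json.dumps(state))    # crash recovery: reload and continue
    return state
```

Because the prompt is a pure function of the state file, any iteration can be replayed by re-rendering the saved state.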
Mapping to the Four-Layer Memory Model
This pattern uses the same four layers, but allocates them differently:
| Layer | Traditional Agent | Stateless Agent |
|---|---|---|
| 1: Context | System prompt + conversation history (grows) | System prompt + rendered narrative + new results (rebuilt each call, fixed size) |
| 2: Working | Accumulates in context window | Python in-process variables (discarded after each call) |
| 3: Persistent | Files on disk, loaded on demand | JSON files on disk (the single source of truth, updated after every call) |
| 4: Consolidation | Auto-dream merges sessions | Not needed — state is already compact by design |
Why This Works
The key insight is that Layer 3 (persistent state on disk) becomes the authoritative record, not the conversation history. Python reads the current state, renders it as a compact narrative (typically 200-500 tokens), appends the latest results, and sends this as a fresh prompt. The LLM never sees the history of how the state was built — only the current state and the next decision to make.
Advantages Over Conversation-Based Agents
- No growing context: The 100th call uses the same number of tokens as the 1st
- No degradation over long runs: No summarisation artifacts, no forgotten details, no attention decay
- No context overflow: State size is controlled by Python, not by conversation length
- Deterministic replay: The same state file always produces the same prompt, making debugging trivial
- Crash recovery for free: If the process crashes, restart it — Python reads state from disk and continues from where it left off
When to Use This Pattern
Use stateless agents when:
- The task is iterative (many calls with incremental progress)
- State can be represented compactly (structured data, not open-ended conversation)
- You need to run hundreds or thousands of iterations without degradation
- Crash recovery matters (the agent must resume cleanly)
Use conversation-based agents when:
- The interaction is genuinely conversational (human in the loop, back-and-forth)
- Context from earlier turns is semantically important and hard to render from state
- The task completes in fewer than 10 turns (context growth is not an issue)
Validation Checklist
How do you know you got this right?
Performance Checks
- Startup context load <10% of context window (measured with tokenizer)
- Memory index (MEMORY.md) loads in <1 second at session start
- MEMORY.md stays under 200 lines consistently
- Auto-dream consolidation completes in <5 minutes for 5+ sessions
Implementation Checks
- Created CLAUDE.md with core instructions (5K-10K tokens)
- Created MEMORY.md as compact index with backlinks (<200 lines)
- Set up topic files: debugging.md, decisions.md, conventions.md
- Progress file tracks feature completion (JSON or markdown)
- Session startup loads CLAUDE.md + MEMORY.md, NOT all topic files
- Auto-dream runs automatically after 24+ hours and 5+ new sessions
Integration Checks
- Layer 1 (context) feeds layer 2 (working memory during loops)
- Layer 2 results consolidate into Layer 3 during auto-dream
- Layer 3 (persistent files) reload cleanly at next session start
- Memory index links are accurate (no broken pointers to topic files)
Common Failure Modes
- Context window exhaustion: Startup load >15%; trim MEMORY.md or move to topic files
- Sessions forgetting previous work: MEMORY.md not being auto-updated; check auto-dream runs
- MEMORY.md grows unbounded: No consolidation happening; implement auto-dream trigger
- Topic files loaded at startup: Should be on-demand only; move out of session initialization
Sign-Off Criteria
- Memory layers implemented and tested across 3+ sessions
- Startup latency <2 seconds even with large MEMORY.md
- Agent can reference past decisions from previous sessions
- Auto-dream merges overlapping entries and prunes stale info
- Context budget respected: work area has 75%+ free space
See Also
- Doc 05 (AI Agents): ReAct loop builds working memory during perceive-reason-act cycles
- Doc 06 (Harness Architecture): Memory management is component 3 of seven-component system
- Doc 08 (Claw-Code Python): Reference implementation of file-based memory patterns
Alternative to RAG: The LLM Wiki Pattern (Compiled Markdown Knowledge)
A pattern described by Andrej Karpathy in his LLM Wiki gist (April 4, 2026) challenges the vector database approach for small-to-medium knowledge bases. The core insight: instead of re-deriving answers from raw sources on every query, an LLM incrementally maintains a persistent wiki of compiled knowledge.
The LLM Wiki Pattern
Karpathy describes three layers:
- Raw Sources — Immutable curated documents (articles, papers, images, data files)
- The Wiki — LLM-generated markdown files with summaries, entity pages, cross-references
- The Schema — Configuration document (e.g., CLAUDE.md) defining wiki structure and conventions
Special files:
- index.md — Content-oriented catalogue of all pages organised by category with one-line summaries
- log.md — Append-only chronological record with parseable prefixes (e.g., ## [2026-04-02] ingest | Article Title)
knowledge-base/
├── raw/ # Source material (unaltered)
│ ├── papers/
│ ├── docs/
│ └── web-clips/
└── wiki/ # LLM-compiled knowledge (structured markdown)
├── index.md # Content catalogue, entry points
├── log.md # Append-only chronological record
├── concepts/
│ ├── topic_a.md
│ ├── topic_b.md
│ └── backlinks/
└── summaries/
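The parseable log.md prefix shown above lends itself to trivial tooling; a sketch of a log-entry parser (the prefix format is from the example, the code is illustrative):

```python
import re

LOG_LINE = re.compile(r"^## \[(\d{4}-\d{2}-\d{2})\] (\w+) \| (.+)$")

def parse_log_entry(line):
    """Parse a log.md heading into (date, operation, title); None if not a log entry."""
    m = LOG_LINE.match(line)
    return m.groups() if m else None
```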
Three core operations (from the gist):
- Ingest: Drop a new source; LLM reads it, discusses takeaways, writes a summary, updates 10-15 wiki pages, appends a log entry
- Query: Ask questions against wiki pages; LLM synthesises answers with citations, files valuable results back as new pages
- Lint: Periodically health-check for contradictions, stale claims, orphan pages, missing cross-references
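The Lint operation can start as simple scripted checks; for example, a sketch that flags orphan pages never referenced from index.md (layout and check are illustrative):

```python
from pathlib import Path

def find_orphan_pages(wiki_dir: Path):
    """One lint check: wiki pages whose name never appears in index.md."""
    index_text = (wiki_dir / "index.md").read_text()
    pages = [p for p in wiki_dir.rglob("*.md") if p.name not in ("index.md", "log.md")]
    return sorted(p.name for p in pages if p.stem not in index_text)
```

Contradiction and staleness checks need the LLM itself; structural checks like this one stay deterministic.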
LLM role shifts: From retriever to librarian.
- Reads raw source files
- Compiles structured wiki pages (summaries, key concepts, encyclopedia-style articles)
- Maintains backlinks between related ideas
- Periodically lints the wiki (health checks, finds inconsistencies, updates stale info)
How queries work:
- Query comes in
- Look up relevant wiki page(s) using index + natural language matching
- Inject wiki content (already summarised, structured, interconnected)
- Model generates response from compiled knowledge
Why This Matters: Compiled vs. Raw
Traditional RAG:
Raw Papers → Vector Index → Similarity Search → Chunk Retrieval → Generation
(Every query re-reads papers, re-chunks, re-synthesizes)
LLM Wiki Pattern:
Raw Papers → LLM Compiles → Structured Wiki → Query Wiki → Generation
(Papers compiled once; queries run against compiled artifact)
Analogy: Source code vs. compiled binary. The key insight: compile your knowledge first.
Practical Implementation
Setup (from the gist’s recommended tooling):
- Use Obsidian Web Clipper to convert web articles to markdown
- Use Obsidian Graph View to visualise wiki connections
- Use Dataview to query page frontmatter
- Use qmd (local markdown search engine with CLI and MCP support) when the wiki outgrows simple index lookup
- Store images locally (so LLM vision can reference them)
- LLM reads raw markdown, writes structured wiki pages
- Run periodic linting passes
- Use git for version control of the markdown repository
Karpathy emphasises this is an “idea file” designed to be adapted: “everything mentioned above is optional and modular.”
Scale Limits & Trade-offs
Best for (Karpathy notes the index approach works at “small scale (~100 sources, ~hundreds of pages)”):
- Personal knowledge bases
- Small-team wikis
- Internal company wikis
- Project-specific knowledge
Why it works at this scale:
- Well-organised markdown with summaries
- Index files act as routing
- LLM can reason over entire structure
- More useful context than vector search
At larger scales: Karpathy recommends adding qmd (a local search engine with “hybrid BM25/vector search and LLM re-ranking”) for collections that outgrow the index approach.
At enterprise scale (millions of documents, strict latency):
- Traditional retrieval infrastructure (vector DBs) still necessary
- But principle still applies: compile/summarise first, then retrieve
- Hybrid: compile into chunks, then vector index the chunks
Hybrid Approach: RAG + LLM Wiki
For medium-large harnesses, combine both:
Raw Sources
↓
[LLM Compile Phase]
├─ Create wiki summaries
├─ Extract key concepts
└─ Generate backlinks
↓
Structured Wiki Markdown
↓
[Vector Index the Wiki] (not raw sources)
├─ Embed wiki pages (not raw PDFs)
├─ Index structured content
└─ Enable semantic search
↓
Query Time:
1. Try direct wiki lookup (fast)
2. If needed, semantic search in wiki index
3. Inject compiled knowledge
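The two-step query flow can be sketched as follows, with semantic_search injected as a fallback (e.g. a vector search over the wiki pages; all names here are illustrative):

```python
def hybrid_lookup(query, index, semantic_search):
    """Query time: try a direct index hit first; fall back to semantic search."""
    hits = [page for title, page in index.items() if title.lower() in query.lower()]
    return hits if hits else semantic_search(query)
```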
Implications for Your Harness
Apply LLM Wiki pattern to long-term memory:
harness-project/.claude/memory/
├── raw/
│ ├── session-transcripts/ # Auto-saved session outputs
│ ├── error-logs/ # Failures, debugging info
│ └── decisions-log/ # Raw notes on choices
└── wiki/
├── MEMORY.md # Curated index + routing
├── debugging.md # Compiled debugging patterns
├── architecture.md # Architecture decisions (compiled)
├── api-conventions.md # API/tool patterns
└── backlinks/ # Cross-references
During auto-dream consolidation:
1. Scan raw/ (session transcripts, errors, decisions)
2. Extract signals (patterns, corrections, insights)
3. Compile into wiki/ (structured markdown)
4. Update MEMORY.md index with new backlinks
5. Delete old raw sessions (keep last 10)
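Step 5 above is straightforward to script; a sketch assuming transcripts use zero-padded session-NN.md filenames (a naming convention chosen here for illustration):

```python
from pathlib import Path

def prune_raw_sessions(raw_dir: Path, keep: int = 10):
    """Delete old raw session transcripts, keeping only the newest N."""
    sessions = sorted(raw_dir.glob("session-*.md"))  # zero-padded names sort chronologically
    for old in sessions[:-keep]:
        old.unlink()
    return sorted(p.name for p in raw_dir.glob("session-*.md"))
```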
Benefits:
- Wiki pages are human-readable (you can edit them)
- Backlinks create mental model (connections surface)
- Auto-linting finds stale/contradictory entries
- More useful than vector search for structured reasoning
2026 Outlook
The debate: Is RAG obsolete for small-to-medium cases?
Measured view:
- Principle (compiled, structured, LLM-maintained) is correct
- Whether plain markdown is the substrate depends on scale
- Hybrid (compile + vector index) likely sweet spot
- For personal projects + small teams: pure markdown wiki wins
- For enterprise (millions docs, strict latency): retrieval infrastructure necessary
Community impact: Karpathy’s gist has sparked significant discussion. The industry is actively questioning whether traditional RAG is overengineered for small-to-medium knowledge bases.
Market Context: RAG in 2025–2026
- RAG market: Growing rapidly, with some estimates placing it at over $1B in 2024
- Trend shift: From pure RAG → compiled + structured knowledge
- Hybrid approaches: Compile first, then optionally vector index
- Cost driver: Inference (queries) > training
- 2026 outlook:
- KV cache quantization (GQA, INT8/INT4, TurboQuant 3-bit) makes inference more efficient — see Doc 02
- LLM Wiki pattern gaining traction for small-to-medium teams
- Hybrid (markdown wiki + optional vector index) emerging as standard
Citations
- Andrej Karpathy’s LLM Wiki Gist — April 4, 2026. Describes the three-layer LLM Wiki pattern (Raw Sources, Wiki, Schema) with Ingest/Query/Lint operations.
- TurboQuant: Redefining AI Efficiency with Extreme Compression — Google Research Blog, March 24, 2026. Referenced for vector search acceleration benefits relevant to RAG.