Summary
AI agents are fundamentally stateless: every conversation starts fresh, and every context window eventually fills up. Agent memory patterns address this by externalizing state to durable storage, which enables checkpoint/resume workflows, human-in-the-loop approvals, and fault-tolerant execution. This article covers three memory tiers (session, file-based, event-sourced) and their implementation patterns.
The Problem
LLMs have no inherent memory between sessions. Within a session, context accumulates until (a) it exceeds the context window, (b) it degrades through attention dilution, or (c) the session terminates. For production agents, this creates three core challenges:
Challenge 1: Human-in-the-loop workflows
An agent needs approval before deploying to production. The approval takes 30 minutes. The agent cannot wait 30 minutes with an open API connection. How does it resume after approval?
Challenge 2: Fault tolerance
An agent is 80% through a complex task when the process crashes. Without checkpointing, all progress is lost. The agent must restart from scratch, potentially duplicating work or creating inconsistent state.
Challenge 3: Context rot
After extended work, context quality degrades. The RALPH Loop pattern solves this by spawning fresh agents, but fresh agents have no memory of what the previous agent accomplished.
The Solution
Externalize agent state to durable storage using three complementary memory tiers:
┌─────────────────────────────────────────────────────────────────┐
│ MEMORY HIERARCHY │
├─────────────────────────────────────────────────────────────────┤
│ Tier 1: Session State │
│ - In-memory during execution │
│ - Lost when session ends │
│ - Fast, but ephemeral │
├─────────────────────────────────────────────────────────────────┤
│ Tier 2: File-Based Memory │
│ - TASKS.md, progress.txt, ERRORS.md │
│ - Survives session boundaries │
│ - Git-committed, human-readable │
├─────────────────────────────────────────────────────────────────┤
│ Tier 3: Event-Sourced State │
│ - Append-only event log │
│ - Full history reconstruction │
│ - Supports time-travel debugging │
└─────────────────────────────────────────────────────────────────┘
Each tier serves different durability and query needs. Most production agents use a combination.
Tier 1: Session State
Session state lives in memory during agent execution. It’s the default mode for simple agents.
Implementation
interface SessionState {
taskId: string;
startedAt: Date;
currentStep: number;
toolCallHistory: ToolCall[];
pendingApproval: Approval | null;
errors: Error[];
}
async function runAgent(task: string): Promise<Result<string>> {
const state: SessionState = {
taskId: crypto.randomUUID(),
startedAt: new Date(),
currentStep: 0,
toolCallHistory: [],
pendingApproval: null,
errors: [],
};
while (state.currentStep < MAX_STEPS) {
const toolCall = await llm.getNextAction(state);
state.toolCallHistory.push(toolCall);
if (requiresApproval(toolCall)) {
// Session state cannot survive here
// This is where Tier 2 or 3 becomes necessary
state.pendingApproval = await requestApproval(toolCall);
}
const result = await executeToolCall(toolCall);
state.currentStep++;
}
return { success: true, data: "completed" };
}
Limitations
- Lost on process termination
- Cannot pause for human approval
- No fault tolerance
- No progress visibility for long-running tasks
When to Use
- Single-turn interactions
- Sub-agents with scoped, quick tasks
- Development and debugging
Tier 2: File-Based Memory
File-based memory persists state to the filesystem, enabling cross-session continuity. This is the pattern used by the RALPH Loop.
Core Files
project/
├── TASKS.md # Task queue and completion status
├── progress.txt # Session logs and recent activity
├── ERRORS.md # Persistent error memory
├── LEARNINGS.md # Accumulated insights
└── features.json # Milestone tracking
TASKS.md Pattern
Track work items with completion status:
# Tasks
## In Progress
- [ ] Implement user authentication (started: 2026-01-28 10:00)
  - [x] Create login endpoint
  - [x] Add password hashing
  - [ ] Implement token refresh
## Completed
- [x] Set up database schema (completed: 2026-01-27)
- [x] Create user model (completed: 2026-01-27)
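The checklist format above is simple enough to parse line by line. The following is a minimal sketch of one way to do it, assuming sub-tasks are indented two spaces under their parent. `TaskItem`, `parseTasks`, and `firstIncompleteLeaf` are hypothetical names introduced for illustration and are not defined by the RALPH Loop.

```typescript
// Hypothetical shape for one checklist line; not defined by the article.
interface TaskItem {
  title: string;
  done: boolean;
  depth: number; // nesting level, assuming 2 spaces of indentation per level
}

// Matches lines such as "- [ ] Title" or "  - [x] Title".
const CHECKBOX_LINE = /^(\s*)- \[([ xX])\] (.+)$/;

function parseTasks(markdown: string): TaskItem[] {
  const items: TaskItem[] = [];
  for (const line of markdown.split("\n")) {
    const match = CHECKBOX_LINE.exec(line);
    if (!match) continue; // headings and blank lines are ignored
    items.push({
      title: match[3].trim(),
      done: match[2].toLowerCase() === "x",
      depth: Math.floor(match[1].length / 2),
    });
  }
  return items;
}

// Returns the first unchecked item that has no sub-tasks,
// i.e. the next concrete piece of work, or null if everything is done.
function firstIncompleteLeaf(items: TaskItem[]): TaskItem | null {
  for (let i = 0; i < items.length; i++) {
    const next = items[i + 1];
    const hasChildren = next !== undefined && next.depth > items[i].depth;
    if (!items[i].done && !hasChildren) return items[i];
  }
  return null;
}
```

Skipping unchecked parents keeps the agent focused on the smallest open unit of work; a parent can be ticked once all of its sub-tasks are done.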
progress.txt Pattern
Log session activity for next-agent context:
# Progress Log
## Current Status (Updated: 2026-01-28 10:30)
- Active Task: Implement user authentication
- Last Completed: Token refresh endpoint
- Blockers: None
## Recent Activity
### 2026-01-28 10:30 - Token refresh endpoint
- What: Implemented /auth/refresh with JWT rotation
- Files: src/routes/auth.ts, src/services/token.ts
- Outcome: Success
- Next: Add rate limiting to auth endpoints
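An entry in this layout can be generated from structured data, so every session writes the log the same way. This is a small sketch under stated assumptions: the `ProgressEntry` shape and `formatProgressEntry` are hypothetical, and the `"Failure"` outcome value is an assumption (the sample above only shows `Success`).

```typescript
// Hypothetical entry shape mirroring the fields in the sample log.
interface ProgressEntry {
  timestamp: string; // e.g. "2026-01-28 10:30"
  title: string;
  what: string;
  files: string[];
  outcome: "Success" | "Failure"; // "Failure" is an assumed value
  next: string;
}

// Renders one entry in the "### <time> - <title>" layout used in progress.txt.
function formatProgressEntry(entry: ProgressEntry): string {
  return [
    `### ${entry.timestamp} - ${entry.title}`,
    `- What: ${entry.what}`,
    `- Files: ${entry.files.join(", ")}`,
    `- Outcome: ${entry.outcome}`,
    `- Next: ${entry.next}`,
    "", // trailing newline so consecutive entries stay separated
  ].join("\n");
}
```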
Implementation
async function runRALPHIteration(): Promise<void> {
// 1. Load state from files
const tasks = await readFile("TASKS.md", "utf-8");
const progress = await readFile("progress.txt", "utf-8");
// 2. Parse current task
const currentTask = parseFirstIncomplete(tasks);
// 3. Execute task
const result = await executeTask(currentTask);
// 4. Persist state back to files
await updateTaskStatus(currentTask.id, result.status);
await appendProgressEntry(result);
// 5. Commit to git
// Template literal (backticks) so that ${currentTask.title} is actually interpolated
await exec(`git add -A && git commit -m "Progress: ${currentTask.title}"`);
}
Git as Durability Layer
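Committing the memory files to git after each significant action gives every checkpoint a recoverable, versioned history. Pushing to a remote extends that durability beyond the local disk: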
# After each significant action
git add TASKS.md progress.txt
git commit -m "[progress]: Completed login endpoint"
# Recovery after crash
git log --oneline -5 # See recent progress
git diff HEAD~1 TASKS.md # See what changed
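When the commit message is built from task titles, passing it through a shell string can break on quotes or `$` characters. The sketch below shows one possible alternative using Node's `execFile`, which takes arguments as an array and does not invoke a shell. `buildCommitArgs` and `commitCheckpoint` are hypothetical helper names.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Builds the argument vector for `git commit`; kept pure so it is easy to test.
function buildCommitArgs(message: string, files: string[]): string[] {
  return ["commit", "-m", message, "--", ...files];
}

// Stages and commits the given files. Arguments are passed as an array,
// so a message containing quotes or `$` is never interpreted by a shell.
async function commitCheckpoint(message: string, files: string[]): Promise<void> {
  await execFileAsync("git", ["add", "--", ...files]);
  await execFileAsync("git", buildCommitArgs(message, files));
}
```

Note that `git commit` exits with a non-zero status when there is nothing to commit, so a caller would need to decide whether to treat that rejection as an error.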
Benefits
- Human-readable and editable
- Version-controlled history
- Survives process crashes
- Works with RALPH Loop pattern
- No external services required (only the filesystem and git)
Limitations
- No built-in concurrency handling
- Query capability limited to grep/search
- State reconstruction requires parsing
- No transactional guarantees
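The lack of transactional guarantees can be softened for single-file updates by writing to a temporary file and renaming it into place. The following is a minimal sketch of that technique, not something the RALPH Loop prescribes; `writeFileAtomic` and `readMemoryFile` are hypothetical helpers, and synchronous calls are used only to keep the example short.

```typescript
import { existsSync, readFileSync, renameSync, writeFileSync } from "node:fs";

// Writes the full contents to a temporary sibling file, then renames it over
// the target. On POSIX filesystems a rename within one filesystem is atomic,
// so a reader sees either the old file or the new one, never a partial write.
// This sketch does not fsync, so it guards against process crashes rather
// than power loss.
function writeFileAtomic(path: string, contents: string): void {
  const tempPath = `${path}.tmp-${process.pid}`;
  writeFileSync(tempPath, contents, "utf-8");
  renameSync(tempPath, path);
}

// Reads a memory file, falling back to a default when it does not exist yet.
function readMemoryFile(path: string, fallback = ""): string {
  return existsSync(path) ? readFileSync(path, "utf-8") : fallback;
}
```

This covers one file at a time. Updating TASKS.md and progress.txt together is still not atomic, which is where a git commit or an event log helps.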
Tier 3: Event-Sourced State
Event sourcing treats state as derived from an append-only log of events. This is the pattern recommended by the 12 Factor Agents framework (Factors 5 and 6).
Core Concept
Instead of storing current state, store the sequence of events that produced it:
Events: → Derived State:
┌────────────────────────────┐ ┌─────────────────────┐
│ 1. task_started │ │ status: "running" │
│ 2. tool_called(read_file) │ │ step: 3 │
│ 3. tool_result(success) │ → │ approvals: 0 │
│ 4. tool_called(edit_file) │ │ errors: 0 │
│ 5. approval_requested │ │ canResume: true │
│ 6. (waiting...) │ │ pendingApproval: {} │
└────────────────────────────┘ └─────────────────────┘
Implementation
interface AgentThread {
id: string;
events: AgentEvent[];
status: "running" | "paused" | "completed" | "failed";
}
type AgentEvent =
  | { type: "task_started"; task: string; timestamp: Date }
  | { type: "tool_called"; tool: string; params: unknown; timestamp: Date }
  | { type: "tool_result"; result: unknown; timestamp: Date }
  | { type: "approval_requested"; action: string; timestamp: Date }
  | { type: "approval_granted"; by: string; timestamp: Date }
  | { type: "approval_denied"; by: string; reason: string; timestamp: Date }
  | { type: "error"; error: string; stack?: string; timestamp: Date }
  | { type: "task_completed"; result: unknown; timestamp: Date }
  // Control events pushed by pause() and resume() in ResumableAgent
  | { type: "paused"; reason: string; timestamp: Date }
  | { type: "human_feedback"; content: string; timestamp: Date }
  | { type: "resumed"; timestamp: Date };
// State is DERIVED from events, never stored directly
function deriveState(thread: AgentThread): ExecutionState {
const events = thread.events;
return {
currentStep: events.filter(e => e.type === "tool_result").length,
pendingApprovals: events.filter(e =>
e.type === "approval_requested" &&
!events.find(a =>
(a.type === "approval_granted" || a.type === "approval_denied") &&
a.timestamp > e.timestamp
)
),
errors: events.filter(e => e.type === "error"),
lastEvent: events[events.length - 1],
canResume: thread.status === "paused",
};
}
Checkpoint/Resume Pattern
The key pattern for human-in-the-loop workflows:
class ResumableAgent {
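private db: ThreadStore;
private llm: LLMClient; // used by run(); LLMClient is an assumed type, not defined in this article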
async launch(task: string): Promise<AgentThread> {
const thread: AgentThread = {
id: crypto.randomUUID(),
events: [{ type: "task_started", task, timestamp: new Date() }],
status: "running",
};
await this.db.saveThread(thread);
return this.run(thread);
}
async pause(threadId: string, reason: string): Promise<void> {
const thread = await this.db.getThread(threadId);
thread.events.push({
type: "paused",
reason,
timestamp: new Date()
});
thread.status = "paused";
await this.db.saveThread(thread);
}
async resume(threadId: string, feedback?: string): Promise<AgentThread> {
const thread = await this.db.getThread(threadId);
if (feedback) {
thread.events.push({
type: "human_feedback",
content: feedback,
timestamp: new Date()
});
}
thread.events.push({
type: "resumed",
timestamp: new Date()
});
thread.status = "running";
return this.run(thread);
}
private async run(thread: AgentThread): Promise<AgentThread> {
while (thread.status === "running") {
// Rebuild context from events
const state = deriveState(thread);
const context = buildContextFromEvents(thread.events);
// Get next action
const toolCall = await this.llm.getNextAction(context);
thread.events.push({
type: "tool_called",
tool: toolCall.name,
params: toolCall.params,
timestamp: new Date()
});
// Handle approval-required tools
if (requiresApproval(toolCall)) {
thread.events.push({
type: "approval_requested",
action: toolCall.name,
timestamp: new Date()
});
thread.status = "paused";
await this.db.saveThread(thread);
// Notify human and return - execution stops here
await this.notifyHuman(thread.id, toolCall);
return thread;
}
// Execute tool
const result = await executeToolCall(toolCall);
thread.events.push({
type: "tool_result",
result,
timestamp: new Date()
});
// Checkpoint after each tool call
await this.db.saveThread(thread);
// Check for completion
if (isComplete(result)) {
thread.events.push({
type: "task_completed",
result,
timestamp: new Date()
});
thread.status = "completed";
}
}
await this.db.saveThread(thread);
return thread;
}
}
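Note that in the class above, `run` asks the model for a new action after `resume`, so the call that was approved is not executed unless something replays it. One possible way to find that call in the event log is sketched here. `LogEvent`, `ApprovedCall`, and `findApprovedPendingCall` are hypothetical names that restate only the fields the sketch reads; a real implementation would accept the full `AgentEvent` union.

```typescript
// Minimal event shapes, restating only the fields this sketch reads.
type LogEvent =
  | { type: "tool_called"; tool: string; params: unknown }
  | { type: "tool_result"; result: unknown }
  | { type: "approval_requested"; action: string }
  | { type: "approval_granted"; by: string }
  | { type: "approval_denied"; by: string; reason: string };

interface ApprovedCall {
  tool: string;
  params: unknown;
}

// Walks the log in order and returns the tool call that was approved but has
// not produced a tool_result yet, or null if there is none.
function findApprovedPendingCall(events: LogEvent[]): ApprovedCall | null {
  let lastCall: ApprovedCall | null = null;
  let awaiting: ApprovedCall | null = null; // call waiting on a decision
  let approved: ApprovedCall | null = null; // call approved, not yet executed

  for (const event of events) {
    switch (event.type) {
      case "tool_called":
        lastCall = { tool: event.tool, params: event.params };
        break;
      case "approval_requested":
        awaiting = lastCall;
        break;
      case "approval_granted":
        approved = awaiting;
        awaiting = null;
        break;
      case "approval_denied":
        awaiting = null;
        break;
      case "tool_result":
        approved = null; // the approved call (if any) has now run
        break;
    }
  }
  return approved;
}
```

A `resume` implementation could call this before re-entering the loop and execute the returned call directly, rather than sending it through the approval check a second time.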
Webhook Integration
Enable external systems to resume agents:
// Express endpoint for webhook-triggered resume
app.post("/webhook/resume/:threadId", async (req, res) => {
  const { threadId } = req.params;
  const { approved, feedback, approver } = req.body;
  const thread = await db.getThread(threadId);
  if (approved) {
    thread.events.push({
      type: "approval_granted",
      by: approver,
      timestamp: new Date(),
    });
    // Persist the approval first: resume() reloads the thread from the store,
    // so an unsaved event would be lost
    await db.saveThread(thread);
    // Resume in background, return immediately
    agent.resume(threadId, feedback).catch(console.error);
    res.json({ status: "resuming", threadId });
  } else {
    thread.events.push({
      type: "approval_denied",
      by: approver,
      reason: feedback,
      timestamp: new Date(),
    });
    thread.status = "failed";
    await db.saveThread(thread);
    res.json({ status: "denied", threadId });
  }
});
Benefits
- Complete audit trail
- Time-travel debugging (replay events)
- Natural pause/resume support
- State reconstruction from any point
- Safer concurrent writes: appending never overwrites earlier events (the store must still order concurrent appends)
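Time-travel debugging follows directly from treating state as a fold over events: replaying only a prefix of the log yields the state at that moment. The sketch below illustrates the idea with deliberately small types. `ReplayEvent`, `ReplayState`, `applyEvent`, and `stateAfter` are hypothetical names, not the article's `deriveState`.

```typescript
// Minimal event and state shapes for this sketch only.
interface ReplayEvent {
  type: string;
  timestamp: Date;
}

interface ReplayState {
  steps: number;
  errors: number;
  lastEventType: string | null;
}

const INITIAL_STATE: ReplayState = { steps: 0, errors: 0, lastEventType: null };

// Pure reducer: the next state depends only on the previous state and one event.
function applyEvent(state: ReplayState, event: ReplayEvent): ReplayState {
  return {
    steps: state.steps + (event.type === "tool_result" ? 1 : 0),
    errors: state.errors + (event.type === "error" ? 1 : 0),
    lastEventType: event.type,
  };
}

// Rebuilds the state as it was after the first `count` events.
function stateAfter(events: ReplayEvent[], count: number): ReplayState {
  return events.slice(0, count).reduce(applyEvent, INITIAL_STATE);
}
```

Because the reducer is pure, calling `stateAfter` with increasing counts steps through the run one event at a time without touching the stored log.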
Limitations
- More complex implementation
- Requires external storage (database, file, etc.)
- Event schema evolution needs care
- Higher storage requirements
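The storage requirement does not have to mean a database. As one possible illustration, the sketch below keeps the log in a JSON Lines file, one event per line. The file format, `StoredEvent`, `appendEvent`, and `loadEvents` are assumptions made for this example; synchronous calls keep it short.

```typescript
import { appendFileSync, existsSync, readFileSync } from "node:fs";

// Hypothetical stored-event shape: a type, a timestamp, and any extra fields.
interface StoredEvent {
  type: string;
  timestamp: Date;
  [key: string]: unknown;
}

// Appends one event as a single JSON line. Existing lines are never rewritten.
function appendEvent(logPath: string, event: StoredEvent): void {
  appendFileSync(logPath, JSON.stringify(event) + "\n", "utf-8");
}

// Reads the whole log back. JSON stores dates as ISO strings, so timestamps
// are converted back into Date objects here.
function loadEvents(logPath: string): StoredEvent[] {
  if (!existsSync(logPath)) return [];
  return readFileSync(logPath, "utf-8")
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line): StoredEvent => {
      const raw = JSON.parse(line) as { type: string; timestamp: string };
      return { ...raw, timestamp: new Date(raw.timestamp) };
    });
}
```

Reviving timestamps matters for any code that compares them: after a round trip through JSON they are strings unless converted back.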
Combining Memory Tiers
Production agents often combine tiers:
class ProductionAgent {
// Tier 3: Event log for audit and resume
private eventStore: EventStore;
// Tier 2: File-based for human visibility
private progressFile: string = "progress.txt";
private tasksFile: string = "TASKS.md";
async afterToolCall(thread: AgentThread, toolCall: ToolCall): Promise<void> {
// Append to event store (Tier 3)
await this.eventStore.append(thread.id, {
type: "tool_called",
tool: toolCall.name,
timestamp: new Date(),
});
// Update progress file (Tier 2)
await appendFile(this.progressFile,
`### ${new Date().toISOString()}\n- Tool: ${toolCall.name}\n`
);
// Commit checkpoint to git
if (isSignificantAction(toolCall)) {
await exec(`git add ${this.progressFile} && git commit -m "Progress: ${toolCall.name}"`);
}
}
}
Memory Patterns for RALPH Loop
The RALPH Loop specifically uses Tier 2 (file-based) memory because:
- Each iteration spawns a fresh agent (no session state)
- Memory must survive process boundaries
- Humans need to inspect and edit state
- Git provides durability and history
Implementing RALPH-Compatible Memory
interface RALPHState {
currentTask: string | null;
completedTasks: string[];
blockers: string[];
recentActivity: ActivityEntry[];
}
async function loadRALPHState(): Promise<RALPHState> {
const tasks = await readFile("TASKS.md", "utf-8");
const progress = await readFile("progress.txt", "utf-8");
return {
currentTask: parseCurrentTask(tasks),
completedTasks: parseCompletedTasks(tasks),
blockers: parseBlockers(progress), // "Blockers:" is recorded in progress.txt, not TASKS.md
recentActivity: parseRecentActivity(progress),
};
}
async function saveRALPHState(state: RALPHState): Promise<void> {
// Update TASKS.md with task status
await updateTasksFile(state.completedTasks, state.currentTask);
// Append to progress.txt
await appendProgressEntry(state.recentActivity[0]);
// Commit to git
await exec("git add TASKS.md progress.txt && git commit -m 'RALPH iteration complete'");
}
Best Practices
1. Checkpoint After Every Tool Call
// Good: Checkpoint after each action
for (const toolCall of toolCalls) {
const result = await execute(toolCall);
await checkpoint(thread); // Survives crash
}
// Bad: Checkpoint only at end
for (const toolCall of toolCalls) {
const result = await execute(toolCall);
}
await checkpoint(thread); // Crash loses all progress
2. Separate Event Log from Derived State
// Good: Events are source of truth
const events = await loadEvents(threadId);
const state = deriveState(events);
// Bad: Store both (can become inconsistent)
const { events, state } = await loadThread(threadId);
3. Include Enough Context for Resume
// Good: Event contains all needed context
{
type: "tool_called",
tool: "edit_file",
params: { path: "src/auth.ts", content: "..." },
context: { reason: "Adding JWT validation" },
timestamp: new Date(),
}
// Bad: Event lacks context
{
type: "tool_called",
tool: "edit_file",
}
4. Handle Stale Checkpoints
async function resume(threadId: string): Promise<AgentThread> {
const thread = await db.getThread(threadId);
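const lastEvent = thread.events.at(-1);
// getTime() gives milliseconds; a thread with no events is treated as fresh
const age = lastEvent ? Date.now() - lastEvent.timestamp.getTime() : 0;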
if (age > STALE_THRESHOLD) {
// Checkpoint is old - re-verify state
const currentState = await verifyExternalState();
thread.events.push({
type: "state_verified",
verified: currentState,
timestamp: new Date(),
});
}
return run(thread);
}
5. Version Your Event Schema
interface AgentEvent {
schemaVersion: 1 | 2; // Current version is 2; increment on breaking changes
type: string;
timestamp: Date;
// ... rest of event
}
function migrateEvent(event: AgentEvent): AgentEvent {
if (event.schemaVersion === 1) {
// Migrate v1 to v2
return { ...event, schemaVersion: 2, newField: defaultValue };
}
return event;
}
Common Pitfalls
Pitfall 1: Storing Derived State
// Bad: Storing derived state leads to inconsistency
await db.save({
events: [...events, newEvent],
state: { ...state, stepCount: state.stepCount + 1 }, // Can drift
});
// Good: Persist only events, then derive state from the updated log
const updatedEvents = [...events, newEvent];
await db.save({ events: updatedEvents });
const state = deriveState(updatedEvents);
Pitfall 2: Missing Failure Events
// Bad: Errors not captured in event log
try {
await executeToolCall(toolCall);
} catch (error) {
console.error(error); // Lost
}
// Good: Errors are first-class events
try {
await executeToolCall(toolCall);
} catch (error) {
thread.events.push({
type: "error",
error: error.message,
stack: error.stack,
timestamp: new Date(),
});
}
Pitfall 3: Unbounded Event Logs
// Bad: Events grow forever
thread.events.push(newEvent);
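// Good: Compact old events periodically
if (thread.events.length > MAX_EVENTS) {
// Snapshot only the events being dropped, so the retained tail is not counted twice
const olderEvents = thread.events.slice(0, -RECENT_EVENT_COUNT);
const snapshot = deriveState(olderEvents);
thread.events = [
{ type: "snapshot", state: snapshot, timestamp: new Date() },
...thread.events.slice(-RECENT_EVENT_COUNT),
];
}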
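Compaction only works if state derivation understands snapshot events. The following sketch shows one way the two halves could fit together, using simplified counters in place of the full execution state. `Counters`, `CompactedEvent`, `deriveCounters`, and `compact` are hypothetical names for this example.

```typescript
interface Counters {
  steps: number;
  errors: number;
}

type CompactedEvent =
  | { type: "snapshot"; state: Counters }
  | { type: "tool_result" }
  | { type: "error" };

// Folds a compacted log: a snapshot resets the counters to its stored values,
// and every later event is applied on top of it.
function deriveCounters(events: CompactedEvent[]): Counters {
  let state: Counters = { steps: 0, errors: 0 };
  for (const event of events) {
    if (event.type === "snapshot") {
      state = { ...event.state };
    } else if (event.type === "tool_result") {
      state = { ...state, steps: state.steps + 1 };
    } else {
      state = { ...state, errors: state.errors + 1 };
    }
  }
  return state;
}

// Replaces everything except the last `keep` events with one snapshot event.
// The snapshot covers only the dropped prefix, so nothing is counted twice.
function compact(events: CompactedEvent[], keep: number): CompactedEvent[] {
  if (events.length <= keep) return events;
  const cut = events.length - keep;
  const snapshot = deriveCounters(events.slice(0, cut));
  return [{ type: "snapshot", state: snapshot }, ...events.slice(cut)];
}
```

The property worth checking is that deriving from the compacted log gives the same result as deriving from the full log. If a complete audit trail is required, the dropped events would need to be archived elsewhere before compaction.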
Pitfall 4: Blocking on Human Approval
// Bad: Process blocks waiting for human
const approval = await waitForApproval(toolCall); // Blocks indefinitely
// Good: Persist and exit, resume via webhook
await requestApproval(toolCall);
thread.status = "paused";
await db.save(thread);
return thread; // Process exits, webhook resumes
Related
- 12 Factor Agents – Factors 5 and 6 define event-based state and pause/resume
- The RALPH Loop – File-based memory for fresh-context iterations
- Clean Slate Trajectory Recovery – When to abandon session state
- Institutional Memory with Learning Files – Long-term decision memory
- Event Sourcing for Agents – Deep dive on event-sourced architecture
- Human in the Loop Patterns – Approval workflows using checkpoint/resume
References
- HumanLayer: 12 Factor Agents – Original 12 Factor Agents article
- Event Sourcing Pattern – Martin Fowler on event sourcing
- Anthropic: Effective Harnesses – Long-running agent patterns

