Agent Memory Patterns: Checkpoint, Resume, and State Persistence

James Phoenix

Summary

LLM-based agents are stateless by default: every conversation starts fresh, and every context window eventually fills up or is discarded. Agent memory patterns address this by externalizing state to durable storage, which enables checkpoint/resume workflows, human-in-the-loop approvals, and fault-tolerant execution. This article covers three memory tiers (session, file-based, event-sourced) and their implementation patterns.

The Problem

LLMs have no inherent memory between sessions. Within a session, context accumulates until (a) it exceeds the context window, (b) it degrades through attention dilution, or (c) the session terminates. For production agents, this creates three core challenges:

Challenge 1: Human-in-the-loop workflows

An agent needs approval before deploying to production. The approval takes 30 minutes. The agent cannot wait 30 minutes with an open API connection. How does it resume after approval?

Challenge 2: Fault tolerance

An agent is 80% through a complex task when the process crashes. Without checkpointing, all progress is lost. The agent must restart from scratch, potentially duplicating work or creating inconsistent state.

Challenge 3: Context rot

After extended work, context quality degrades. The RALPH Loop pattern solves this by spawning fresh agents, but fresh agents have no memory of what the previous agent accomplished.

The Solution

Externalize agent state to durable storage using three complementary memory tiers:

┌─────────────────────────────────────────────────────────────────┐
│                    MEMORY HIERARCHY                              │
├─────────────────────────────────────────────────────────────────┤
│ Tier 1: Session State                                           │
│ - In-memory during execution                                     │
│ - Lost when session ends                                         │
│ - Fast, but ephemeral                                            │
├─────────────────────────────────────────────────────────────────┤
│ Tier 2: File-Based Memory                                        │
│ - TASKS.md, progress.txt, ERRORS.md                              │
│ - Survives session boundaries                                    │
│ - Git-committed, human-readable                                  │
├─────────────────────────────────────────────────────────────────┤
│ Tier 3: Event-Sourced State                                      │
│ - Append-only event log                                          │
│ - Full history reconstruction                                    │
│ - Supports time-travel debugging                                 │
└─────────────────────────────────────────────────────────────────┘

Each tier serves different durability and query needs. Most production agents use a combination.

Tier 1: Session State

Session state lives in memory during agent execution. It’s the default mode for simple agents.

Implementation

interface SessionState {
  taskId: string;
  startedAt: Date;
  currentStep: number;
  toolCallHistory: ToolCall[];
  pendingApproval: Approval | null;
  errors: Error[];
}

async function runAgent(task: string): Promise<Result<string>> {
  const state: SessionState = {
    taskId: crypto.randomUUID(),
    startedAt: new Date(),
    currentStep: 0,
    toolCallHistory: [],
    pendingApproval: null,
    errors: [],
  };

  while (state.currentStep < MAX_STEPS) {
    const toolCall = await llm.getNextAction(state);
    state.toolCallHistory.push(toolCall);

    if (requiresApproval(toolCall)) {
      // This await keeps the process open until a human responds.
      // If the process exits first, the in-memory state is gone;
      // this is where Tier 2 or 3 becomes necessary.
      state.pendingApproval = await requestApproval(toolCall);
    }

    const result = await executeToolCall(toolCall);
    state.currentStep++;
  }

  return { success: true, data: "completed" };
}

Limitations

Lost on process termination
Cannot pause for human approval
No fault tolerance
No progress visibility for long-running tasks

When to Use

Single-turn interactions
Sub-agents with scoped, quick tasks
Development and debugging

Tier 2: File-Based Memory

File-based memory persists state to the filesystem, enabling cross-session continuity. This is the pattern used by the RALPH Loop.

Core Files

project/
├── TASKS.md           # Task queue and completion status
├── progress.txt       # Session logs and recent activity
├── ERRORS.md          # Persistent error memory
├── LEARNINGS.md       # Accumulated insights
└── features.json      # Milestone tracking

TASKS.md Pattern

Track work items with completion status:

# Tasks

## In Progress
- [ ] Implement user authentication (started: 2026-01-28 10:00)
  - [x] Create login endpoint
  - [x] Add password hashing
  - [ ] Implement token refresh

## Completed
- [x] Set up database schema (completed: 2026-01-27)
- [x] Create user model (completed: 2026-01-27)
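
The implementation later in this section calls a `parseFirstIncomplete` helper without showing it. A minimal sketch of what such a parser might look like follows, assuming the checkbox layout shown above. The `ParsedTask` shape and the decision to skip indented sub-items are assumptions made for illustration, not the author's implementation.

```typescript
// Hypothetical shape for a parsed task; not defined in the original article.
interface ParsedTask {
  title: string;
  line: number; // zero-based line index within TASKS.md
}

// Return the first unchecked top-level item ("- [ ] ..."), or null if none.
// Indented sub-items are ignored so the parent task is what gets picked up.
function parseFirstIncomplete(markdown: string): ParsedTask | null {
  const lines = markdown.split("\n");
  const index = lines.findIndex((l) => /^- \[ \] \S/.test(l));
  if (index === -1) return null;

  const raw = (lines[index] ?? "").slice("- [ ] ".length);
  // Drop a trailing "(started: ...)" annotation if present.
  const title = raw.replace(/\s*\(started:[^)]*\)\s*$/, "").trim();
  return { title, line: index };
}

// Usage with a shortened version of the TASKS.md sample above.
const sampleTasks = [
  "# Tasks",
  "",
  "## In Progress",
  "- [ ] Implement user authentication (started: 2026-01-28 10:00)",
  "  - [x] Create login endpoint",
  "  - [ ] Implement token refresh",
  "",
  "## Completed",
  "- [x] Set up database schema (completed: 2026-01-27)",
].join("\n");

const nextTask = parseFirstIncomplete(sampleTasks);
```

Keeping the line index makes it straightforward to flip the checkbox later without re-parsing the whole file.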

progress.txt Pattern

Log session activity for next-agent context:

# Progress Log

## Current Status (Updated: 2026-01-28 10:30)
- Active Task: Implement user authentication
- Last Completed: Token refresh endpoint
- Blockers: None

## Recent Activity

### 2026-01-28 10:30 - Token refresh endpoint
- What: Implemented /auth/refresh with JWT rotation
- Files: src/routes/auth.ts, src/services/token.ts
- Outcome: Success
- Next: Add rate limiting to auth endpoints

Implementation

async function runRALPHIteration(): Promise<void> {
  // 1. Load state from files
  const tasks = await readFile("TASKS.md", "utf-8");
  const progress = await readFile("progress.txt", "utf-8");

  // 2. Parse current task
  const currentTask = parseFirstIncomplete(tasks);

  // 3. Execute task
  const result = await executeTask(currentTask);

  // 4. Persist state back to files
  await updateTaskStatus(currentTask.id, result.status);
  await appendProgressEntry(result);

  // 5. Commit to git
  // Template literal so the title is interpolated; the title must be shell-safe
  await exec(`git add -A && git commit -m "Progress: ${currentTask.title}"`);
}

Git as Durability Layer

File-based memory becomes highly durable when combined with git:

# After each significant action
git add TASKS.md progress.txt
git commit -m "[progress]: Completed login endpoint"

# Recovery after crash
git log --oneline -5           # See recent progress
git diff HEAD~1 TASKS.md       # See what changed
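
When commit messages include text produced by the agent, building a shell string can break on quotes or run unintended commands. One option, sketched here under the assumption that `git` is on the PATH, is to pass arguments as an array through Node's `execFileSync`, which does not invoke a shell. The `commitArgs` and `commitCheckpoint` names are hypothetical.

```typescript
import { execFileSync } from "node:child_process";

// Build argument vectors for `git add` and `git commit`.
// The message travels as a single argument, so quotes and "&&" stay literal.
function commitArgs(files: string[], message: string): { add: string[]; commit: string[] } {
  return {
    add: ["add", "--", ...files],
    commit: ["commit", "-m", message],
  };
}

// Stage the given files and commit them in the repository at `cwd`.
// execFileSync throws if git exits non-zero (for example, nothing to commit).
function commitCheckpoint(files: string[], message: string, cwd: string): void {
  const args = commitArgs(files, message);
  execFileSync("git", args.add, { cwd });
  execFileSync("git", args.commit, { cwd });
}

// The argument vectors can be inspected without touching a repository.
const riskyMessage = 'Progress: fix "login" && rm -rf /';
const builtArgs = commitArgs(["TASKS.md", "progress.txt"], riskyMessage);
```

Callers that checkpoint after every action may want to catch the "nothing to commit" failure instead of treating it as an error.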

Benefits

Human-readable and editable
Version-controlled history
Survives process crashes
Works with RALPH Loop pattern
No external dependencies

Limitations

No built-in concurrency handling
Query capability limited to grep/search
State reconstruction requires parsing
No transactional guarantees
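
The lack of transactional guarantees can be partly mitigated for single-file updates. A common approach, sketched below, is to write to a temporary file in the same directory and rename it over the target, so a crash mid-write leaves the previous version in place. The `writeFileAtomic` helper is hypothetical. The sketch relies on rename replacing the target in one step on POSIX filesystems, and it does not fsync, so it is not a full durability guarantee.

```typescript
import { mkdtempSync, readdirSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Write new contents to a sibling temp file, then rename it over the target.
// Readers see either the old file or the new one, never a half-written file.
function writeFileAtomic(path: string, contents: string): void {
  const tmpPath = join(dirname(path), `.${process.pid}.${Date.now()}.tmp`);
  writeFileSync(tmpPath, contents, "utf-8");
  renameSync(tmpPath, path);
}

// Usage: update a TASKS.md twice inside a scratch directory and read it back.
const scratchDir = mkdtempSync(join(tmpdir(), "agent-memory-"));
const tasksPath = join(scratchDir, "TASKS.md");
writeFileAtomic(tasksPath, "# Tasks\n- [ ] First task\n");
writeFileAtomic(tasksPath, "# Tasks\n- [x] First task\n");
const tasksAfterWrite = readFileSync(tasksPath, "utf-8");
const filesInScratchDir = readdirSync(scratchDir);
```

This protects one file at a time; keeping TASKS.md and progress.txt consistent with each other still depends on the git commit that follows.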

Tier 3: Event-Sourced State

Event sourcing treats state as derived from an append-only log of events. This is the pattern recommended by the 12 Factor Agents framework (Factors 5 and 6).

Core Concept

Instead of storing current state, store the sequence of events that produced it:

Events:                          →  Derived State:
┌────────────────────────────┐      ┌─────────────────────┐
│ 1. task_started            │      │ status: "paused"    │
│ 2. tool_called(read_file)  │      │ step: 1             │
│ 3. tool_result(success)    │  →   │ approvals: 0        │
│ 4. tool_called(edit_file)  │      │ errors: 0           │
│ 5. approval_requested      │      │ canResume: true     │
│ 6. (waiting...)            │      │ pendingApproval: {} │
└────────────────────────────┘      └─────────────────────┘

Implementation

interface AgentThread {
  id: string;
  events: AgentEvent[];
  status: "running" | "paused" | "completed" | "failed";
}

type AgentEvent =
  | { type: "task_started"; task: string; timestamp: Date }
  | { type: "tool_called"; tool: string; params: unknown; timestamp: Date }
  | { type: "tool_result"; result: unknown; timestamp: Date }
  | { type: "approval_requested"; action: string; timestamp: Date }
  | { type: "approval_granted"; by: string; timestamp: Date }
  | { type: "approval_denied"; by: string; reason: string; timestamp: Date }
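  | { type: "error"; error: string; stack?: string; timestamp: Date }
  | { type: "task_completed"; result: unknown; timestamp: Date }
  // Lifecycle events pushed by pause() and resume()
  | { type: "paused"; reason: string; timestamp: Date }
  | { type: "resumed"; timestamp: Date }
  | { type: "human_feedback"; content: string; timestamp: Date };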

// State is DERIVED from events, never stored directly
function deriveState(thread: AgentThread): ExecutionState {
  const events = thread.events;

  return {
    currentStep: events.filter(e => e.type === "tool_result").length,
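    // An approval is pending until a later event in the log resolves it.
    // Log position is used instead of timestamps, which can tie within a millisecond.
    pendingApprovals: events.filter((e, i) =>
      e.type === "approval_requested" &&
      !events.slice(i + 1).some(a =>
        a.type === "approval_granted" || a.type === "approval_denied"
      )
    ),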
    errors: events.filter(e => e.type === "error"),
    lastEvent: events[events.length - 1],
    canResume: thread.status === "paused",
  };
}

Checkpoint/Resume Pattern

The key pattern for human-in-the-loop workflows:

class ResumableAgent {
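  private db: ThreadStore;
  private llm: LLMClient; // Hypothetical client type; run() calls this.llm.getNextAction
  // notifyHuman(threadId, toolCall) is assumed to be defined elsewhere on this class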

  async launch(task: string): Promise<AgentThread> {
    const thread: AgentThread = {
      id: crypto.randomUUID(),
      events: [{ type: "task_started", task, timestamp: new Date() }],
      status: "running",
    };

    await this.db.saveThread(thread);
    return this.run(thread);
  }

  async pause(threadId: string, reason: string): Promise<void> {
    const thread = await this.db.getThread(threadId);
    thread.events.push({
      type: "paused",
      reason,
      timestamp: new Date()
    });
    thread.status = "paused";
    await this.db.saveThread(thread);
  }

  async resume(threadId: string, feedback?: string): Promise<AgentThread> {
    const thread = await this.db.getThread(threadId);

    if (feedback) {
      thread.events.push({
        type: "human_feedback",
        content: feedback,
        timestamp: new Date()
      });
    }

    thread.events.push({
      type: "resumed",
      timestamp: new Date()
    });
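    thread.status = "running";
    // Persist the resume before running so a crash here does not lose the feedback
    await this.db.saveThread(thread);

    return this.run(thread);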
  }

  private async run(thread: AgentThread): Promise<AgentThread> {
    while (thread.status === "running") {
      // Rebuild context from events
      const state = deriveState(thread);
      const context = buildContextFromEvents(thread.events);

      // Get next action
      const toolCall = await this.llm.getNextAction(context);
      thread.events.push({
        type: "tool_called",
        tool: toolCall.name,
        params: toolCall.params,
        timestamp: new Date()
      });

      // Handle approval-required tools
      if (requiresApproval(toolCall)) {
        thread.events.push({
          type: "approval_requested",
          action: toolCall.name,
          timestamp: new Date()
        });
        thread.status = "paused";
        await this.db.saveThread(thread);

        // Notify human and return - execution stops here
        await this.notifyHuman(thread.id, toolCall);
        return thread;
      }

      // Execute tool
      const result = await executeToolCall(toolCall);
      thread.events.push({
        type: "tool_result",
        result,
        timestamp: new Date()
      });

      // Checkpoint after each tool call
      await this.db.saveThread(thread);

      // Check for completion
      if (isComplete(result)) {
        thread.events.push({
          type: "task_completed",
          result,
          timestamp: new Date()
        });
        thread.status = "completed";
      }
    }

    await this.db.saveThread(thread);
    return thread;
  }
}
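
The `run` loop above depends on `buildContextFromEvents`, which the article does not define. One possible shape is sketched below: each event is rendered as a line of text, the first event (the task) is always kept, and only the most recent events are included so the rebuilt context stays bounded. The `LoggedEvent` type, the `maxEvents` parameter, and the wording of each line are assumptions for illustration.

```typescript
// A reduced event type for this sketch (timestamps omitted for brevity).
type LoggedEvent =
  | { type: "task_started"; task: string }
  | { type: "tool_called"; tool: string; params: unknown }
  | { type: "tool_result"; result: unknown }
  | { type: "approval_requested"; action: string }
  | { type: "approval_granted"; by: string }
  | { type: "human_feedback"; content: string };

// Render one event as a single line of prompt text.
function describeEvent(event: LoggedEvent): string {
  switch (event.type) {
    case "task_started": return `Task: ${event.task}`;
    case "tool_called": return `Called ${event.tool} with ${JSON.stringify(event.params)}`;
    case "tool_result": return `Result: ${JSON.stringify(event.result)}`;
    case "approval_requested": return `Waiting for approval of ${event.action}`;
    case "approval_granted": return `Approved by ${event.by}`;
    case "human_feedback": return `Human feedback: ${event.content}`;
  }
}

// Keep the first event plus the last `maxEvents` events, noting what was left out.
function buildContextFromEvents(events: LoggedEvent[], maxEvents = 50): string {
  const [first, ...rest] = events;
  if (!first) return "";
  const recent = maxEvents > 0 ? rest.slice(-maxEvents) : [];
  const omitted = rest.length - recent.length;
  const lines = [describeEvent(first)];
  if (omitted > 0) lines.push(`(${omitted} earlier events omitted)`);
  lines.push(...recent.map(describeEvent));
  return lines.join("\n");
}

const contextPreview = buildContextFromEvents(
  [
    { type: "task_started", task: "Deploy" },
    { type: "tool_called", tool: "read_file", params: { path: "a.ts" } },
    { type: "tool_result", result: "ok" },
    { type: "human_feedback", content: "go" },
  ],
  2,
);
```

Because the context is rebuilt from the log on every iteration, a resumed agent sees the same history whether or not the process restarted in between.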

Webhook Integration

Enable external systems to resume agents:

// Express endpoint for webhook-triggered resume
app.post("/webhook/resume/:threadId", async (req, res) => {
  const { threadId } = req.params;
  const { approved, feedback, approver } = req.body;

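  const thread = await db.getThread(threadId);

  // Only a paused thread can accept an approval decision
  if (!thread || thread.status !== "paused") {
    res.status(409).json({ status: "not_paused", threadId });
    return;
  }

  if (approved) {
    thread.events.push({
      type: "approval_granted",
      by: approver,
      timestamp: new Date(),
    });
    // Persist the approval first: resume() reloads the thread from the store
    await db.saveThread(thread);

    // Resume in background, return immediately
    agent.resume(threadId, feedback).catch(console.error);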

    res.json({ status: "resuming", threadId });
  } else {
    thread.events.push({
      type: "approval_denied",
      by: approver,
      reason: feedback,
      timestamp: new Date(),
    });
    thread.status = "failed";
    await db.saveThread(thread);

    res.json({ status: "denied", threadId });
  }
});

Benefits

Complete audit trail
Time-travel debugging (replay events)
Natural pause/resume support
State reconstruction from any point
Safer concurrent access when writers only append (whole-thread saves can still overwrite each other)
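
Time-travel debugging follows directly from the log: replaying only a prefix of the events gives the state as it was at that point. The sketch below uses a simplified state with hypothetical names (`ReplayEvent`, `ReplayState`, `replayState`) rather than the article's `deriveState`.

```typescript
// Simplified event and state shapes for this sketch.
type ReplayEvent = { type: string };

interface ReplayState {
  step: number;            // number of tool results seen so far
  errors: number;          // number of error events seen so far
  lastType: string | null; // type of the most recent event
}

// Fold the first `upTo` events into a state; defaults to the whole log.
function replayState(events: ReplayEvent[], upTo: number = events.length): ReplayState {
  return events.slice(0, upTo).reduce<ReplayState>(
    (state, event) => ({
      step: state.step + (event.type === "tool_result" ? 1 : 0),
      errors: state.errors + (event.type === "error" ? 1 : 0),
      lastType: event.type,
    }),
    { step: 0, errors: 0, lastType: null },
  );
}

const replayLog: ReplayEvent[] = [
  { type: "task_started" },
  { type: "tool_called" },
  { type: "tool_result" },
  { type: "error" },
  { type: "tool_called" },
  { type: "tool_result" },
];

const stateAfterThree = replayState(replayLog, 3); // state before the error occurred
const stateAtEnd = replayState(replayLog);
```

Comparing the state just before and just after a suspicious event is often enough to locate where a run went wrong.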

Limitations

More complex implementation
Requires external storage (database, file, etc.)
Event schema evolution needs care
Higher storage requirements
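
The external storage does not have to be a database. As one low-dependency option, the sketch below appends each event as a line of JSON to a per-thread file and rebuilds the event list by reading the file back. The `JsonlEventStore` class and its file layout are assumptions for illustration. It uses synchronous file calls for brevity and assumes thread IDs are safe to use as file names (for example, UUIDs).

```typescript
import { appendFileSync, existsSync, mkdtempSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Minimal event shape: a type, a timestamp, and any extra fields.
interface StoredEvent {
  type: string;
  timestamp: Date;
  [key: string]: unknown;
}

class JsonlEventStore {
  private readonly dir: string;

  constructor(dir: string) {
    this.dir = dir;
  }

  private pathFor(threadId: string): string {
    return join(this.dir, `${threadId}.jsonl`);
  }

  // Append one event as a single JSON line; existing lines are never rewritten.
  append(threadId: string, event: StoredEvent): void {
    appendFileSync(this.pathFor(threadId), JSON.stringify(event) + "\n", "utf-8");
  }

  // Read every line back, turning ISO timestamp strings into Date objects again.
  load(threadId: string): StoredEvent[] {
    const path = this.pathFor(threadId);
    if (!existsSync(path)) return [];
    return readFileSync(path, "utf-8")
      .split("\n")
      .filter((line) => line.trim() !== "")
      .map((line) => {
        const raw = JSON.parse(line) as { type: string; timestamp: string };
        return { ...raw, timestamp: new Date(raw.timestamp) };
      });
  }
}

// Usage: append two events to a scratch directory and load them back.
const eventStore = new JsonlEventStore(mkdtempSync(join(tmpdir(), "agent-events-")));
eventStore.append("thread-1", { type: "task_started", timestamp: new Date("2026-01-28T10:00:00Z"), task: "demo" });
eventStore.append("thread-1", { type: "tool_called", timestamp: new Date("2026-01-28T10:00:05Z"), tool: "read_file" });
const loadedEvents = eventStore.load("thread-1");
```

Note that `JSON.stringify` turns `Date` values into strings, so any store that round-trips through JSON needs a revival step like the one in `load`.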

Combining Memory Tiers

Production agents often combine tiers:

class ProductionAgent {
  // Tier 3: Event log for audit and resume
  private eventStore: EventStore;

  // Tier 2: File-based for human visibility
  private progressFile: string = "progress.txt";
  private tasksFile: string = "TASKS.md";

  async afterToolCall(thread: AgentThread, toolCall: ToolCall): Promise<void> {
    // Append to event store (Tier 3)
    await this.eventStore.append(thread.id, {
      type: "tool_called",
      tool: toolCall.name,
      params: toolCall.params, // Needed to replay or audit the call later
      timestamp: new Date(),
    });

    // Update progress file (Tier 2)
    await appendFile(this.progressFile,
      `### ${new Date().toISOString()}\n- Tool: ${toolCall.name}\n`
    );

    // Commit checkpoint to git
    if (isSignificantAction(toolCall)) {
      await exec(`git add ${this.progressFile} && git commit -m "Progress: ${toolCall.name}"`);
    }
  }
}

Memory Patterns for RALPH Loop

The RALPH Loop specifically uses Tier 2 (file-based) memory because:

Each iteration spawns a fresh agent (no session state)
Memory must survive process boundaries
Humans need to inspect and edit state
Git provides durability and history

Implementing RALPH-Compatible Memory

interface RALPHState {
  currentTask: string | null;
  completedTasks: string[];
  blockers: string[];
  recentActivity: ActivityEntry[];
}

async function loadRALPHState(): Promise<RALPHState> {
  const tasks = await readFile("TASKS.md", "utf-8");
  const progress = await readFile("progress.txt", "utf-8");

  return {
    currentTask: parseCurrentTask(tasks),
    completedTasks: parseCompletedTasks(tasks),
    blockers: parseBlockers(progress), // Blockers are recorded in progress.txt
    recentActivity: parseRecentActivity(progress),
  };
}

async function saveRALPHState(state: RALPHState): Promise<void> {
  // Update TASKS.md with task status
  await updateTasksFile(state.completedTasks, state.currentTask);

  // Append to progress.txt
  await appendProgressEntry(state.recentActivity[0]);

  // Commit to git
  await exec("git add TASKS.md progress.txt && git commit -m 'RALPH iteration complete'");
}

Best Practices

1. Checkpoint After Every Tool Call

// Good: Checkpoint after each action
for (const toolCall of toolCalls) {
  const result = await execute(toolCall);
  await checkpoint(thread);  // Survives crash
}

// Bad: Checkpoint only at end
for (const toolCall of toolCalls) {
  const result = await execute(toolCall);
}
await checkpoint(thread);  // Crash loses all progress

2. Separate Event Log from Derived State

// Good: Events are source of truth
const events = await loadEvents(threadId);
const state = deriveState(events);

// Bad: Store both (can become inconsistent)
const { events, state } = await loadThread(threadId);

3. Include Enough Context for Resume

// Good: Event contains all needed context
{
  type: "tool_called",
  tool: "edit_file",
  params: { path: "src/auth.ts", content: "..." },
  context: { reason: "Adding JWT validation" },
  timestamp: new Date(),
}

// Bad: Event lacks context
{
  type: "tool_called",
  tool: "edit_file",
}

4. Handle Stale Checkpoints

async function resume(threadId: string): Promise<AgentThread> {
  const thread = await db.getThread(threadId);
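  // An empty log is treated as stale; getTime() converts the Date to milliseconds
  const lastEvent = thread.events.at(-1);
  const age = lastEvent ? Date.now() - lastEvent.timestamp.getTime() : Infinity;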

  if (age > STALE_THRESHOLD) {
    // Checkpoint is old - re-verify state
    const currentState = await verifyExternalState();
    thread.events.push({
      type: "state_verified",
      verified: currentState,
      timestamp: new Date(),
    });
  }

  return run(thread);
}

5. Version Your Event Schema

interface AgentEvent {
  schemaVersion: number;  // Increment on breaking changes (current version: 2)
  type: string;
  timestamp: Date;
  // ... rest of event
}

function migrateEvent(event: AgentEvent): AgentEvent {
  if (event.schemaVersion === 1) {
    // Migrate v1 to v2
    return { ...event, schemaVersion: 2, newField: defaultValue };
  }
  return event;
}

Common Pitfalls

Pitfall 1: Storing Derived State

// Bad: Storing derived state leads to inconsistency
await db.save({
  events: [...events, newEvent],
  state: { ...state, stepCount: state.stepCount + 1 },  // Can drift
});

// Good: Persist only events, then derive state from the updated log
const updatedEvents = [...events, newEvent];
await db.save({ events: updatedEvents });
const state = deriveState(updatedEvents);

Pitfall 2: Missing Failure Events

// Bad: Errors not captured in event log
try {
  await executeToolCall(toolCall);
} catch (error) {
  console.error(error);  // Lost
}

// Good: Errors are first-class events
try {
  await executeToolCall(toolCall);
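} catch (error) {
  // Caught values are not guaranteed to be Error instances
  const err = error instanceof Error ? error : new Error(String(error));
  thread.events.push({
    type: "error",
    error: err.message,
    stack: err.stack,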
    timestamp: new Date(),
  });
}

Pitfall 3: Unbounded Event Logs

// Bad: Events grow forever
thread.events.push(newEvent);

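// Good: Compact old events periodically
if (thread.events.length > MAX_EVENTS) {
  const cutoff = thread.events.length - RECENT_EVENT_COUNT;
  // Snapshot only the events being dropped, so that replaying
  // snapshot + recent events does not count the recent ones twice
  const snapshot = deriveState({ ...thread, events: thread.events.slice(0, cutoff) });
  thread.events = [
    { type: "snapshot", state: snapshot, timestamp: new Date() },
    ...thread.events.slice(cutoff),
  ];
}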
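
Compaction only works if replay understands snapshots. The sketch below shows one way the two pieces can fit together, using a deliberately small state of two counters: the snapshot summarizes exactly the events that are dropped, and the fold restarts from a snapshot whenever it meets one. All names here (`Counters`, `CompactableEvent`, `foldEvents`, `compactEvents`) are hypothetical.

```typescript
// A deliberately small derived state for this sketch.
interface Counters {
  steps: number;
  errors: number;
}

type CompactableEvent =
  | { type: "snapshot"; state: Counters }
  | { type: "tool_result" }
  | { type: "error" };

// Replay events into counters; a snapshot replaces the state accumulated so far.
function foldEvents(events: CompactableEvent[]): Counters {
  let state: Counters = { steps: 0, errors: 0 };
  for (const event of events) {
    if (event.type === "snapshot") {
      state = { ...event.state };
    } else if (event.type === "tool_result") {
      state = { ...state, steps: state.steps + 1 };
    } else {
      state = { ...state, errors: state.errors + 1 };
    }
  }
  return state;
}

// Replace everything except the last `keepRecent` events with one snapshot.
// The snapshot covers only the dropped prefix, so nothing is counted twice.
function compactEvents(events: CompactableEvent[], keepRecent: number): CompactableEvent[] {
  if (events.length <= keepRecent) return events;
  const cutoff = events.length - keepRecent;
  const snapshot = foldEvents(events.slice(0, cutoff));
  return [{ type: "snapshot", state: snapshot }, ...events.slice(cutoff)];
}

const fullLog: CompactableEvent[] = [
  { type: "tool_result" },
  { type: "tool_result" },
  { type: "tool_result" },
  { type: "error" },
  { type: "tool_result" },
];
const compactedLog = compactEvents(fullLog, 2);
const stateFromFull = foldEvents(fullLog);
const stateFromCompacted = foldEvents(compactedLog);
```

The useful invariant is that folding the compacted log gives the same state as folding the full log. If the dropped events matter for auditing, they can be archived elsewhere before being removed.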

Pitfall 4: Blocking on Human Approval

// Bad: Process blocks waiting for human
const approval = await waitForApproval(toolCall);  // Blocks indefinitely

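// Good: Persist, notify, and exit; resume via webhook
thread.status = "paused";
await db.saveThread(thread);      // Save first so the webhook always finds the paused thread
await requestApproval(toolCall);  // Sends the request without waiting for the answer
return thread;  // Process exits, webhook resumes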

Related

12 Factor Agents – Factors 5 and 6 define event-based state and pause/resume
The RALPH Loop – File-based memory for fresh-context iterations
Clean Slate Trajectory Recovery – When to abandon session state
Institutional Memory with Learning Files – Long-term decision memory
Event Sourcing for Agents – Deep dive on event-sourced architecture
Human in the Loop Patterns – Approval workflows using checkpoint/resume

References

HumanLayer: 12 Factor Agents – Original 12 Factor Agents article
Event Sourcing Pattern – Martin Fowler on event sourcing
Anthropic: Effective Harnesses – Long-running agent patterns
