Agent Memory Patterns: Checkpoint, Resume, and State Persistence

James Phoenix

Summary

AI agents are fundamentally stateless: every conversation starts fresh, and every context window eventually fills up or is discarded. Agent memory patterns solve this by externalizing state to durable storage, which enables checkpoint/resume workflows, human-in-the-loop approvals, and fault-tolerant execution. This article covers three memory tiers (session, file-based, event-sourced) and their implementation patterns.

The Problem

LLMs have no inherent memory between sessions. Within a session, context accumulates until (a) it exceeds the context window, (b) it degrades through attention dilution, or (c) the session terminates. For production agents, this creates three core challenges:

Challenge 1: Human-in-the-loop workflows

An agent needs approval before deploying to production. The approval takes 30 minutes. The agent cannot wait 30 minutes with an open API connection. How does it resume after approval?

Challenge 2: Fault tolerance

An agent is 80% through a complex task when the process crashes. Without checkpointing, all progress is lost. The agent must restart from scratch, potentially duplicating work or creating inconsistent state.

Challenge 3: Context rot

After extended work, context quality degrades. The RALPH Loop pattern solves this by spawning fresh agents, but fresh agents have no memory of what the previous agent accomplished.

The Solution

Externalize agent state to durable storage using three complementary memory tiers:

┌─────────────────────────────────────────────────────────────────┐
│                    MEMORY HIERARCHY                              │
├─────────────────────────────────────────────────────────────────┤
│ Tier 1: Session State                                           │
│ - In-memory during execution                                     │
│ - Lost when session ends                                         │
│ - Fast, but ephemeral                                            │
├─────────────────────────────────────────────────────────────────┤
│ Tier 2: File-Based Memory                                        │
│ - TASKS.md, progress.txt, ERRORS.md                              │
│ - Survives session boundaries                                    │
│ - Git-committed, human-readable                                  │
├─────────────────────────────────────────────────────────────────┤
│ Tier 3: Event-Sourced State                                      │
│ - Append-only event log                                          │
│ - Full history reconstruction                                    │
│ - Supports time-travel debugging                                 │
└─────────────────────────────────────────────────────────────────┘

Each tier serves different durability and query needs. Most production agents use a combination.

Tier 1: Session State

Session state lives in memory during agent execution. It’s the default mode for simple agents.

Implementation

interface SessionState {
  taskId: string;
  startedAt: Date;
  currentStep: number;
  toolCallHistory: ToolCall[];
  pendingApproval: Approval | null;
  errors: Error[];
}

async function runAgent(task: string): Promise<Result<string>> {
  const state: SessionState = {
    taskId: crypto.randomUUID(),
    startedAt: new Date(),
    currentStep: 0,
    toolCallHistory: [],
    pendingApproval: null,
    errors: [],
  };

  while (state.currentStep < MAX_STEPS) {
    const toolCall = await llm.getNextAction(state);
    state.toolCallHistory.push(toolCall);

    if (requiresApproval(toolCall)) {
      // Session state cannot survive here
      // This is where Tier 2 or 3 becomes necessary
      state.pendingApproval = await requestApproval(toolCall);
    }

    const result = await executeToolCall(toolCall);
    state.currentStep++;
  }

  return { success: true, data: "completed" };
}

Limitations

  • Lost on process termination
  • Cannot pause for human approval
  • No fault tolerance
  • No progress visibility for long-running tasks

When to Use

  • Single-turn interactions
  • Sub-agents with scoped, quick tasks
  • Development and debugging

Tier 2: File-Based Memory

File-based memory persists state to the filesystem, enabling cross-session continuity. This is the pattern used by the RALPH Loop.

Core Files

project/
├── TASKS.md           # Task queue and completion status
├── progress.txt       # Session logs and recent activity
├── ERRORS.md          # Persistent error memory
├── LEARNINGS.md       # Accumulated insights
└── features.json      # Milestone tracking

TASKS.md Pattern

Track work items with completion status:

# Tasks

## In Progress
- [ ] Implement user authentication (started: 2026-01-28 10:00)
  - [x] Create login endpoint
  - [x] Add password hashing
  - [ ] Implement token refresh

## Completed
- [x] Set up database schema (completed: 2026-01-27)
- [x] Create user model (completed: 2026-01-27)

progress.txt Pattern

Log session activity for next-agent context:

# Progress Log

## Current Status (Updated: 2026-01-28 10:30)
- Active Task: Implement user authentication
- Last Completed: Token refresh endpoint
- Blockers: None

## Recent Activity

### 2026-01-28 10:30 - Token refresh endpoint
- What: Implemented /auth/refresh with JWT rotation
- Files: src/routes/auth.ts, src/services/token.ts
- Outcome: Success
- Next: Add rate limiting to auth endpoints

Implementation

async function runRALPHIteration(): Promise<void> {
  // 1. Load state from files
  const tasks = await readFile("TASKS.md", "utf-8");
  const progress = await readFile("progress.txt", "utf-8");

  // 2. Parse current task
  const currentTask = parseFirstIncomplete(tasks);

  // 3. Execute task
  const result = await executeTask(currentTask);

  // 4. Persist state back to files
  await updateTaskStatus(currentTask.id, result.status);
  await appendProgressEntry(result);

  // 5. Commit to git
  // Template literal so the title is interpolated; assumes it contains no shell metacharacters
  await exec(`git add -A && git commit -m "Progress: ${currentTask.title}"`);
}
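
The iteration above relies on `parseFirstIncomplete`, which is not shown. A minimal sketch follows, assuming the TASKS.md layout from the earlier example: unindented checkbox items are tasks and indented ones are subtasks. The `Task` shape and the use of the title as the identifier are assumptions made for illustration.

```typescript
// Hypothetical shape; the article does not define what a parsed task looks like.
interface Task {
  id: string; // assumption: the title doubles as the identifier
  title: string;
  subtasks: { title: string; done: boolean }[];
}

// Returns the first unchecked top-level task in a TASKS.md-style document,
// or null when every top-level task is checked off.
function parseFirstIncomplete(markdown: string): Task | null {
  let current: Task | null = null;

  for (const line of markdown.split(/\r?\n/)) {
    const match = /^(\s*)- \[( |x|X)\] (.+)$/.exec(line);
    if (!match) continue; // headings, blank lines, prose

    const indent = match[1].length;
    const done = match[2] !== " ";
    // Drop trailing metadata such as "(started: 2026-01-28 10:00)".
    const title = match[3].replace(/\s*\((started|completed):[^)]*\)\s*$/, "");

    if (indent === 0) {
      if (current) return current; // reached the next top-level item
      if (!done) current = { id: title, title, subtasks: [] };
    } else if (current) {
      current.subtasks.push({ title, done });
    }
  }
  return current;
}

const sampleTasks = [
  "# Tasks",
  "",
  "## In Progress",
  "- [ ] Implement user authentication (started: 2026-01-28 10:00)",
  "  - [x] Create login endpoint",
  "  - [x] Add password hashing",
  "  - [ ] Implement token refresh",
  "",
  "## Completed",
  "- [x] Set up database schema (completed: 2026-01-27)",
].join("\n");

const firstTask = parseFirstIncomplete(sampleTasks);
```

Keeping the parser tolerant of headings and free text means humans can keep editing the file by hand without breaking the loop.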

Git as Durability Layer

File-based memory gains history and crash recovery when combined with git (push to a remote if it must also survive loss of the local disk):

# After each significant action
git add TASKS.md progress.txt
git commit -m "[progress]: Completed login endpoint"

# Recovery after crash
git log --oneline -5           # See recent progress
git diff HEAD~1 TASKS.md       # See what changed

Benefits

  • Human-readable and editable
  • Version-controlled history
  • Survives process crashes
  • Works with RALPH Loop pattern
  • No external dependencies

Limitations

  • No built-in concurrency handling
  • Query capability limited to grep/search
  • State reconstruction requires parsing
  • No transactional guarantees
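
The lack of transactional guarantees can be softened for single files. One common approach, sketched here under the assumption of Node.js and a temp file on the same filesystem as the target, is to write to a temporary file and rename it into place, so a crash mid-write leaves either the old file or the new one. The helper name `writeFileAtomic` is illustrative, and the sketch skips `fsync`, which a stricter version would add. It does not address concurrent writers.

```typescript
import { mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { basename, dirname, join } from "node:path";

// Write the new contents next to the target, then swap them in with a rename.
// Readers see either the previous file or the complete new one, never a partial write.
function writeFileAtomic(path: string, contents: string): void {
  const tmpPath = join(dirname(path), `.${basename(path)}.${process.pid}.tmp`);
  writeFileSync(tmpPath, contents, "utf-8");
  renameSync(tmpPath, path);
}

// Usage: update a task file twice in a scratch directory.
const scratchDir = mkdtempSync(join(tmpdir(), "agent-memory-"));
const tasksPath = join(scratchDir, "TASKS.md");
writeFileAtomic(tasksPath, "- [ ] Implement token refresh\n");
writeFileAtomic(tasksPath, "- [x] Implement token refresh\n");
const finalContents = readFileSync(tasksPath, "utf-8");
```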

Tier 3: Event-Sourced State

Event sourcing treats state as derived from an append-only log of events. This is the pattern recommended by the 12 Factor Agents framework (Factors 5 and 6).

Core Concept

Instead of storing current state, store the sequence of events that produced it:

Events:                             Derived State:
┌────────────────────────────┐      ┌─────────────────────┐
│ 1. task_started            │      │ status: "paused"    │
│ 2. tool_called(read_file)  │      │ step: 3             │
│ 3. tool_result(success)    │      │ approvals: 0        │
│ 4. tool_called(edit_file)  │      │ errors: 0           │
│ 5. approval_requested      │      │ canResume: true     │
│ 6. (waiting...)            │      │ pendingApproval: {} │
└────────────────────────────┘      └─────────────────────┘

Implementation

interface AgentThread {
  id: string;
  events: AgentEvent[];
  status: "running" | "paused" | "completed" | "failed";
}

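type AgentEvent =
  | { type: "task_started"; task: string; timestamp: Date }
  | { type: "tool_called"; tool: string; params: unknown; timestamp: Date }
  | { type: "tool_result"; result: unknown; timestamp: Date }
  | { type: "approval_requested"; action: string; timestamp: Date }
  | { type: "approval_granted"; by: string; timestamp: Date }
  | { type: "approval_denied"; by: string; reason: string; timestamp: Date }
  // Lifecycle and bookkeeping events used by the examples later in this article
  | { type: "paused"; reason: string; timestamp: Date }
  | { type: "resumed"; timestamp: Date }
  | { type: "human_feedback"; content: string; timestamp: Date }
  | { type: "state_verified"; verified: unknown; timestamp: Date }
  | { type: "snapshot"; state: ExecutionState; timestamp: Date }
  | { type: "error"; error: string; stack?: string; timestamp: Date }
  | { type: "task_completed"; result: unknown; timestamp: Date };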

// State is DERIVED from events, never stored directly
function deriveState(thread: AgentThread): ExecutionState {
  const events = thread.events;

  return {
    currentStep: events.filter(e => e.type === "tool_result").length,
    pendingApprovals: events.filter(e =>
      e.type === "approval_requested" &&
      !events.find(a =>
        (a.type === "approval_granted" || a.type === "approval_denied") &&
        a.timestamp > e.timestamp
      )
    ),
    errors: events.filter(e => e.type === "error"),
    lastEvent: events[events.length - 1],
    canResume: thread.status === "paused",
  };
}
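
To make the derivation concrete, here is a self-contained variant run against a short log. It is a sketch, not the article's `deriveState`: timestamps are omitted for brevity, and approvals are matched by position in the log instead of by timestamp, on the assumption that log order is authoritative. The names `ReplayEvent`, `DerivedView`, and `deriveFromLog` are illustrative.

```typescript
type ThreadStatus = "running" | "paused" | "completed" | "failed";

type ReplayEvent =
  | { type: "task_started"; task: string }
  | { type: "tool_called"; tool: string }
  | { type: "tool_result"; result: unknown }
  | { type: "approval_requested"; action: string }
  | { type: "approval_granted"; by: string }
  | { type: "approval_denied"; by: string; reason: string }
  | { type: "error"; error: string };

interface DerivedView {
  currentStep: number;        // number of completed tool calls
  pendingApprovals: string[]; // actions still waiting for a decision
  errorCount: number;
  canResume: boolean;
}

// Fold over the log once; nothing here is stored, it is recomputed on demand.
function deriveFromLog(events: ReplayEvent[], threadStatus: ThreadStatus): DerivedView {
  const pendingApprovals: string[] = [];
  let currentStep = 0;
  let errorCount = 0;

  for (const e of events) {
    switch (e.type) {
      case "tool_result":
        currentStep++;
        break;
      case "approval_requested":
        pendingApprovals.push(e.action);
        break;
      case "approval_granted":
      case "approval_denied":
        pendingApprovals.shift(); // a decision resolves the oldest open request
        break;
      case "error":
        errorCount++;
        break;
    }
  }
  return { currentStep, pendingApprovals, errorCount, canResume: threadStatus === "paused" };
}

// One completed tool call, then a second call that is waiting for approval.
const exampleLog: ReplayEvent[] = [
  { type: "task_started", task: "update auth module" },
  { type: "tool_called", tool: "read_file" },
  { type: "tool_result", result: "ok" },
  { type: "tool_called", tool: "edit_file" },
  { type: "approval_requested", action: "edit_file" },
];
const derivedView = deriveFromLog(exampleLog, "paused");
```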

Checkpoint/Resume Pattern

The key pattern for human-in-the-loop workflows:

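class ResumableAgent {
  constructor(
    private db: ThreadStore,
    private llm: LLMClient, // assumed to expose getNextAction(context)
    private notifyHuman: (threadId: string, toolCall: ToolCall) => Promise<void>,
  ) {}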

  async launch(task: string): Promise<AgentThread> {
    const thread: AgentThread = {
      id: crypto.randomUUID(),
      events: [{ type: "task_started", task, timestamp: new Date() }],
      status: "running",
    };

    await this.db.saveThread(thread);
    return this.run(thread);
  }

  async pause(threadId: string, reason: string): Promise<void> {
    const thread = await this.db.getThread(threadId);
    thread.events.push({
      type: "paused",
      reason,
      timestamp: new Date()
    });
    thread.status = "paused";
    await this.db.saveThread(thread);
  }

  async resume(threadId: string, feedback?: string): Promise<AgentThread> {
    const thread = await this.db.getThread(threadId);

    if (feedback) {
      thread.events.push({
        type: "human_feedback",
        content: feedback,
        timestamp: new Date()
      });
    }

    thread.events.push({
      type: "resumed",
      timestamp: new Date()
    });
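    thread.status = "running";
    // Checkpoint the feedback and resume events before handing control to the loop
    await this.db.saveThread(thread);

    return this.run(thread);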
  }

  private async run(thread: AgentThread): Promise<AgentThread> {
    while (thread.status === "running") {
      // Rebuild context from events
      const state = deriveState(thread);
      const context = buildContextFromEvents(thread.events);

      // Get next action
      const toolCall = await this.llm.getNextAction(context);
      thread.events.push({
        type: "tool_called",
        tool: toolCall.name,
        params: toolCall.params,
        timestamp: new Date()
      });

      // Handle approval-required tools
      if (requiresApproval(toolCall)) {
        thread.events.push({
          type: "approval_requested",
          action: toolCall.name,
          timestamp: new Date()
        });
        thread.status = "paused";
        await this.db.saveThread(thread);

        // Notify human and return - execution stops here
        await this.notifyHuman(thread.id, toolCall);
        return thread;
      }

      // Execute tool
      const result = await executeToolCall(toolCall);
      thread.events.push({
        type: "tool_result",
        result,
        timestamp: new Date()
      });

      // Checkpoint after each tool call
      await this.db.saveThread(thread);

      // Check for completion
      if (isComplete(result)) {
        thread.events.push({
          type: "task_completed",
          result,
          timestamp: new Date()
        });
        thread.status = "completed";
      }
    }

    await this.db.saveThread(thread);
    return thread;
  }
}
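
The loop calls `buildContextFromEvents`, which the article leaves undefined. One possible shape, assuming the LLM client accepts a plain string, is to serialize each event as a numbered line. The `ContextEvent` type and the line format are illustrative choices, not a prescribed prompt format.

```typescript
// Any event with a type and timestamp; remaining fields are treated as payload.
type ContextEvent = { type: string; timestamp: Date } & Record<string, unknown>;

// Render the event log as one line per event so a fresh model call
// can see everything that has happened on this thread so far.
function buildContextFromEvents(events: ContextEvent[]): string {
  return events
    .map((event, index) => {
      const { type, timestamp, ...payload } = event;
      const body = Object.keys(payload).length > 0 ? " " + JSON.stringify(payload) : "";
      return `[${index + 1}] ${timestamp.toISOString()} ${type}${body}`;
    })
    .join("\n");
}

const renderedContext = buildContextFromEvents([
  { type: "task_started", task: "deploy", timestamp: new Date("2026-01-28T10:00:00.000Z") },
  { type: "resumed", timestamp: new Date("2026-01-28T10:30:00.000Z") },
]);
```

Because the context is rebuilt from the log on every iteration, a resumed agent and an agent that never paused see the same history.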

Webhook Integration

Enable external systems to resume agents:

// Express endpoint for webhook-triggered resume
app.post("/webhook/resume/:threadId", async (req, res) => {
  const { threadId } = req.params;
  const { approved, feedback, approver } = req.body;

  const thread = await db.getThread(threadId);

  if (approved) {
    thread.events.push({
      type: "approval_granted",
      by: approver,
      timestamp: new Date(),
    });
    // Persist the approval first: resume() reloads the thread from storage
    await db.saveThread(thread);

    // Resume in background, return immediately
    agent.resume(threadId, feedback).catch(console.error);

    res.json({ status: "resuming", threadId });
  } else {
    thread.events.push({
      type: "approval_denied",
      by: approver,
      reason: feedback,
      timestamp: new Date(),
    });
    thread.status = "failed";
    await db.saveThread(thread);

    res.json({ status: "denied", threadId });
  }
});
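
After an approval arrives, the resumed loop as written asks the model for a fresh action, so the approved call is executed only if the model proposes it again. A resume path may prefer to run the approved call directly. The sketch below shows one way to find it, assuming `approval_requested` always follows the `tool_called` it refers to, as in the loop above. `findApprovedPendingCall` and its types are hypothetical names.

```typescript
type LogEvent =
  | { type: "tool_called"; tool: string; params: unknown }
  | { type: "tool_result"; result: unknown }
  | { type: "approval_requested"; action: string }
  | { type: "approval_granted"; by: string }
  | { type: "approval_denied"; by: string; reason: string }
  | { type: "resumed" };

interface ApprovedCall {
  tool: string;
  params: unknown;
}

// Returns the most recent tool call that was approved but has no result yet,
// or null if there is nothing approved and waiting to run.
function findApprovedPendingCall(events: LogEvent[]): ApprovedCall | null {
  let lastCall: ApprovedCall | null = null;
  let awaitingDecision = false;
  let approved: ApprovedCall | null = null;

  for (const e of events) {
    switch (e.type) {
      case "tool_called":
        lastCall = { tool: e.tool, params: e.params };
        awaitingDecision = false;
        approved = null;
        break;
      case "approval_requested":
        awaitingDecision = lastCall !== null;
        break;
      case "approval_granted":
        if (awaitingDecision) approved = lastCall;
        awaitingDecision = false;
        break;
      case "approval_denied":
        awaitingDecision = false;
        approved = null;
        break;
      case "tool_result":
        approved = null; // the call has already been executed
        break;
    }
  }
  return approved;
}

const approvedLog: LogEvent[] = [
  { type: "tool_called", tool: "deploy", params: { env: "production" } },
  { type: "approval_requested", action: "deploy" },
  { type: "approval_granted", by: "alice" },
  { type: "resumed" },
];
const callToRun = findApprovedPendingCall(approvedLog);
```

A resume handler could execute `callToRun` and append its `tool_result` before re-entering the main loop.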

Benefits

  • Complete audit trail
  • Time-travel debugging (replay events)
  • Natural pause/resume support
  • State reconstruction from any point
  • Safer concurrent access, provided writers append events rather than overwrite the whole thread
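
Time-travel debugging follows directly from the log: the state at any earlier point is the fold over a prefix of the events. A minimal sketch, with illustrative names (`ReplayState`, `applyEvent`, `stateAt`) and a deliberately small state shape:

```typescript
interface TimelineEvent {
  type: string;
}

interface ReplayState {
  toolResults: number;
  errors: number;
  lastType: string | null;
}

const emptyReplayState: ReplayState = { toolResults: 0, errors: 0, lastType: null };

// Pure reducer: given a state and one event, return the next state.
function applyEvent(state: ReplayState, e: TimelineEvent): ReplayState {
  return {
    toolResults: state.toolResults + (e.type === "tool_result" ? 1 : 0),
    errors: state.errors + (e.type === "error" ? 1 : 0),
    lastType: e.type,
  };
}

// State as it was after the first `upTo` events.
function stateAt(events: TimelineEvent[], upTo: number): ReplayState {
  return events.slice(0, upTo).reduce(applyEvent, emptyReplayState);
}

const timeline: TimelineEvent[] = [
  { type: "task_started" },
  { type: "tool_called" },
  { type: "tool_result" },
  { type: "tool_called" },
  { type: "error" },
];
const beforeFailure = stateAt(timeline, 4);
const atFailure = stateAt(timeline, 5);
```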

Limitations

  • More complex implementation
  • Requires external storage (database, file, etc.)
  • Event schema evolution needs care
  • Higher storage requirements

Combining Memory Tiers

Production agents often combine tiers:

class ProductionAgent {
  // Tier 3: Event log for audit and resume
  private eventStore: EventStore;

  // Tier 2: File-based for human visibility
  private progressFile: string = "progress.txt";
  private tasksFile: string = "TASKS.md";

  async afterToolCall(thread: AgentThread, toolCall: ToolCall): Promise<void> {
    // Append to event store (Tier 3)
    await this.eventStore.append(thread.id, {
      type: "tool_called",
      tool: toolCall.name,
      params: toolCall.params,
      timestamp: new Date(),
    });

    // Update progress file (Tier 2)
    await appendFile(this.progressFile,
      `### ${new Date().toISOString()}\n- Tool: ${toolCall.name}\n`
    );

    // Commit checkpoint to git
    if (isSignificantAction(toolCall)) {
      await exec(`git add ${this.progressFile} && git commit -m "Progress: ${toolCall.name}"`);
    }
  }
}
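
`ProductionAgent` depends on an `EventStore` that the article does not define. The following in-memory sketch shows one plausible interface, with an optional expected-version check so that two writers cannot both append on top of the same log position. It is synchronous for brevity; a real store would be asynchronous and backed by a database. All names here are assumptions.

```typescript
interface StoredEvent {
  type: string;
  timestamp: Date;
  [key: string]: unknown;
}

class VersionConflictError extends Error {}

class InMemoryEventStore {
  private streams = new Map<string, StoredEvent[]>();

  // Appends one event and returns the new stream length (its "version").
  // If expectedVersion is given and another writer got there first, the append is rejected.
  append(threadId: string, e: StoredEvent, expectedVersion?: number): number {
    const stream = this.streams.get(threadId) ?? [];
    if (expectedVersion !== undefined && expectedVersion !== stream.length) {
      throw new VersionConflictError(
        `expected version ${expectedVersion}, found ${stream.length}`,
      );
    }
    stream.push(e);
    this.streams.set(threadId, stream);
    return stream.length;
  }

  // Returns a copy so callers cannot mutate the stored log.
  load(threadId: string): StoredEvent[] {
    return [...(this.streams.get(threadId) ?? [])];
  }
}

const store = new InMemoryEventStore();
const versionAfterFirst = store.append("thread-1", {
  type: "task_started",
  task: "deploy",
  timestamp: new Date(),
});
const versionAfterSecond = store.append(
  "thread-1",
  { type: "tool_called", tool: "read_file", params: {}, timestamp: new Date() },
  versionAfterFirst,
);
```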

Memory Patterns for RALPH Loop

The RALPH Loop specifically uses Tier 2 (file-based) memory because:

  1. Each iteration spawns a fresh agent (no session state)
  2. Memory must survive process boundaries
  3. Humans need to inspect and edit state
  4. Git provides durability and history

Implementing RALPH-Compatible Memory

interface RALPHState {
  currentTask: string | null;
  completedTasks: string[];
  blockers: string[];
  recentActivity: ActivityEntry[];
}

async function loadRALPHState(): Promise<RALPHState> {
  const tasks = await readFile("TASKS.md", "utf-8");
  const progress = await readFile("progress.txt", "utf-8");

  return {
    currentTask: parseCurrentTask(tasks),
    completedTasks: parseCompletedTasks(tasks),
    blockers: parseBlockers(tasks),
    recentActivity: parseRecentActivity(progress),
  };
}

async function saveRALPHState(state: RALPHState): Promise<void> {
  // Update TASKS.md with task status
  await updateTasksFile(state.completedTasks, state.currentTask);

  // Append to progress.txt
  await appendProgressEntry(state.recentActivity[0]);

  // Commit to git
  await exec("git add TASKS.md progress.txt && git commit -m 'RALPH iteration complete'");
}

Best Practices

1. Checkpoint After Every Tool Call

// Good: Checkpoint after each action
for (const toolCall of toolCalls) {
  const result = await execute(toolCall);
  await checkpoint(thread);  // Survives crash
}

// Bad: Checkpoint only at end
for (const toolCall of toolCalls) {
  const result = await execute(toolCall);
}
await checkpoint(thread);  // Crash loses all progress

2. Separate Event Log from Derived State

// Good: Events are source of truth
const thread = await db.getThread(threadId);  // persists events and status only
const state = deriveState(thread);            // recomputed on every load

// Bad: Store both (can become inconsistent)
const { events, state } = await loadThread(threadId);

3. Include Enough Context for Resume

// Good: Event contains all needed context
{
  type: "tool_called",
  tool: "edit_file",
  params: { path: "src/auth.ts", content: "..." },
  context: { reason: "Adding JWT validation" },
  timestamp: new Date(),
}

// Bad: Event lacks context
{
  type: "tool_called",
  tool: "edit_file",
}

4. Handle Stale Checkpoints

async function resume(threadId: string): Promise<AgentThread> {
  const thread = await db.getThread(threadId);
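  const lastEvent = thread.events.at(-1);
  // Date arithmetic needs numbers; new Date(...) also handles timestamps revived from JSON as strings
  const age = lastEvent ? Date.now() - new Date(lastEvent.timestamp).getTime() : Infinity;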

  if (age > STALE_THRESHOLD) {
    // Checkpoint is old - re-verify state
    const currentState = await verifyExternalState();
    thread.events.push({
      type: "state_verified",
      verified: currentState,
      timestamp: new Date(),
    });
  }

  return run(thread);
}

5. Version Your Event Schema

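interface AgentEventV1 {
  schemaVersion: 1;
  type: string;
  timestamp: Date;
}

interface AgentEvent {
  schemaVersion: 2;  // Increment on breaking changes
  type: string;
  timestamp: Date;
  newField: string;  // placeholder for whatever v2 introduced
  // ... rest of event
}

const DEFAULT_NEW_FIELD = "";  // placeholder default for events written before v2

function migrateEvent(event: AgentEventV1 | AgentEvent): AgentEvent {
  if (event.schemaVersion === 1) {
    // Migrate v1 to v2
    return { ...event, schemaVersion: 2, newField: DEFAULT_NEW_FIELD };
  }
  return event;
}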

Common Pitfalls

Pitfall 1: Storing Derived State

// Bad: Storing derived state leads to inconsistency
await db.save({
  events: [...events, newEvent],
  state: { ...state, stepCount: state.stepCount + 1 },  // Can drift
});

// Good: Derive state from events
const updatedEvents = [...events, newEvent];
await db.save({ events: updatedEvents });
const state = deriveState({ ...thread, events: updatedEvents });  // includes newEvent

Pitfall 2: Missing Failure Events

// Bad: Errors not captured in event log
try {
  await executeToolCall(toolCall);
} catch (error) {
  console.error(error);  // Lost
}

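// Good: Errors are first-class events
try {
  await executeToolCall(toolCall);
} catch (error) {
  // A caught value is not guaranteed to be an Error instance
  const err = error instanceof Error ? error : new Error(String(error));
  thread.events.push({
    type: "error",
    error: err.message,
    stack: err.stack,
    timestamp: new Date(),
  });
}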

Pitfall 3: Unbounded Event Logs

// Bad: Events grow forever
thread.events.push(newEvent);

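// Good: Compact old events periodically
if (thread.events.length > MAX_EVENTS) {
  // Snapshot only the events being dropped so the kept ones are not counted twice
  const cutoff = thread.events.length - RECENT_EVENT_COUNT;
  const snapshot = deriveState({ ...thread, events: thread.events.slice(0, cutoff) });
  thread.events = [
    { type: "snapshot", state: snapshot, timestamp: new Date() },
    ...thread.events.slice(cutoff),
  ];
}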
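
Compaction is only safe if the derivation can start from a snapshot event. The sketch below illustrates that idea with a reduced state shape: the snapshot summarizes only the events being discarded, and the fold resets to the snapshot's state when it meets one. `Counts`, `deriveCounts`, and `compact` are illustrative names, not part of the article's code.

```typescript
interface Counts {
  steps: number;
  errors: number;
}

type CompactableEvent =
  | { type: "snapshot"; state: Counts }
  | { type: "tool_called" }
  | { type: "tool_result" }
  | { type: "error" };

// Fold over the log; a snapshot replaces whatever was accumulated before it.
function deriveCounts(events: CompactableEvent[]): Counts {
  let counts: Counts = { steps: 0, errors: 0 };
  for (const e of events) {
    if (e.type === "snapshot") counts = { ...e.state };
    else if (e.type === "tool_result") counts.steps++;
    else if (e.type === "error") counts.errors++;
  }
  return counts;
}

// Replace everything except the last `keepRecent` events with a single snapshot.
function compact(events: CompactableEvent[], keepRecent: number): CompactableEvent[] {
  const cut = Math.max(0, events.length - keepRecent);
  if (cut === 0) return events;
  const snapshot = deriveCounts(events.slice(0, cut));
  return [{ type: "snapshot", state: snapshot }, ...events.slice(cut)];
}

const fullLog: CompactableEvent[] = [
  { type: "tool_called" },
  { type: "tool_result" },
  { type: "error" },
  { type: "tool_called" },
  { type: "tool_result" },
  { type: "tool_called" },
  { type: "tool_result" },
];
const compactedLog = compact(fullLog, 2);
const countsBefore = deriveCounts(fullLog);
const countsAfter = deriveCounts(compactedLog);
```

The invariant worth testing in any compaction scheme is the one shown here: deriving from the compacted log gives the same result as deriving from the full log.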

Pitfall 4: Blocking on Human Approval

// Bad: Process blocks waiting for human
const approval = await waitForApproval(toolCall);  // Blocks indefinitely

// Good: Persist and exit, resume via webhook
await requestApproval(toolCall);
thread.status = "paused";
await db.save(thread);
return thread;  // Process exits, webhook resumes
