Long-Running Agent Patterns: Shell, Skills, and Compaction

James Phoenix

Production agents that run for extended periods need three primitives: reusable skills, persistent shell environments, and proactive compaction.

Source: OpenAI Engineering

Core Primitives

Long-running agents depend on three interlocking pieces:

  1. Skills – Reusable instruction bundles with SKILL.md manifests containing frontmatter metadata and workflow procedures. The model reads metadata to decide whether to invoke a skill.
  2. Hosted Shell – Container execution where agents install dependencies, run scripts, and write outputs. State persists across steps via previous_response_id.
  3. Server-Side Compaction – Automatic context management that compresses conversation history to keep long runs moving.

Pattern 1: Compaction as Default Primitive

Use compaction proactively from the start, not as an emergency fallback when context overflows.

Why this matters for long runs: Without proactive compaction, agents exhibit restart behavior. They lose track of earlier steps, re-read files they already processed, and repeat work. Making compaction a default architectural choice maintains thread coherence across dozens of tool calls.

Contrast with reactive compaction: Most teams only compact when they hit context limits (Context Rot Auto-Compacting covers the reactive case). The proactive approach means compaction runs on a schedule regardless of context size, preserving a clean working state throughout.
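The proactive schedule can be sketched as a small wrapper that compacts every N tool calls regardless of context size. This is an illustrative simulation, not a real API: the `ProactiveCompactor` class and its string summary are stand-ins for whatever compaction call your stack provides.

```python
# Sketch of proactive compaction: compact on a fixed cadence of tool calls,
# not only when the context window is nearly full.
from dataclasses import dataclass, field

@dataclass
class ProactiveCompactor:
    every_n_calls: int = 10          # compact on a schedule, regardless of size
    history: list = field(default_factory=list)
    calls_since_compact: int = 0

    def record(self, tool_call: str, result: str) -> None:
        self.history.append((tool_call, result))
        self.calls_since_compact += 1
        if self.calls_since_compact >= self.every_n_calls:
            self.compact()

    def compact(self) -> None:
        # Replace raw history with one compressed entry. Illustrative only:
        # real compaction would have the model summarize the dropped steps.
        summary = f"summary of {len(self.history)} prior steps"
        self.history = [("compacted", summary)]
        self.calls_since_compact = 0

agent = ProactiveCompactor(every_n_calls=3)
for i in range(7):
    agent.record(f"tool_{i}", "ok")
# History stays bounded instead of growing without limit.
```

Because compaction fires on a cadence rather than at a limit, the agent never reaches the degraded "restart" regime in the first place.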

Pattern 2: Container Reuse Across Steps

Reuse the same container across steps when you want stable dependencies, cached files, and intermediate outputs.

Step 1: pip install pandas matplotlib → container state saved
Step 2: Load data, generate charts   → reuses installed deps
Step 3: Write report to /mnt/data    → accesses step 2 outputs

Pass previous_response_id for continuation. This avoids the cold-start penalty of reinstalling dependencies per step and allows agents to build on intermediate results.

When to use fresh containers: When steps are independent, when you need reproducibility guarantees, or when a previous step left the environment in a broken state.

Pattern 3: Artifact Handoff Boundary

Treat a standard output location (e.g. /mnt/data) as the handoff point between agent steps and human review.

Agent writes → /mnt/data/report.pdf
Human reviews → downloads artifact
Next agent step → reads from /mnt/data/

This creates a “clean review boundary.” Artifacts are concrete deliverables (reports, cleaned datasets, generated code), not ephemeral context. The boundary makes agent work inspectable and resumable.
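One way to make the boundary concrete is a manifest alongside the artifacts, so both humans and the next agent step can enumerate deliverables without scanning the directory. This is a sketch under the assumption of a single shared root; the manifest format is invented for illustration, and the base directory is parameterized so the example runs anywhere (`/mnt/data` being the convention above).

```python
# Each step writes deliverables under one well-known root and records them
# in a manifest that the reviewer or the next step reads.
import json
import tempfile
from pathlib import Path

def write_artifact(base: Path, name: str, content: bytes) -> Path:
    base.mkdir(parents=True, exist_ok=True)
    path = base / name
    path.write_bytes(content)
    manifest = base / "manifest.json"
    entries = json.loads(manifest.read_text()) if manifest.exists() else []
    entries.append({"artifact": name, "bytes": len(content)})
    manifest.write_text(json.dumps(entries, indent=2))
    return path

def list_artifacts(base: Path) -> list[str]:
    manifest = base / "manifest.json"
    if not manifest.exists():
        return []
    return [e["artifact"] for e in json.loads(manifest.read_text())]

base = Path(tempfile.mkdtemp())          # stands in for /mnt/data
write_artifact(base, "report.pdf", b"%PDF-1.7 ...")
```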

Pattern 4: Skill Routing with Negative Examples

Skill descriptions should answer three questions:

  1. When to use this skill
  2. When NOT to use this skill
  3. What outputs to expect

The negative examples are critical. Without them, skills misfire on edge cases:

# SKILL.md

## Description
Generate quarterly sales reports from CRM data.

## Use when
- User asks for sales summaries, pipeline reports, or revenue breakdowns
- Data source is Salesforce or HubSpot

## Don't use when
- User wants marketing analytics (use marketing-report skill)
- User asks for individual deal details (use deal-lookup skill)
- Data source is a custom CSV (use data-analysis skill)

Glean’s case study: routing accuracy dropped 20% initially, then recovered after adding edge case coverage to skill descriptions.

Pattern 5: Explicit Triggering for Determinism

For production workflows with clear contracts, bypass implicit routing entirely:

"Use the `quarterly-report` skill with Q4 2025 data."

Implicit routing (model decides which skill) works for exploratory use. Explicit triggering works for production pipelines where you know exactly which skill should run. This is the difference between a chatbot and a workflow engine.
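Explicit triggering reduces to a plain registry lookup: the caller names the skill and its parameters, and no routing step runs. A minimal sketch, assuming a decorator-based registry; the `quarterly-report` handler and its signature are hypothetical.

```python
# Production callers name the skill directly, skipping model routing.
from typing import Callable

REGISTRY: dict[str, Callable[..., str]] = {}

def skill(name: str):
    def register(fn):
        REGISTRY[name] = fn
        return fn
    return register

@skill("quarterly-report")
def run_quarterly_report(quarter: str, year: int) -> str:
    return f"report for {quarter} {year}"

def trigger(name: str, **params) -> str:
    return REGISTRY[name](**params)   # deterministic: no routing decision

trigger("quarterly-report", quarter="Q4", year=2025)
```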

Pattern 6: Install, Fetch, Artifact

A three-phase pattern for deterministic deliverables:

Phase 1: Install    → Set up environment, install dependencies
Phase 2: Fetch      → Pull external data, read files, query APIs
Phase 3: Artifact   → Write concrete deliverable to disk

Each phase has a clear purpose and failure mode. If install fails, you don’t waste tokens on fetch. If fetch fails, you don’t generate a bad artifact. The artifact phase always produces something reviewable.
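The fail-fast behavior can be sketched as a pipeline where each phase runs only if the previous one succeeded. The phase bodies here are placeholders; only the control flow (install gates fetch, fetch gates artifact) reflects the pattern.

```python
# Three-phase pipeline: a failed install never burns tokens on fetch,
# and a failed fetch never produces a bad artifact.
def run_pipeline(install, fetch, build_artifact):
    env = install()
    if env is None:
        return {"failed_at": "install"}
    data = fetch(env)
    if data is None:
        return {"failed_at": "fetch"}
    return {"artifact": build_artifact(env, data)}

result = run_pipeline(
    install=lambda: {"deps": ["pandas"]},
    fetch=lambda env: {"rows": 1200},
    build_artifact=lambda env, data: "/mnt/data/report.pdf",
)
```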


Pattern 7: Two-Layer Security Allowlist

For agents with network access, use two constraint layers:

Org-level allowlist  → Maximum approved destinations (small, stable)
Request-level subset → Specific domains needed for this job (even smaller)

Never combine skills with open network access. This creates a data exfiltration path. Keep org lists small and stable. Keep request lists even smaller.

Domain secrets: If an allowed domain needs auth headers, use a sidecar that injects real credentials only for approved destinations. The model never sees raw credentials.
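The two layers compose as a set intersection: the effective allowlist for a job is the org list intersected with the request's declared needs, so a request can only narrow access, never widen it. The domain names below are invented examples.

```python
# Two-layer allowlist: request-level needs can only narrow the org list.
ORG_ALLOWLIST = {"api.salesforce.com", "api.hubspot.com", "files.internal.corp"}

def effective_allowlist(requested: set[str]) -> set[str]:
    return ORG_ALLOWLIST & requested      # intersection, never union

def is_allowed(domain: str, requested: set[str]) -> bool:
    return domain in effective_allowlist(requested)

job_domains = {"api.salesforce.com", "evil.example.com"}
is_allowed("api.salesforce.com", job_domains)   # approved at both layers
is_allowed("evil.example.com", job_domains)     # requested, but not org-approved
```

Even if a compromised skill injects extra domains into the request, the org layer caps what the intersection can contain.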

Pattern 8: Skills as Living SOPs

Skills become enterprise Standard Operating Procedures that evolve with the organization.

Glean case study: a Salesforce-oriented skill increased accuracy from 73% to 85% and reduced time-to-first-token by 18.1%. The skill encodes organizational knowledge (which fields matter, how deals are categorized, what “qualified” means in this company) that would otherwise live in tribal knowledge.

Key insight: Move templates and worked examples inside skills. They’re available exactly when needed and don’t inflate tokens for unrelated queries.

Decision Framework: Which Primitives to Combine

| Scenario | Skills | Shell | Compaction |
| --- | --- | --- | --- |
| Quick Q&A | Optional | No | No |
| Data analysis task | Yes | Yes | No |
| Multi-step workflow (10+ steps) | Yes | Yes (reuse container) | Yes |
| Long research session | Optional | Optional | Yes (proactive) |
| Production pipeline | Yes (explicit trigger) | Yes | Yes |

