Production agents that run for extended periods need three primitives: reusable skills, persistent shell environments, and proactive compaction.
Source: OpenAI Engineering
Core Primitives
Long-running agents depend on three interlocking pieces:
- Skills – Reusable instruction bundles with
SKILL.mdmanifests containing frontmatter metadata and workflow procedures. The model reads metadata to decide whether to invoke a skill. - Hosted Shell – Container execution where agents install dependencies, run scripts, and write outputs. State persists across steps via
previous_response_id. - Server-Side Compaction – Automatic context management that compresses conversation history to keep long runs moving.
Pattern 1: Compaction as Default Primitive
Use compaction proactively from the start, not as an emergency fallback when context overflows.
Why this matters for long runs: Without proactive compaction, agents exhibit restart behavior. They lose track of earlier steps, re-read files they already processed, and repeat work. Making compaction a default architectural choice maintains thread coherence across dozens of tool calls.
Contrast with reactive compaction: Most teams only compact when they hit context limits (Context Rot Auto-Compacting covers the reactive case). The proactive approach means compaction runs on a schedule regardless of context size, preserving a clean working state throughout.
Pattern 2: Container Reuse Across Steps
Reuse the same container across steps when you want stable dependencies, cached files, and intermediate outputs.
Step 1: pip install pandas matplotlib → container state saved
Step 2: Load data, generate charts → reuses installed deps
Step 3: Write report to /mnt/data → accesses step 2 outputs
Pass previous_response_id for continuation. This avoids the cold-start penalty of reinstalling dependencies per step and allows agents to build on intermediate results.
When to use fresh containers: When steps are independent, when you need reproducibility guarantees, or when a previous step left the environment in a broken state.
Pattern 3: Artifact Handoff Boundary
Treat a standard output location (e.g. /mnt/data) as the handoff point between agent steps and human review.
Agent writes → /mnt/data/report.pdf
Human reviews → downloads artifact
Next agent step → reads from /mnt/data/
This creates a “clean review boundary.” Artifacts are concrete deliverables (reports, cleaned datasets, generated code), not ephemeral context. The boundary makes agent work inspectable and resumable.
Pattern 4: Skill Routing with Negative Examples
Skill descriptions should answer three questions:
- When to use this skill
- When NOT to use this skill
- What outputs to expect
The negative examples are critical. Without them, skills misfire on edge cases:
# SKILL.md
## Description
Generate quarterly sales reports from CRM data.
## Use when
- User asks for sales summaries, pipeline reports, or revenue breakdowns
- Data source is Salesforce or HubSpot
## Don't use when
- User wants marketing analytics (use marketing-report skill)
- User asks for individual deal details (use deal-lookup skill)
- Data source is a custom CSV (use data-analysis skill)
Glean’s case study: routing accuracy dropped 20% initially, then recovered after adding edge case coverage to skill descriptions.
Pattern 5: Explicit Triggering for Determinism
For production workflows with clear contracts, bypass implicit routing entirely:
"Use the `quarterly-report` skill with Q4 2025 data."
Implicit routing (model decides which skill) works for exploratory use. Explicit triggering works for production pipelines where you know exactly which skill should run. This is the difference between a chatbot and a workflow engine.
Pattern 6: Install, Fetch, Artifact
A three-phase pattern for deterministic deliverables:
Phase 1: Install → Set up environment, install dependencies
Phase 2: Fetch → Pull external data, read files, query APIs
Phase 3: Artifact → Write concrete deliverable to disk
Each phase has a clear purpose and failure mode. If install fails, you don’t waste tokens on fetch. If fetch fails, you don’t generate a bad artifact. The artifact phase always produces something reviewable.
Pattern 7: Two-Layer Security Allowlist
For agents with network access, use two constraint layers:
Org-level allowlist → Maximum approved destinations (small, stable)
Request-level subset → Specific domains needed for this job (even smaller)
Never combine skills with open network access. This creates a data exfiltration path. Keep org lists small and stable. Keep request lists even smaller.
Domain secrets: If an allowed domain needs auth headers, use a sidecar that injects real credentials only for approved destinations. The model never sees raw credentials.
Pattern 8: Skills as Living SOPs
Skills become enterprise Standard Operating Procedures that evolve with the organization.
Glean case study: a Salesforce-oriented skill increased accuracy from 73% to 85% and reduced time-to-first-token by 18.1%. The skill encodes organizational knowledge (which fields matter, how deals are categorized, what “qualified” means in this company) that would otherwise live in tribal knowledge.
Key insight: Move templates and worked examples inside skills. They’re available exactly when needed and don’t inflate tokens for unrelated queries.
Decision Framework: Which Primitives to Combine
| Scenario | Skills | Shell | Compaction |
|---|---|---|---|
| Quick Q&A | Optional | No | No |
| Data analysis task | Yes | Yes | No |
| Multi-step workflow (10+ steps) | Yes | Yes (reuse container) | Yes |
| Long research session | Optional | Optional | Yes (proactive) |
| Production pipeline | Yes (explicit trigger) | Yes | Yes |
Related
- Context Rot Auto-Compacting – Reactive compaction strategies
- Progressive Disclosure – Skill metadata as routing layer
- Agent Memory Patterns – State persistence across sessions
- Building the Harness – Four-layer harness architecture
- Sub-Agent Architecture – Delegation patterns for complex tasks
- Ad-hoc to Scripts – Converting repeated workflows to deterministic execution
References
- OpenAI: Shell + Skills + Compaction – Original article
- Anthropic: Effective Harnesses – Complementary harness patterns

