If your skill only works with Opus, you don’t have a good skill. You have a good model compensating for bad instructions.
Author: James Phoenix | Date: March 2026
The Problem
You write a skill. You test it with your best model. It works. You ship it.
Months later, it fails on a run where the model resolves an ambiguity differently. Or a teammate uses a smaller model and gets garbage output. Or you swap to a cheaper model for cost reasons and everything breaks.
The skill was never robust. The model was just smart enough to paper over the gaps. Big models like Opus can push past broken references, resolve conflicting directions, and guess your intent when the instructions are vague. That’s great for completing the task. It’s terrible for debugging the instructions.
You don’t notice the problems because the model is clever enough to route around them.
The Pattern: Progressive Model Downgrade
Test your skills with progressively smaller models. Each downgrade exposes a new class of instruction quality issues.
Write skill
→ Test with Opus (same tier that wrote it)
→ Iterate until reliable
→ Test with Sonnet
→ Iterate until reliable
→ Test with Haiku
→ Iterate until reliable
→ (Optional) Test with smallest viable model
Each step down strips away a layer of model intelligence, forcing the instructions to carry more of the load. If Haiku can complete the skill reliably, the instructions are genuinely clear. If only Opus can do it, the instructions are leaking intent that the model has to reconstruct.
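The ladder above can be sketched as a loop over model tiers, descending only once the current tier is reliable. This is a minimal sketch, not a real API: the tier names and the `test_skill`/`refine_skill` callbacks are placeholders for whatever harness you use.

```python
# Progressive model downgrade: harden the skill at each tier
# before stepping down to a smaller model.
MODEL_TIERS = ["opus", "sonnet", "haiku"]  # largest to smallest (placeholders)

def downgrade_test(skill, test_skill, refine_skill, max_iterations=10):
    """Walk the tier ladder; at each tier, iterate on the skill until
    test_skill(skill, model) passes, then descend.

    Returns the smallest tier the skill passed at (the model floor),
    or None if it never passed at any tier.
    """
    floor = None
    for model in MODEL_TIERS:
        for _ in range(max_iterations):
            if test_skill(skill, model):
                floor = model  # reliable at this tier; try a smaller one
                break
            skill = refine_skill(skill, model)  # fix defects the failure exposed
        else:
            return floor  # couldn't make this tier reliable; floor stays above it
    return floor
```

The returned floor is exactly the "actual floor" discussed later: the smallest tier where the instructions, not the model, carry the load.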

What Each Tier Exposes
| Model Tier | What It Catches |
|---|---|
| Same model (Opus) | Logical gaps, missing steps, wrong tool references |
| Mid-tier (Sonnet) | Ambiguous phrasing, implicit assumptions, under-specified edge cases |
| Small model (Haiku) | Vague directions, broken references, conflicting instructions, missing context |
| Minimal model | Everything that isn’t completely explicit and self-contained |
Why This Works
This pattern exists everywhere in human product design: the food industry, medical devices, military equipment. The principle is the same: for a thing to work, it needs to work under the worst plausible circumstances. Think of your newest employee, a fresh recruit on their first day, in the middle of the busy rush.
A smaller model is that fresh recruit. It follows instructions literally. It doesn’t fill gaps with common sense. It doesn’t search for docs that are mentioned but not linked. It doesn’t push past broken references.
That’s exactly what makes it useful for debugging. Bad directions produce bad outcomes, and the bad outcomes point directly at the instruction defect.

The Debugging Signal
When a big model fails, the failure mode is subtle. It might silently resolve an ambiguity the wrong way, producing output that looks correct but isn’t. You have to carefully inspect the result to notice.
When a small model fails, the failure mode is loud. It gets stuck, produces obviously wrong output, or halts entirely. The failure points straight at the problem in your instructions.
Opus failure: "Completed successfully" (but did it wrong silently)
Haiku failure: "Error: cannot find referenced file X" (instruction defect exposed)
Implementation
Step 1: Automated Skill Testing Loop
Have your best model manage the entire process. It writes the skill, spawns a sub-agent with the target model to attempt the skill, reads the logs and results, and iterates on the skill to fix any problems.
Opus (orchestrator):
1. Write/refine the skill
2. Spawn Sonnet sub-agent to attempt the skill
3. Read sub-agent logs and output
4. Identify instruction defects from failures
5. Refine the skill
6. Repeat until sub-agent succeeds twice consecutively
7. Drop to Haiku and repeat
The “twice consecutively” threshold matters. LLMs are stochastic. A single success might be luck. Two in a row suggests the instructions are genuinely robust at that model tier.
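The orchestrator loop with the two-in-a-row threshold can be sketched as follows. The `spawn_subagent` and `refine` callbacks are assumptions standing in for your actual sub-agent harness, not a real library API.

```python
REQUIRED_CONSECUTIVE = 2  # one success may be luck; two in a row is signal

def harden_at_tier(skill, model, spawn_subagent, refine, max_rounds=20):
    """Iterate on a skill until the target model succeeds twice consecutively.

    spawn_subagent(skill, model) is assumed to run the skill and return
    (success: bool, log: str). Returns the hardened skill, or None if it
    never stabilizes within max_rounds.
    """
    streak = 0
    for _ in range(max_rounds):
        success, log = spawn_subagent(skill, model)
        if success:
            streak += 1
            if streak >= REQUIRED_CONSECUTIVE:
                return skill  # robust at this tier; safe to drop a tier
        else:
            streak = 0  # reset: a lucky pass followed by a fail doesn't count
            skill = refine(skill, log)  # use the failure log to fix defects
    return None  # couldn't stabilize; escalate or accept a higher floor
```

Note that the streak resets on any failure, so an alternating pass/fail pattern never terminates early; only genuine back-to-back successes end the loop.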
Step 2: Classify the Failures
Not all failures are instruction defects. Categorize what you find:
| Failure Type | Fix Location |
|---|---|
| Missing context | Add to skill preamble or link referenced docs |
| Ambiguous phrasing | Rewrite with explicit, concrete language |
| Implicit assumptions | Make assumptions explicit in the instructions |
| Capability gap | The model genuinely can’t do this. Accept a higher model floor |
| Tool misuse | Add tool usage examples or constraints |
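A first-pass version of this triage can be simple keyword matching over sub-agent logs. This is a toy sketch with illustrative patterns; in practice the orchestrator model's own judgment does the real classification.

```python
# Map failure-log symptoms to the fix locations in the table above.
FAILURE_PATTERNS = {
    "cannot find": "missing_context",   # broken reference -> link the doc
    "not found": "missing_context",
    "which tool": "tool_misuse",        # unclear tool choice -> add examples
    "ambiguous": "ambiguous_phrasing",  # -> rewrite with concrete language
    "conflict": "implicit_assumptions", # -> make the assumption explicit
}

def classify_failure(log: str) -> str:
    """Return a coarse failure category for a sub-agent log message."""
    lowered = log.lower()
    for pattern, category in FAILURE_PATTERNS.items():
        if pattern in lowered:
            return category
    # Nothing instruction-shaped matched: possibly a genuine model limit.
    return "capability_gap"
```

Defaulting the unmatched case to `capability_gap` is deliberately conservative: it flags the failure for human review rather than silently blaming the instructions.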
Step 3: Accept the Model Floor
Some skills genuinely require a certain level of reasoning. A complex architectural review might need Opus regardless of how well you write the instructions. That’s fine. The point isn’t that every skill must work on Haiku. The point is that you’ve found the actual floor and you know why it exists.
The floor should be a capability limitation, not an instruction quality problem.
What Gets Fixed
Teams that run this process tend to surface the same categories of improvement:
Broken references. The skill mentions a file, config, or doc that doesn’t exist or has moved. Opus finds it anyway. Haiku doesn’t.
Conflicting directions. Step 3 says “always use the staging endpoint” but step 7 says “deploy to production.” Opus resolves the contradiction by inferring context. Haiku follows whichever instruction it encounters last.
Implicit sequencing. Steps that must happen in order but aren’t marked as dependent. Opus infers the dependency. Haiku runs them in whatever order it encounters them.
Missing tool specifications. “Search the codebase for X” without specifying which tool (Grep vs Glob vs Agent). Opus picks the right one. Haiku guesses.
Under-specified outputs. “Return the results” without format, location, or structure. Opus produces something reasonable. Haiku produces something technically compliant but useless.
Connection to Existing Patterns
This pattern sits at the intersection of several ideas in the knowledge base:
Sub-agents as test runners. The orchestrator/worker split from Sub-agents: Accuracy vs Latency applies directly. The orchestrator (Opus) manages the loop. The worker (Sonnet/Haiku) is the test subject.
Skill bootstrapping quality. Agent Skill Bootstrapping describes agents creating their own skills. Model downgrade testing is the quality gate for those bootstrapped skills. An agent-created skill that only works with the creating model isn’t reusable.
Constraint-first thinking. Constraint-First Development says to define what must be true, then let the system find how. Here the constraint is “this skill must work at Sonnet tier.” The refinement loop finds the how.
Monte Carlo hardening. Monte Carlo Quality Assurance uses repeated passes to surface stochastic failures. Model downgrade testing adds a second dimension. You’re not just running the same model multiple times. You’re running different capability levels to surface different failure classes.
Key Insight
Soldier-proofing is not about dumbing down your skills. It’s about making the instructions carry the intelligence instead of relying on the model to supply it. The model should execute, not interpret.
The whole process should be managed by your best model, ideally without your input. Even if only Opus can complete the task in the end, the refinement loop will find places to save tokens, clarify intent, and eliminate silent failure modes.
Related
- Sub-agents: Accuracy vs Latency – Model tier selection for sub-agent workers
- Agent Skill Bootstrapping – Agents creating their own skills (which need hardening)
- Constraint-First Development – Define constraints, let the system find solutions
- Monte Carlo Quality Assurance – Stochastic hardening through repeated passes
- Runtime Skills – Skills as the unit being tested
- Model Switching Strategy – When to use which model tier
- Actor-Critic Adversarial Coding – Adversarial patterns for quality assurance
References
- Steve Ruiz (@steveruizok): Soldier-proofing agent skills – Original thread describing the progressive model downgrade pattern (March 2026)

