Model Downgrade Testing Hardens Agent Skills

James Phoenix

If your skill only works with Opus, you don’t have a good skill. You have a good model compensating for bad instructions.

Author: James Phoenix | Date: March 2026


The Problem

You write a skill. You test it with your best model. It works. You ship it.

Months later, it fails on a run where the model resolves an ambiguity differently. Or a teammate uses a smaller model and gets garbage output. Or you swap to a cheaper model for cost reasons and everything breaks.

The skill was never robust. The model was just smart enough to paper over the gaps. Big models like Opus can push past broken references, resolve conflicting directions, and guess your intent when the instructions are vague. That’s great for completing the task. It’s terrible for debugging the instructions.

You don’t notice the problems because the model is clever enough to route around them.


The Pattern: Progressive Model Downgrade

Test your skills with progressively smaller models. Each downgrade exposes a new class of instruction quality issues.

Write skill
    → Test with Opus (same tier that wrote it)
    → Iterate until reliable
    → Test with Sonnet
    → Iterate until reliable
    → Test with Haiku
    → Iterate until reliable
    → (Optional) Test with smallest viable model

Each step down strips away a layer of model intelligence, forcing the instructions to carry more of the load. If Haiku can complete the skill reliably, the instructions are genuinely clear. If only Opus can do it, the instructions are leaking intent that the model has to reconstruct.

[Figure: Progressive model downgrade funnel. Opus catches structural errors, Sonnet catches ambiguity, Haiku catches everything implicit.]

What Each Tier Exposes

Model Tier            What It Catches
Same model (Opus)     Logical gaps, missing steps, wrong tool references
Mid-tier (Sonnet)     Ambiguous phrasing, implicit assumptions, under-specified edge cases
Small model (Haiku)   Vague directions, broken references, conflicting instructions, missing context
Minimal model         Everything that isn’t completely explicit and self-contained

Why This Works

This pattern exists everywhere in human product design: the food industry, medical devices, military equipment. The principle is the same: for a thing to work, it needs to work in the worst-plausible circumstances. Think of your newest employee: a fresh recruit, first day on the job, in the middle of a busy rush.

A smaller model is that fresh recruit. It follows instructions literally. It doesn’t fill gaps with common sense. It doesn’t search for docs that are mentioned but not linked. It doesn’t push past broken references.

That’s exactly what makes it useful for debugging. Bad directions produce bad outcomes, and the bad outcomes point directly at the instruction defect.

[Figure: Big model vs small model failure signals across three scenarios: broken references, conflicting steps, and missing tool specs.]

The Debugging Signal

When a big model fails, the failure mode is subtle. It might silently resolve an ambiguity the wrong way, producing output that looks correct but isn’t. You have to carefully inspect the result to notice.

When a small model fails, the failure mode is loud. It gets stuck, produces obviously wrong output, or halts entirely. The failure points straight at the problem in your instructions.

Opus failure:    "Completed successfully" (but did it wrong silently)
Haiku failure:   "Error: cannot find referenced file X" (instruction defect exposed)

Implementation

Step 1: Automated Skill Testing Loop

Have your best model manage the entire process. It writes the skill, spawns a sub-agent with the target model to attempt the skill, reads the logs and results, and iterates on the skill to fix any problems.

Opus (orchestrator):
  1. Write/refine the skill
  2. Spawn Sonnet sub-agent to attempt the skill
  3. Read sub-agent logs and output
  4. Identify instruction defects from failures
  5. Refine the skill
  6. Repeat until sub-agent succeeds twice consecutively
  7. Drop to Haiku and repeat

The “twice consecutively” threshold matters. LLMs are stochastic. A single success might be luck. Two in a row suggests the instructions are genuinely robust at that model tier.
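The orchestrator loop, including the consecutive-success threshold, can be sketched as below. The `attempt_skill()` and `refine_skill()` callbacks are hypothetical stand-ins for spawning a sub-agent and analyzing its logs; only the control flow is the point.

```python
def harden_skill(skill: str, attempt_skill, refine_skill,
                 required_streak: int = 2, max_rounds: int = 20) -> str:
    """Refine `skill` until the sub-agent succeeds `required_streak`
    times in a row, guarding against lucky one-off successes.

    attempt_skill(skill) -> (succeeded: bool, logs: str)   # hypothetical
    refine_skill(skill, logs) -> str                       # hypothetical
    """
    streak = 0
    for _ in range(max_rounds):
        ok, logs = attempt_skill(skill)
        if ok:
            streak += 1
            if streak >= required_streak:
                return skill       # robust at this tier
        else:
            streak = 0             # any failure resets the streak
            skill = refine_skill(skill, logs)
    raise RuntimeError("skill did not stabilize within max_rounds")
```

Resetting the streak on any failure is the stochasticity guard: a success followed by a failure is treated as no evidence of robustness at all.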

Step 2: Classify the Failures

Not all failures are instruction defects. Categorize what you find:

Failure Type            Fix Location
Missing context         Add to skill preamble or link referenced docs
Ambiguous phrasing      Rewrite with explicit, concrete language
Implicit assumptions    Make assumptions explicit in the instructions
Capability gap          The model genuinely can’t do this; accept a higher model floor
Tool misuse             Add tool usage examples or constraints
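One hedged way to wire this classification into the automated loop is a pattern-matching pass over sub-agent logs. The log substrings below are invented examples of failure signatures, not real agent output; in practice the orchestrator model would do this classification itself.

```python
# Invented (pattern, category) pairs mapping failure signatures to the
# table above. First match wins.
FAILURE_RULES = [
    ("cannot find referenced", "missing_context"),
    ("which endpoint", "ambiguous_phrasing"),
    ("no tool specified", "tool_misuse"),
]

def classify_failure(log: str) -> str:
    """Map a failure log to a fix category; unmatched failures are
    treated as potential capability gaps for human review."""
    log = log.lower()
    for pattern, category in FAILURE_RULES:
        if pattern in log:
            return category
    return "capability_gap"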

Step 3: Accept the Model Floor

Some skills genuinely require a certain level of reasoning. A complex architectural review might need Opus regardless of how well you write the instructions. That’s fine. The point isn’t that every skill must work on Haiku. The point is that you’ve found the actual floor and you know why it exists.

The floor should be a capability limitation, not an instruction quality problem.


What Gets Fixed

Teams that run this process consistently find the same categories of improvement:

Broken references. The skill mentions a file, config, or doc that doesn’t exist or has moved. Opus finds it anyway. Haiku doesn’t.

Conflicting directions. Step 3 says “always use the staging endpoint” but step 7 says “deploy to production.” Opus resolves the contradiction by inferring context. Haiku follows whichever instruction it encounters last.

Implicit sequencing. Steps that must happen in order but aren’t marked as dependent. Opus infers the dependency. Haiku runs them in whatever order it encounters them.

Missing tool specifications. “Search the codebase for X” without specifying which tool (Grep vs Glob vs Agent). Opus picks the right one. Haiku guesses.

Under-specified outputs. “Return the results” without format, location, or structure. Opus produces something reasonable. Haiku produces something technically compliant but useless.
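One way to harden an under-specified output instruction is to pin the expected structure in the skill and verify it mechanically after each sub-agent run. The field names and types here are invented for illustration, assuming the skill asks for JSON-like output.

```python
# Hypothetical output spec: field name -> required type.
REQUIRED_FIELDS = {"matches": list, "output_path": str}

def validate_output(result: dict) -> list[str]:
    """Return a list of spec violations; an empty list means the
    output meets the (illustrative) spec."""
    defects = []
    for field, typ in REQUIRED_FIELDS.items():
        if field not in result:
            defects.append(f"missing field: {field}")
        elif not isinstance(result[field], typ):
            defects.append(f"wrong type for {field}")
    return defects
```

A check like this turns “technically compliant but useless” output into a loud, classifiable failure, which is exactly the debugging signal the pattern relies on.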


Connection to Existing Patterns

This pattern sits at the intersection of several ideas in the knowledge base:

Sub-agents as test runners. The orchestrator/worker split from Sub-agents: Accuracy vs Latency applies directly. The orchestrator (Opus) manages the loop. The worker (Sonnet/Haiku) is the test subject.

Skill bootstrapping quality. Agent Skill Bootstrapping describes agents creating their own skills. Model downgrade testing is the quality gate for those bootstrapped skills. An agent-created skill that only works with the creating model isn’t reusable.

Constraint-first thinking. Constraint-First Development says to define what must be true, then let the system find how. Here the constraint is “this skill must work at Sonnet tier.” The refinement loop finds the how.

Monte Carlo hardening. Monte Carlo Quality Assurance uses repeated passes to surface stochastic failures. Model downgrade testing adds a second dimension. You’re not just running the same model multiple times. You’re running different capability levels to surface different failure classes.


Key Insight

Soldier-proofing is not about dumbing down your skills. It’s about making the instructions carry the intelligence instead of relying on the model to supply it. The model should execute, not interpret.

The whole process should be managed by your best model, ideally without your input. Even if only Opus can complete the task in the end, the refinement loop will find places to save tokens, clarify intent, and eliminate silent failure modes.

