A skill you cannot measure against no skill is just a prompt you felt good about. The whole value of a skill is the delta it buys you over a bare model, and that delta only shows up when you run both sides of the comparison.
The Problem With How Most People Ship Skills
Writing a skill is cheap now. Claude Code, the Agent SDK, and every internal harness I touch make it trivial to drop a SKILL.md in a folder and claim I have taught the agent something new. The problem is that almost nobody measures whether the skill actually helped. They try it once on a friendly prompt, see a good output, and ship it.
That is the worst possible feedback loop. The model was probably going to handle the friendly prompt on its own. The skill might be adding nothing. Worse, it might be burning tokens, slowing the run, and nudging the agent toward a narrower solution than it would have picked for free. Without a baseline, I cannot tell the difference between a skill that is earning its keep and a skill that is quietly making everything worse.
The fix is not complicated. It is eval-driven iteration, the same loop I run on prompts, models, and retrieval systems. I owe the concrete shape of the loop below to Anthropic’s Evaluating skill output quality guide, which is the cleanest writeup I have seen. The opinions are mine.
The Core Move: Always Compare Against No Skill
If I remember one thing from this whole piece, it is this. Every test case runs twice: once with the skill loaded, once without. The without run is my baseline. The delta between the two is the only honest measurement of whether the skill is pulling its weight.
This sounds obvious and I still catch myself skipping it. It is tempting to look at a nice with-skill output and call it done. The discipline is to force myself to read the bare-model output too, because half the time the bare model produced something just as good.
When I am iterating on an existing skill instead of writing a new one, the same rule applies. I snapshot the previous version and treat it as the baseline. The delta I care about is now improvement over my last release, not improvement over nothing.
What a Test Case Actually Contains
A test case has three parts, and they are all about realism.
- A prompt that sounds like a real user request, not a cleaned-up spec. Real users mention filenames, paste half an error message, and forget to say what format they want the answer in.
- A plain English description of what success looks like. No pass or fail checks yet. Those come after I have seen what the model actually produces.
- Any input files the skill needs to touch, checked into the skill directory.
Here is the shape I use for a skill I am building around video transcript summarisation.
```python
# evals/evals.py
EVALS = [
    {
        "id": "transcript-long-interview",
        "prompt": "here's a 90 min interview transcript in data/interview.vtt, "
                  "can you pull the 5 strongest quotes and say why each one matters?",
        "expected_output": "Five quoted passages with speaker attribution and a "
                           "one-sentence rationale per quote.",
        "files": ["evals/files/interview.vtt"],
    },
    {
        "id": "transcript-noisy-autocaptions",
        "prompt": "transcript is rough, lots of [inaudible] and filler. "
                  "give me a clean summary focused on decisions and owners.",
        "expected_output": "Decision-focused summary that ignores filler "
                           "and flags anything unclear rather than inventing it.",
        "files": ["evals/files/meeting-autocap.vtt"],
    },
]
```
I start with three cases, not thirty. The goal of the first round is to see how the skill behaves, not to build a leaderboard. I will double the set in iteration two, once I know where the interesting failures live.
One of those three cases must be an edge case. A malformed file, a vague request, a prompt that pushes on the exact ambiguity I suspect is lurking in my SKILL.md. The boring cases teach me nothing.
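One shape such an edge case can take, as a third entry for the EVALS list above. The truncated-file scenario and filenames here are my own illustration, not part of the guide:

```python
# A deliberately hostile third case: a file that was cut off mid-upload.
# The scenario and filenames are illustrative.
EDGE_CASE = {
    "id": "transcript-truncated-file",
    "prompt": "data/broken.vtt got cut off mid-upload i think, "
              "summarise whatever is actually in there",
    "expected_output": "A summary that explicitly flags the truncation "
                       "instead of pretending the transcript is complete.",
    "files": ["evals/files/broken.vtt"],
}
```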
Isolate Every Run
The rule I break most often is context hygiene. Every eval run must start from a clean slate. No leftover memory from the last attempt, no half-remembered instructions from the session where I wrote the skill. If the run inherits context, it inherits bias, and the measurement stops meaning anything.
In Claude Code I get this for free by spawning each run as a subagent. In the Agent SDK I spin up a fresh session per run. In a plain script I launch a new process per case. Whatever it takes, the agent in the run has only two inputs: the skill (or not) and the prompt.
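In script form, the isolation rule can be as blunt as one fresh OS process per (case, configuration) pair. A minimal sketch: the child process here just echoes its config back, and stands in for whatever real agent invocation your harness uses:

```python
import json
import subprocess
import sys

def run_case(case: dict, with_skill: bool) -> dict:
    """Run a single eval case in a brand-new process.

    The child inherits nothing from this session: its only input is the
    payload on stdin -- the prompt and whether the skill is loaded.
    """
    payload = json.dumps({"prompt": case["prompt"], "with_skill": with_skill})
    child = subprocess.run(
        [sys.executable, "-c",
         # Stand-in for the real agent call: read the config, echo it back.
         "import sys, json; print(json.dumps(json.load(sys.stdin)))"],
        input=payload, capture_output=True, text=True, check=True,
    )
    return json.loads(child.stdout)
```

Swapping the inline stand-in for a real Claude Code or SDK invocation preserves the guarantee: each run sees only the skill (or not) and the prompt.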
Assertions Come Second
The mistake I used to make was writing assertions before running the skill. I would guess what the output ought to look like, bake those guesses into pass or fail checks, and then wonder why my scores were useless. Half the assertions were wrong, and the other half were hitting things the model would always do regardless.
Now I run the first round with just the prompt and the expected output, read the real outputs, and then write the assertions. An assertion earns its place if it is concrete, verifiable, and would actually flip to fail on a bad output.
Good:
- The output file parses as valid JSON.
- The chart has labeled axes and exactly three bars.
- The summary attributes every quote to a named speaker.
Bad:
- The output is good.
- The summary uses the exact phrase “key decisions.”
Some qualities refuse to decompose into pass or fail. Writing style, whether the tone matches the user, whether the chart is pleasant to look at. I do not force those into assertions. I save them for the human review step.
Grading With Evidence, Not Vibes
Grading is where LLM judges earn their keep. I hand the judge the assertion, the output, and ask it to produce a pass or fail plus a short piece of evidence quoting or pointing at the output. Assertions that can be checked mechanically, like “file parses as JSON” or “exactly five rows,” go to a Python script instead. Scripts do not hallucinate.
Two discipline points matter here.
First, do not give the benefit of the doubt. If an assertion says “includes a rationale per quote” and the output has a section with the label but no real reasoning, that is a fail. The label is not the content.
Second, grade the assertion itself while grading the output. If an assertion is always passing in both with-skill and without-skill runs, it is dead weight and I kill it for the next iteration. If it is always failing in both, either the assertion is broken or the task is impossible and I need to fix whichever one it is.
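That second check is mechanical too. A sketch of the triage rule, given per-run pass lists for one assertion in each configuration:

```python
def triage_assertion(with_skill: list[bool], without_skill: list[bool]) -> str:
    """Classify an assertion by the signal it carries across configurations."""
    if all(with_skill) and all(without_skill):
        return "dead weight"           # always passes everywhere: delete it
    if not any(with_skill) and not any(without_skill):
        return "broken or impossible"  # fix the assertion or the task
    return "informative"
```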
Aggregate Into a Delta, Not a Score
Once every run is graded, I roll them up into a single benchmark per iteration. The fields I care about are pass rate, mean time per run, and mean tokens per run, computed separately for with-skill and without-skill, plus the delta between the two.
The delta is the whole story. A skill that lifts pass rate by fifty points while adding ten seconds and two thousand tokens is almost always worth it. A skill that lifts pass rate by three points while doubling token cost is not. I would rather delete that skill and let the bare model handle it.
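A sketch of the roll-up, assuming each graded run records its configuration, pass/fail result, wall time, and token count:

```python
from statistics import mean

def benchmark(runs: list[dict]) -> dict:
    """Aggregate graded runs into per-config stats plus the delta."""
    def agg(config: str) -> dict:
        rs = [r for r in runs if r["config"] == config]
        return {
            "pass_rate": mean(1.0 if r["passed"] else 0.0 for r in rs),
            "mean_seconds": mean(r["seconds"] for r in rs),
            "mean_tokens": mean(r["tokens"] for r in rs),
        }
    with_skill = agg("with_skill")
    without = agg("without_skill")
    # A big positive pass-rate delta justifies a modest time/token cost.
    delta = {k: with_skill[k] - without[k] for k in with_skill}
    return {"with_skill": with_skill, "without_skill": without, "delta": delta}
```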
Patterns, Not Averages
Averages lie. After I compute the benchmark I pull up the individual cases and look for structure.
- Assertions that pass with the skill and fail without it are the skill’s actual value. I study those transcripts until I understand exactly which instruction or script did the work.
- Assertions that pass in both configurations are candidates for deletion.
- Assertions that are flaky, passing sometimes and failing others on the same prompt, point to either prompt ambiguity or instructions in the skill that leave too much room for interpretation.
- Runs that are three times slower than their peers get their transcripts read end to end. The bottleneck is almost always a single over-cautious instruction that sent the agent down a rabbit hole.
Humans Still Review
Assertions catch what I thought to check. A human reviewer catches everything else. For each test case I read the actual output alongside the grade and write a one-line note. If I have nothing to say, the note is empty and the case passed review. If I do have something to say, I make it specific enough to act on. “The chart months are in alphabetical order instead of chronological” is actionable. “Feels off” is not.
The Iteration Loop
After grading, review, and pattern analysis I have three signals: failed assertions, human notes, and execution transcripts. The most efficient way to turn them into skill improvements is to hand all three, along with the current SKILL.md, to a capable model and ask it to propose a revision. The model can spot patterns across runs that would take me an hour to find by hand.
The guidance I give the revising model is the same every time:
- Generalise from the feedback. The skill needs to handle prompts I did not test.
- Keep the skill lean. Fewer, sharper instructions beat a rulebook.
- Explain the why. “Do X because otherwise Y happens” follows more reliably than “always do X.”
- Bundle repeated work. If three runs all wrote the same helper script, that script belongs in the skill, not in the transcripts.
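Assembling that hand-off is a string-building exercise. A sketch; the section headers and guidance wording are my own, not a prescribed format:

```python
GUIDANCE = """\
Revise the SKILL.md below using the eval feedback.
- Generalise from the feedback; the skill must handle untested prompts.
- Keep it lean: fewer, sharper instructions beat a rulebook.
- Explain the why behind each rule, not just the rule.
- Bundle any helper work repeated across runs into the skill itself.
"""

def build_revision_prompt(skill_md: str, failed_assertions: list[str],
                          human_notes: list[str], transcripts: list[str]) -> str:
    """Combine the three feedback signals and the current skill into one prompt."""
    parts = [GUIDANCE, "## Current SKILL.md\n" + skill_md]
    parts.append("## Failed assertions\n" + "\n".join(failed_assertions))
    parts.append("## Human notes\n" + "\n".join(n for n in human_notes if n))
    parts.append("## Transcripts\n" + "\n\n".join(transcripts))
    return "\n\n".join(parts)
```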
Then I apply the changes, rerun every test case into a new iteration-N/ folder, grade, aggregate, review, and repeat. I stop when the delta plateaus, the human notes come back empty, or I have hit the point where the skill is earning its keep and further tuning is not worth the evening.
Why This Matters
Skills are the smallest unit of durable compounding in an agent workflow. Every skill I ship that actually earns its delta makes every future run a little cheaper and a little better, forever. Every skill I ship on vibes is a hidden tax that I will pay on every future run until someone notices. The eval loop is the difference between those two outcomes, and it is the cheapest insurance I know.
Related
- Auto-Harness Synthesis
- The Session Audit Meta-Prompt
- Agent-Driven Development
- Evaluator-Optimizer and Evolutionary Search
Sources
- Anthropic, Evaluating skill output quality, agentskills.io
- skill-creator, a reference skill that automates much of the loop

