Whenever I am stuck in an agent loop chasing the same failure mode for the third time, the bug is rarely the agent. The bug is that I have not stepped back to ask which lever I should actually be pulling.
Author: James Phoenix | Date: May 2026
The Loop You Have to Break Out Of
Most of the time I lose with agents is not lost on hard problems. It is lost in a tight loop where I tweak a prompt, run it, watch it fail in roughly the same way, tweak the prompt again, and convince myself I am making progress. Three hours later the agent is no better and I am one cup of coffee deeper into a local minimum.
The fix is not “be smarter inside the loop”. The fix is to step out and ask which dial I am holding still without realising it. There are six of them worth turning. Knowing the menu is what stops me wasting another afternoon on prompt fiddling when the real problem is the model, the tools, or the topology around it.
The Six Levers
1. Context
Context is everything that reaches the model at inference time that is not the user’s current message. System prompt, retrieved memory, file contents, tool outputs, prior turns, skill files, instructions. If the model is not seeing the right thing at the right moment, no amount of cleverness elsewhere fixes it.
I treat context as the first place to look when an agent gets a fact wrong. Nine times out of ten the model never had the fact in front of it, or it did but it was buried under noise. Smarter retrieval, better chunking, progressive disclosure of skills, and stricter pruning of stale turns all live here. See Mutual Information for Context for how to actually measure whether more context is helping or just inflating the bill.
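To make "stricter pruning of stale turns" concrete, here is a minimal sketch of budget-aware context packing. Everything in it is illustrative rather than a real library: the `ContextItem` fields, the relevance scores, and the age cutoff are stand-ins for whatever your retrieval layer already produces.

```python
from dataclasses import dataclass

@dataclass
class ContextItem:
    text: str
    tokens: int          # pre-computed token count
    relevance: float     # retrieval score, 0..1
    age_turns: int       # how many turns ago this was produced

def assemble_context(items: list[ContextItem], budget: int, max_age: int = 20) -> str:
    # Drop stale turns outright, then pack by relevance until the budget runs out.
    fresh = [i for i in items if i.age_turns <= max_age]
    fresh.sort(key=lambda i: i.relevance, reverse=True)
    picked, used = [], 0
    for item in fresh:
        if used + item.tokens > budget:
            continue
        picked.append(item)
        used += item.tokens
    # Re-order chronologically so the model reads a coherent history, not a ranked jumble.
    picked.sort(key=lambda i: i.age_turns, reverse=True)
    return "\n\n".join(i.text for i in picked)
```

The detail worth copying is the last sort: select by relevance, but present in order, so the model sees a history rather than a leaderboard.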
2. Model
Which weights are doing the inference. Opus, Sonnet, Haiku, GPT-5, Gemini, a self-hosted Qwen, a fine-tuned variant. Models are not interchangeable. A small smart model with a great harness usually beats a large dumb model with a bad one, but the gap between Haiku and Opus on hard reasoning is still real and measurable.
Model swaps are the cheapest experiment I have access to. One config line, rerun the eval, see what happens. I do this far more than I should need to, because most agent failures I assumed were “the model is dumb here” turn out to be the prompt or the tools, and most failures I assumed were “the prompt needs work” turn out to be the model picking the wrong path. See Model Switching Strategy and Model Downgrade Testing.
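As a sketch of how cheap that experiment is: the swap is one field in a config, and everything else stays fixed. `run_eval`, the config shape, and the scoring are stand-ins for whatever harness you already have, not a prescribed API.

```python
from typing import Callable

def compare_models(run_eval: Callable[[dict], float], baseline: dict, candidate_model: str) -> None:
    # The model name is the only thing that changes; scenario, tools, and prompt stay fixed.
    candidate = {**baseline, "model": candidate_model}
    base_score = run_eval(baseline)
    cand_score = run_eval(candidate)
    print(f"{baseline['model']}: {base_score:.2f}  ->  {candidate_model}: {cand_score:.2f} "
          f"({cand_score - base_score:+.2f})")
```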
3. Tools
What the agent can actually do. Filesystem access, bash, browser, MCP servers, custom CLIs, linters, test runners, deploy hooks. A model that can run code and read its own output is a different species from one that cannot. See Agent Capabilities: Tools & Eyes.
The mistake I keep making here is adding tools rather than removing them. More tools means more surface area for the model to misuse, and more tokens spent describing options it will not pick. The right move is usually fewer, sharper tools that close the loop on a specific failure. If the agent keeps writing broken SQL, give it a query --explain tool, not seven new database tools.
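Here is what a single sharp tool can look like, sketched against SQLite. The spec uses the JSON-schema shape most tool-use APIs accept (field names vary slightly by provider), and the handler is illustrative:

```python
import sqlite3

# One narrow tool instead of seven broad ones: let the agent see the query plan
# for the SQL it just wrote, so it can check index usage before executing for real.
EXPLAIN_TOOL = {
    "name": "explain_query",
    "description": "Run EXPLAIN QUERY PLAN on a SQL statement and return the plan.",
    "input_schema": {
        "type": "object",
        "properties": {"sql": {"type": "string", "description": "The SQL statement to explain."}},
        "required": ["sql"],
    },
}

def explain_query(conn: sqlite3.Connection, sql: str) -> str:
    # EXPLAIN QUERY PLAN is SQLite's lightweight plan output; other databases have equivalents.
    rows = conn.execute(f"EXPLAIN QUERY PLAN {sql}").fetchall()
    return "\n".join(str(row) for row in rows)
```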
4. Prompt
How the task is framed. Instructions, examples, output format, role priming, constraints, refusals, examples of failure, chain-of-thought scaffolds. Prompt is the lever everyone reaches for first because it is the cheapest to change. That makes it the most overfit lever in the stack.
The biggest gains I have seen from prompt work are not from adding clever phrases. They are from being explicit about output format, providing one good worked example, and removing instructions that contradict each other. See Writing a Good CLAUDE.md.
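A sketch of what that looks like in practice. The task and the fields are invented; the point is the shape: a pinned output format and exactly one worked example.

```python
# Illustrative prompt template; {pr_description} is filled in at call time.
PROMPT = """You are reviewing a pull request description.

Return ONLY a JSON object with exactly these fields:
  "summary":  one sentence, plain English
  "risk":     one of "low", "medium", "high"
  "missing":  list of strings, empty if nothing is missing

Example input:
  "Bump the retry limit from 3 to 5 in the payments worker."
Example output:
  {"summary": "Raises the payments worker retry limit from 3 to 5.",
   "risk": "medium",
   "missing": ["No mention of how duplicate charges are prevented on retry."]}

Input:
{pr_description}
"""
```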
5. Orchestration
How multiple calls and multiple agents compose. A single forward pass is one topology. Actor-critic is another. A swarm of three reviewers voting on a writer’s output is another. Recursion, where an agent invokes a fresh agent on a sub-problem and returns the result, is another. Team topology, where a planner delegates to specialised sub-agents, is another.
Orchestration is the lever I underuse. When prompt and tools have hit a wall, the answer is often to stop trying to make one model do the whole job and split it. Generator and evaluator. Writer and critic. Planner and executor. See The Actor-Critic Pattern, Agent Swarm Patterns, and Generator-Evaluator Harness Design.
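A minimal generator-evaluator loop, to show how small the topology change can be. `generate` and `critique` stand in for whatever model calls you already make; nothing here is tied to a specific framework.

```python
from typing import Callable

def generator_evaluator(
    generate: Callable[[str], str],                     # task -> draft
    critique: Callable[[str, str], tuple[bool, str]],   # (task, draft) -> (accepted, feedback)
    task: str,
    max_rounds: int = 3,
) -> str:
    draft = generate(task)
    for _ in range(max_rounds):
        accepted, feedback = critique(task, draft)
        if accepted:
            return draft
        # Feed the critic's feedback back into the generator and try again.
        draft = generate(f"{task}\n\nPrevious attempt was rejected because:\n{feedback}")
    return draft  # best effort after max_rounds
```

The critic gets explicit criteria and no ability to edit; the generator gets the feedback verbatim. Keeping the two roles in separate calls is the whole trick.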
6. Fine-tuning
When prompting hits its ceiling and you genuinely need the weights to behave differently, you train them. LoRA adapters, full fine-tunes, reinforcement fine-tuning on graded outputs. This is the most expensive lever and almost always the wrong first move. It is also the right last move when you have run out of room on the other five.
Fine-tuning earns its place when a behaviour is needed so consistently that re-explaining it in every prompt stops being worth it, when token cost dominates and a smaller fine-tuned model can replace a larger general one, or when a niche domain requires vocabulary the base model cannot reach.
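When it does earn its place, the cheapest entry point is usually a LoRA adapter on a small open model rather than a full fine-tune. A minimal sketch using the peft library; the base model and hyperparameters are illustrative and would need tuning for a real job:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model; any small open checkpoint works the same way.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
config = LoraConfig(
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # typically well under 1% of the base weights
# Training on graded outputs then proceeds with whatever trainer you already use.
```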
The Eval Wrapper Around All Six
Pulling a lever is not the same as knowing it helped. Without measurement, every change is a vibe.
The frame I keep coming back to is:
evals(scenario) over (model, tools, prompt, orchestration, fine-tuning)
-> Experiment<Result>
Context is the scenario. It is the dataset, the inputs, the ground truth, the slice of reality you care about. The other five are the dials you actually turn. An eval pass is a controlled experiment: hold the scenario fixed, change one lever, measure the outcome, decide whether to keep it.
This is the difference between “I think the new prompt is better” and “the new prompt scores 14 percentage points higher on the regression suite, with overlap in the confidence intervals on the easy slice and a clear win on the hard slice”. One is conversation. The other is evidence.
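A sketch of what turns that sentence into evidence: paired per-item scores from two runs over the same scenario, compared with a bootstrap confidence interval instead of a gut feeling. The function and its inputs are illustrative, not a particular eval framework.

```python
import random
from statistics import mean

def bootstrap_delta(baseline: list[float], candidate: list[float],
                    n_resamples: int = 2000, seed: int = 0) -> tuple[float, float, float]:
    # Per-item scores must be paired: same scenario items, same order, two configs.
    assert len(baseline) == len(candidate), "paired scores over the same scenario"
    rng = random.Random(seed)
    point = mean(candidate) - mean(baseline)
    deltas = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each baseline/candidate pair together.
        idx = [rng.randrange(len(baseline)) for _ in baseline]
        deltas.append(mean(candidate[i] for i in idx) - mean(baseline[i] for i in idx))
    deltas.sort()
    lo, hi = deltas[int(0.025 * n_resamples)], deltas[int(0.975 * n_resamples)]
    return point, lo, hi   # keep the change only if the interval clears zero
```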
Evals Are Dressed-Up Science
This is the part I want to be unfashionably blunt about. Evals are not a new discipline. They are science with new vocabulary.
A hypothesis (“this prompt change improves faithfulness”). A manipulated variable (the prompt). A held-fixed control set (the eval scenario). A measurement (the score). A decision rule (keep, reject, run more). The whole apparatus is the scientific method dressed up in a YAML config. Replace “agent” with “drug” and “score” with “patient outcome” and you have a randomised controlled trial.
Treating evals as just science strips away the mystique and forces honesty. If I cannot articulate the hypothesis, I am not running an experiment. I am hoping. If the held-out set leaks into the prompt, I am p-hacking. If I keep the change because the demo felt better, I am running a vibes test. The discipline is old. Only the substrate is new.
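One way to force that honesty is to refuse to run anything you cannot write down first. A sketch of such a record; the field names are illustrative, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Experiment:
    hypothesis: str     # "switching the critic to Opus improves faithfulness"
    lever: str          # exactly one of: model, tools, prompt, orchestration, fine-tuning
    scenario_id: str    # the held-fixed eval set; never edited mid-experiment
    metric: str         # what "better" means, decided before the run
    decision_rule: str  # e.g. "keep if the 95% CI on the delta clears zero"
```

If any field is blank, the run is a vibes test with extra steps.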
The Diagnostic
When I am stuck in the loop, I run through the six in order of cost:
- Prompt (free, fast).
- Context (cheap, fast).
- Tools (medium, sometimes structural).
- Model (one config line, rerun).
- Orchestration (a redesign, but reversible).
- Fine-tuning (expensive, slow, last resort).
For each, I ask: am I currently holding this still? If the answer is yes for all of them, I am not stuck on the agent. I am stuck on which experiment to run next. That is a much better problem to have, because it is a problem evals can answer.
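The checklist barely needs code, which is rather the point, but as a sketch: walk the levers in cost order and surface the cheapest one you have not yet run a controlled experiment on.

```python
# Cost order as above; "tested" means a measured experiment, not a vibe check.
LEVERS_BY_COST = ["prompt", "context", "tools", "model", "orchestration", "fine-tuning"]

def next_experiment(already_tested: set[str]) -> str | None:
    for lever in LEVERS_BY_COST:
        if lever not in already_tested:
            return lever
    return None   # all six tested: the question is now which result to act on
```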

