Complexity used to be a tax on the humans who read the code. In the agent era, it is a tax on every inference, every edit, and every tool call that touches the file.
Why I Keep Coming Back to This
When I started thinking about compound engineering, I treated complexity as a style problem. Pretty code, ugly code, whatever, ship it. That stopped working the moment I handed most of my editing over to agents. A function with five nested conditionals does not just slow down the next human reader. It slows down the model that has to plan a patch, the test generator that has to enumerate branches, and the reviewer agent that has to reason about which path I broke.
Code complexity, in my operating definition, is the cost of predicting what a piece of code will do without running it. Humans pay that cost in reading time. Agents pay it in tokens, wrong guesses, and retries. Both of those compound.
The Metrics I Actually Track
I ignore most of the long metric lists static analysers emit. These are the five I care about, in order of how often they change my decisions.
1. Cyclomatic Complexity (McCabe, 1976)
Thomas McCabe introduced cyclomatic complexity in his IEEE paper A Complexity Measure. The metric counts the number of linearly independent paths through a function, which for most practical code is the number of decision points plus one. An if adds one. A for adds one. A ternary adds one. A boolean operator inside a condition adds one.
McCabe’s original threshold was 10. I use 8 as my personal line in the sand, because I want headroom for the inevitable bug fix that adds one more branch.
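To make the counting rule concrete, here is a minimal sketch of a McCabe counter built on the stdlib `ast` module. It is deliberately simplified compared with real tools like radon or mccabe: it ignores `try/except`, `match`, and comprehension `if`s, and exists only to show where the "+1"s come from.

```python
import ast

def cyclomatic(func_source: str) -> int:
    """Rough McCabe count: decision points plus one.

    Simplified sketch -- real tools also count try/except handlers,
    match cases, and ifs inside comprehensions.
    """
    tree = ast.parse(func_source)
    score = 1  # the single straight-line path through the function
    for node in ast.walk(tree):
        if isinstance(node, (ast.If, ast.For, ast.While, ast.IfExp)):
            score += 1  # each branch or loop adds one independent path
        elif isinstance(node, ast.BoolOp):
            score += len(node.values) - 1  # each and/or adds one more
    return score
```

Running it on a two-branch function with one `and` reports 3: one for the `if`, one for the `and`, plus the base path.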
2. Cognitive Complexity
Cognitive complexity, popularised by SonarSource, is the metric I wish I had been given in my first job. It penalises nesting harder than flat branching, and it charges less for structures that do not raise the reading cost much (an else branch costs less than a fresh nested if). It is a better proxy for how long it takes a human to load a function into working memory, and in my experience it also correlates well with how often agents write buggy patches.
3. Halstead Volume
Halstead volume treats a program as a bag of operators and operands and computes V = N * log2(n), where N is the total token count and n is the vocabulary size. It is crude, but it catches a failure mode cyclomatic misses: code with few branches but huge, dense expressions. A 40-line regex builder has a cyclomatic complexity of 1 and a Halstead volume through the roof, and I should still be nervous about it.
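A worked example of the formula, tallied by hand for a single assignment statement (the operator and operand counts are my own classification, not radon output):

```python
import math

# Statement: total = price + tax + fee
# Operators: =, +, +          -> N1 = 3 occurrences, n1 = 2 distinct
# Operands: total, price,
#           tax, fee          -> N2 = 4 occurrences, n2 = 4 distinct
N = 3 + 4                     # total tokens
n = 2 + 4                     # vocabulary size
volume = N * math.log2(n)     # V = N * log2(n), about 18.1 for this line
```

One line of arithmetic already has a volume of ~18; a dense 40-line expression compounds this fast even with zero branches.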
4. Maintainability Index
Maintainability index is a weighted combination of Halstead volume, cyclomatic complexity, and lines of code. I do not trust the absolute number. I trust the delta. If a PR drops the maintainability index of a file from 72 to 51, that is a conversation starter even if every individual metric is still within bounds.
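A delta gate like that is a few lines of glue. This is a hypothetical CI helper of my own, not part of radon or any tool mentioned here:

```python
def mi_gate(old_mi: float, new_mi: float, max_drop: float = 5.0) -> bool:
    """Pass a PR only if the maintainability index fell by at most max_drop.

    Hypothetical CI helper: the check is on the delta between the base
    branch and the PR branch, never on the absolute score.
    """
    return new_mi >= old_mi - max_drop
```

With the example from the text, a drop from 72 to 51 fails the gate while ordinary churn passes.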
5. Coupling and Depth of Inheritance
These are module level, not function level. High afferent coupling means a file is load bearing, so churn there is expensive. Deep inheritance means a bug can live four classes above where it surfaces. I do not have strict thresholds here. I use them as tiebreakers when deciding where to refactor first.
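Afferent coupling is easy to approximate from import statements alone. A stdlib-only sketch, with module sources passed in as strings purely for illustration:

```python
import ast
from collections import Counter

def afferent_coupling(sources: dict[str, str]) -> Counter:
    """Count how many modules import each module (afferent coupling, Ca).

    `sources` maps module name -> source text; a real pass would walk
    the repo and read files instead.
    """
    counts = Counter()
    for _name, src in sources.items():
        for node in ast.walk(ast.parse(src)):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    counts[alias.name] += 1
            elif isinstance(node, ast.ImportFrom) and node.module:
                counts[node.module] += 1
    return counts
```

A module with a high count is load bearing: every change to it radiates outward, which is why it wins tiebreakers for refactoring priority.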
A Python Example
Here is a function I have written some version of a hundred times. It is the kind of thing that grows one branch per sprint until nobody wants to touch it.
```python
def price_order(order, user, coupon, region):
    total = 0
    for item in order.items:
        if item.kind == "digital":
            if user.is_student:
                total += item.price * 0.5
            else:
                total += item.price
        elif item.kind == "physical":
            if region == "EU":
                if item.weight > 2:
                    total += item.price + 10
                else:
                    total += item.price + 4
            else:
                total += item.price + 6
        if coupon and coupon.applies_to(item):
            total -= coupon.discount
    return max(total, 0)
```
Cyclomatic complexity here is around 9. Cognitive complexity is worse because of the triple nesting inside the physical branch. A refactor that pushes each pricing rule into its own small function collapses both numbers and makes the file an easier neighbour for an agent to edit.
```python
def price_order(order, user, coupon, region):
    subtotal = sum(_price_item(item, user, region) for item in order.items)
    discount = _coupon_discount(order.items, coupon)
    return max(subtotal - discount, 0)

def _price_item(item, user, region):
    if item.kind == "digital":
        return item.price * (0.5 if user.is_student else 1.0)
    return item.price + _shipping(item, region)

def _shipping(item, region):
    if region != "EU":
        return 6
    return 10 if item.weight > 2 else 4

def _coupon_discount(items, coupon):
    if not coupon:
        return 0
    return sum(coupon.discount for item in items if coupon.applies_to(item))
```
Each helper now has a cyclomatic complexity of 2 or 3. The top level price_order reads like the spec for the feature. When I ask an agent to add a new region or a new item kind, it can patch a single helper without rereading the whole pricing rulebook.
Why This Matters More Now
Before agents, high complexity hurt at review time and at onboarding time. Both of those costs were amortised over months. Now the cost hits every single request. Every time an agent loads a file with cyclomatic 20 into context, I pay for the tokens, I pay for the planning mistakes, and I pay again when the first patch breaks a branch the model did not notice.
The upside is that the fix has also gotten cheaper. Agents are shockingly good at the mechanical part of complexity reduction, which is extracting small named functions and flattening nested conditionals. I now run a weekly pass on my hottest files, ask an agent to bring every function below a cyclomatic of 8 without changing behaviour, and review the diff. The first time I did this on a legacy module, maintainability index jumped by double digits and the next feature landed in a single prompt.
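The selection step of that weekly pass is trivial glue. A hypothetical sketch of mine, where the scores dict would come from a tool like radon's `cc_visit`:

```python
def refactor_prompt(scores, threshold=8):
    """Build the weekly agent instruction from per-function complexity scores.

    Hypothetical glue code: `scores` maps function name -> cyclomatic
    complexity, as reported by a tool such as radon.
    """
    offenders = sorted(
        ((score, name) for name, score in scores.items() if score > threshold),
        reverse=True,  # worst first
    )
    if not offenders:
        return None
    listing = ", ".join(f"{name} (CC={score})" for score, name in offenders)
    return (
        f"Refactor these functions below cyclomatic {threshold} "
        f"without changing behaviour: {listing}"
    )
```

If nothing is over budget, the pass is a no-op and no prompt is sent.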
The Python Tools I Actually Use
None of this matters if I cannot measure it on a real codebase in under a minute. These are the libraries I reach for, all installable from PyPI.
radon
Radon is the fastest way to get cyclomatic complexity, Halstead metrics, and the maintainability index for a Python project. It is what I run in CI.
```shell
pip install radon
radon cc pricing.py -a -s
radon mi pricing.py -s
radon hal pricing.py
radon raw pricing.py
```
The cc command prints a letter grade per function (A through F) and an average for the file. mi prints the maintainability index. hal prints Halstead volume, difficulty, and effort. raw gives me LOC, LLOC, and comment ratios. I wire the -n flag in CI so the build fails if any function drops below a B grade.
Radon also ships a Python API if I want to score files inside an agent tool.
```python
from radon.complexity import cc_visit
from radon.metrics import mi_visit, h_visit

with open("pricing.py") as f:
    source = f.read()

for block in cc_visit(source):
    print(block.name, block.complexity)
print("MI:", mi_visit(source, multi=True))
print("Halstead:", h_visit(source).total)
```
lizard
Lizard is language agnostic and handles Python, JavaScript, TypeScript, Go, Rust, and more in one pass. I use it when a repo is polyglot and I want a single report.
```shell
pip install lizard
lizard pricing.py -C 8 -L 60 -a 5
```
The flags set hard limits: cyclomatic complexity under 8, function length under 60 lines, arguments under 5. Anything above fails the run. Lizard is also faster than radon on very large trees because it streams files instead of parsing them into a full AST.
mccabe and ruff
mccabe is the original complexity checker, the flake8 plugin behind flake8 --max-complexity. These days I usually reach for ruff, which implements the same check under rule C901 and runs orders of magnitude faster.
```shell
pip install ruff
ruff check pricing.py --select C901 --config 'lint.mccabe.max-complexity = 8'
```
This is the check I put in pre-commit. It catches new functions that blow past the threshold before they ever hit a PR.
cognitive_complexity
Cyclomatic is not enough on its own. The cognitive_complexity package implements the SonarSource algorithm for Python and plugs into flake8.
```shell
pip install cognitive_complexity flake8-cognitive-complexity
flake8 --max-cognitive-complexity=10 pricing.py
```
I set cognitive at 10 and cyclomatic at 8. If a function trips cognitive without tripping cyclomatic, I know the problem is nesting, not branching, and I fix it by flattening rather than by extracting.
wily
Wily tracks these metrics over time by walking the git history. It is how I spot files that are quietly getting worse.
```shell
pip install wily
wily build .
wily report pricing.py
wily diff pricing.py
wily graph pricing.py loc complexity
```
wily build indexes every revision. wily report shows how complexity has moved across commits. I run this once a month on every active repo and use the output to decide which file gets the next refactor pass.
Putting It Together
My standard setup for a new Python project looks like this:
- ruff with C901 at 8 in pre-commit for the hard floor.
- radon cc and radon mi in CI, failing the build on any function below B or any file whose maintainability index drops more than 5 points in a PR.
- wily build on a weekly cron, feeding a small agent that picks the worst offender and proposes a refactor.
- lizard on the polyglot monorepos where Python is only part of the story.
That is four tools, all free, all pip installable, and together they give me a complexity budget I can actually enforce instead of a vibe I can argue about.
How I Actually Reduce Complexity
- Extract helpers until every function fits on one screen.
- Replace nested conditionals with early returns or small dispatch tables.
- Collapse boolean soup into named predicates so the conditions read like English.
- Delete dead code the same day I notice it. Dead branches still count against cyclomatic.
- Keep docstrings short and put the hard cases in tests, not comments.
- Track the delta of the metrics on each PR, not the absolute value.
None of this is novel advice. What is new is that the return on investment has moved. Clean code used to be a gift to my future self. Now it is a gift to every agent that touches the repo, and those agents run all day.
Sources
- Thomas J. McCabe, A Complexity Measure, IEEE Transactions on Software Engineering, 1976
- G. Ann Campbell, Cognitive Complexity: A New Way of Measuring Understandability, SonarSource white paper
- Maurice Halstead, Elements of Software Science, 1977

