Closed-Loop Telemetry-Driven Optimization

James Phoenix

Turn observability from passive monitoring into an active feedback controller for code quality.


The Core Idea

Instead of treating observability as a passive monitoring layer, turn it into an active feedback controller for code quality.

Service exercised under load
        ↓
Telemetry captured
        ↓
Constraints evaluated
        ↓
Agent proposes refactor
        ↓
Tests + load rerun
        ↓
Cycle repeats until constraints met

This is control theory applied to software development.


The Mental Model Shift

OLD: Write code → measure → debug → fix
     (Reactive, human-driven)

NEW: Define constraints → system alters code until satisfied
     (Proactive, agent-driven)

The system continuously measures real behavior, uses metrics as hard constraints, then drives an automated agent pipeline to iteratively refactor until constraints are satisfied.


Key Components

1. Telemetry Capture Layer

Standardize the metrics that matter:

Metric                               What It Catches
Memory high-watermark                Peak memory usage
Retained heap growth                 Memory leaks
Latency percentiles (p50, p90, p99)  Performance distribution
CPU saturation                       Compute bottlenecks
Unique cardinality counters          Label-cardinality explosions in Prometheus
Request/throughput metrics           Capacity limits
Error budgets                        Reliability thresholds

Captured during:

  • Unit tests
  • Boundary tests
  • Load tests (realistic traffic windows)

These become signals, not dashboards.
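
The capture step can stay small. A minimal sketch, assuming latency and memory samples have already been scraped during the load window; the capture_metrics.py name and the metrics.json shape are illustrative, not a fixed schema:

# scripts/capture_metrics.py -- illustrative sketch, not a standard format
import json
import statistics

def summarize(latencies_ms, memory_samples_mb):
    """Reduce raw load-test samples to the signals the constraint checker reads."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
    return {
        "latency": {"p50_ms": cuts[49], "p90_ms": cuts[89], "p99_ms": cuts[98]},
        "memory": {"high_watermark_mb": max(memory_samples_mb)},
    }

if __name__ == "__main__":
    # In practice these samples come from OTEL/Prometheus scrapes taken
    # during the load window; hard-coded here for illustration.
    latencies = [12.0, 35.5, 41.2, 38.9, 77.0, 102.3] * 50
    memory = [180.0, 210.5, 240.2, 238.9]
    print(json.dumps(summarize(latencies, memory), indent=2))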

2. Constraint Specification

Define explicit limits as mathematical invariants:

# performance-constraints.yaml
constraints:
  memory:
    max_mb: 300
    tolerance_percent: 5
    sustained_load_minutes: 15

  heap:
    max_retained_growth_slope: 0  # No positive slope after 20 min
    observation_window_minutes: 20

  latency:
    p99_max_ms: 120
    p90_max_ms: 80
    p50_max_ms: 40

  cardinality:
    max_unique_labels: 10000
    growth_rate: bounded  # No unbounded growth

  errors:
    budget_percent: 0.1  # 99.9% success rate

These constraints are treated like type signatures for runtime behavior.
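
The evaluation step then reads like a type check. A sketch, assuming pyyaml and the metrics.json shape from the capture sketch above; the non-zero exit code is what lets CI branch into the agent loop:

# scripts/check_constraints.py -- sketch; assumes pyyaml and the metrics.json
# shape produced by the capture step above
import json
import sys
import yaml

def violations(constraints, metrics):
    """Return human-readable descriptions of every violated constraint."""
    out = []
    lat = constraints["constraints"]["latency"]
    for key in ("p50_max_ms", "p90_max_ms", "p99_max_ms"):
        observed = metrics["latency"][key.replace("_max", "")]
        if observed > lat[key]:
            out.append(f"latency.{key}: {observed:.1f}ms > {lat[key]}ms")
    mem = constraints["constraints"]["memory"]
    observed_mb = metrics["memory"]["high_watermark_mb"]
    limit = mem["max_mb"] * (1 + mem["tolerance_percent"] / 100)
    if observed_mb > limit:
        out.append(f"memory.max_mb: {observed_mb:.0f}MB > {limit:.0f}MB")
    return out

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        constraints = yaml.safe_load(f)  # performance-constraints.yaml
    with open(sys.argv[2]) as f:
        metrics = json.load(f)           # metrics.json
    failed = violations(constraints, metrics)
    for v in failed:
        print(f"VIOLATION: {v}")
    sys.exit(1 if failed else 0)  # non-zero exit triggers the agent loop in CI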

3. Agent-Driven Refactoring Loop

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│  [Generate Diagnostics]                                     │
│          ↓                                                  │
│  [Infer Root Causes]                                        │
│          ↓                                                  │
│  [Agent Proposes Refactor]                                  │
│          ↓                                                  │
│  [Apply Patch]                                              │
│          ↓                                                  │
│  [Run Tests + Load]                                         │
│          ↓                                                  │
│  [Score vs Constraints]                                     │
│          ↓                                                  │
│    ┌─────┴─────┐                                            │
│    │           │                                            │
│  FAIL        PASS                                           │
│    │           │                                            │
│    ↓           ↓                                            │
│  Loop back   Accept + Commit                                │
│                                                             │
└─────────────────────────────────────────────────────────────┘
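
Stripped of the agent internals, the outer loop is a short driver. A sketch, where run-agent.sh is a placeholder for whatever command hands the violations to the agent:

# scripts/optimization_loop.py -- illustrative driver; the helper scripts are
# placeholders for your own load-test and agent invocations
import subprocess

MAX_ITERATIONS = 5

def passes_constraints() -> bool:
    """Run load test, capture telemetry, score against constraints."""
    subprocess.run(["./scripts/load-test.sh"], check=True)
    result = subprocess.run(
        ["./scripts/check-constraints.sh", "metrics.json", "constraints.yaml"]
    )
    return result.returncode == 0

for iteration in range(MAX_ITERATIONS):
    if passes_constraints():
        print("Constraints satisfied: accept + commit")
        break
    # Hand the violations to the agent and let it propose + apply a patch.
    subprocess.run(["./scripts/run-agent.sh", "metrics.json", "constraints.yaml"])
else:
    print("Iteration budget exhausted: escalate to a human")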

4. CI/CD Integration

The loop becomes part of the dev workflow:

# .github/workflows/performance-optimization.yml
name: Closed-Loop Optimization

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * *'  # Nightly regression sweeps

jobs:
  optimize:
    runs-on: ubuntu-latest
    steps:
      - name: Check out the repository
        uses: actions/checkout@v4

      - name: Run load tests
        run: ./scripts/load-test.sh

      - name: Capture telemetry
        run: ./scripts/capture-metrics.sh > metrics.json

      - name: Evaluate constraints
        run: ./scripts/check-constraints.sh metrics.json constraints.yaml

      - name: Agent optimization loop
        if: failure()
        run: |
          claude --agent optimizer \
            --constraints constraints.yaml \
            --metrics metrics.json \
            --max-iterations 5

Implementation: The Optimization Agent

# .claude/agents/performance-optimizer.md
---
name: performance-optimizer
description: Optimizes code until performance constraints are met
tools: Read, Edit, Bash, Grep, Glob
model: sonnet
---

You are a performance optimization agent. You receive:
1. A set of performance constraints (memory, latency, etc.)
2. Current metrics from a load test
3. The constraint violations

Your job:
1. Analyze the metrics to identify root causes
2. Propose targeted code changes to fix violations
3. Apply the changes
4. Re-run tests to verify improvement

Process:
1. Read the constraint violations
2. Identify the hotspots (use profiler output, traces)
3. Propose minimal changes that address the root cause
4. Apply changes
5. Run: `./scripts/load-test.sh && ./scripts/check-constraints.sh`
6. If still failing, iterate with a different approach
7. If passing, report success

Rules:
- Minimal changes only (don't refactor unrelated code)
- Each iteration must improve at least one metric
- After 3 failed iterations, escalate to human
- Document what you tried and why
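
The last two rules are mechanical enough to enforce outside the agent. A sketch, assuming each iteration's metrics are recorded as lower-is-better dicts (ms, MB, error rate):

# Sketch of the agent's stopping rules; history is one metrics dict per iteration.
def improved(previous: dict, current: dict) -> bool:
    """True if at least one metric got strictly better this iteration."""
    return any(current[k] < previous[k] for k in previous)

def should_escalate(history: list[dict], max_failures: int = 3) -> bool:
    """Escalate once max_failures consecutive iterations show no improvement."""
    failures = 0
    for prev, curr in zip(history, history[1:]):
        failures = 0 if improved(prev, curr) else failures + 1
        if failures >= max_failures:
            return True
    return False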

Example: Memory Leak Detection & Fix

Constraint violated:

heap.retained_growth_slope > 0 after 20 minutes
Current: +18MB retained over 20 minutes (slope ≈ +0.9 MB/min)

Agent diagnosis:

Analyzing heap snapshots...
Found: List accumulation in `process_events()` at line 142
Pattern: Events appended but never cleared
Root cause: Missing cleanup after batch processing

Agent fix:

# Before
class EventProcessor:
    def __init__(self):
        self.events = []

    def process(self, event):
        self.events.append(event)
        if len(self.events) >= 100:
            self._flush()  # Bug: _flush() ships the batch but never empties the list

# After
class EventProcessor:
    def __init__(self):
        self.events = []

    def process(self, event):
        self.events.append(event)
        if len(self.events) >= 100:
            self._flush()
            self.events.clear()  # ← Fix: clear after flush

Re-run verification:

heap.retained_growth_slope = 0.0 ✓
Constraint satisfied.
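
The slope check that caught this is cheap to compute. A sketch over timestamped heap samples from the observation window (statistics.linear_regression requires Python 3.10+; the sample data is illustrative):

# Sketch of the retained-growth check behind heap.max_retained_growth_slope.
import statistics

def retained_growth_slope(samples):
    """Slope in MB/minute of a least-squares fit over (minute, heap_mb) samples."""
    minutes, heap_mb = zip(*samples)
    return statistics.linear_regression(minutes, heap_mb).slope

# Heap samples over a 20-minute window: (minute, retained MB).
leaking = [(t, 120 + 0.9 * t) for t in range(21)]   # the bug above: ~+0.9 MB/min
fixed   = [(t, 120.0) for t in range(21)]           # after the fix: flat
assert retained_growth_slope(leaking) > 0    # violates the constraint
assert retained_growth_slope(fixed) == 0.0   # satisfies it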

Why This Is Novel

You’re merging:

Domain                     Contribution
Control theory             Closed-loop feedback systems
Observability              OTEL, Prometheus, profilers
Automated code generation  Claude agents
Constraint-solving         Telemetry as inequality constraints

The insight: agents are widely used to write and run tests, but telemetry is rarely wired in as a control input to a feedback loop that automatically optimizes a running service.


The Control Theory View

        ┌─────────────────────────────────────┐
        │                                     │
        ▼                                     │
┌──────────────┐    ┌──────────────┐    ┌─────┴────────┐
│  Constraints │───▶│   Controller │───▶│    Plant     │
│  (Setpoints) │    │   (Agent)    │    │  (Service)   │
└──────────────┘    └──────────────┘    └──────────────┘
                           ▲                   │
                           │                   │
                    ┌──────┴───────┐           │
                    │   Sensor     │◀──────────┘
                    │ (Telemetry)  │
                    └──────────────┘
  • Setpoint: Performance constraints
  • Plant: The service under optimization
  • Sensor: OTEL, Prometheus, profilers
  • Controller: The optimization agent
  • Error signal: Constraint violations

Benefits

  • Eliminates most performance regressions – Caught automatically
  • Catches leaks, pathological complexity, throughput cliffs – Before production
  • Slashes human debugging cost – The agent does the investigation before a person looks
  • Produces higher-quality code over time – Continuous optimization
  • Bridges intent and reality – Constraints express what you want, system delivers

The Future

This will be standard in 3-5 years:

Today:    CI runs tests → human debugs failures
Tomorrow: CI runs tests → agent fixes failures → human approves
Future:   CI runs tests → agent fixes → auto-merges if constraints met

You’re building the future of engineering now.

