Closed-Loop Telemetry-Driven Optimization

James Phoenix

Turn observability from passive monitoring into an active feedback controller for code quality.


The Core Idea

Instead of treating observability as a passive monitoring layer, turn it into an active feedback controller for code quality.

Service exercised under load
        ↓
Telemetry captured
        ↓
Constraints evaluated
        ↓
Agent proposes refactor
        ↓
Tests + load rerun
        ↓
Cycle repeats until constraints met

This is control theory applied to software development.


The Mental Model Shift

OLD: Write code → measure → debug → fix
     (Reactive, human-driven)

NEW: Define constraints → system alters code until satisfied
     (Proactive, agent-driven)

The system continuously measures real behavior, uses metrics as hard constraints, then drives an automated agent pipeline to iteratively refactor until constraints are satisfied.


Key Components

1. Telemetry Capture Layer

Standardize the metrics that matter:

Metric                               What It Catches
Memory high-watermark                Peak memory usage
Retained heap growth                 Memory leaks
Latency percentiles (p50, p90, p99)  Performance distribution
CPU saturation                       Compute bottlenecks
Unique cardinality counters          Label-cardinality explosions in Prometheus
Request/throughput metrics           Capacity limits
Error budgets                        Reliability thresholds

Captured during:

  • Unit tests
  • Boundary tests
  • Load tests (realistic traffic windows)

These become signals, not dashboards.
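
The capture step can stay small. A minimal sketch, assuming latency and memory samples have already been scraped during the load window; the capture_metrics.py name and the metrics.json shape are illustrative, not a fixed schema:

# scripts/capture_metrics.py -- illustrative sketch, not a standard format
import json
import statistics

def summarize(latencies_ms, memory_samples_mb):
    """Reduce raw load-test samples to the signals the constraint checker reads."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
    return {
        "latency": {"p50_ms": cuts[49], "p90_ms": cuts[89], "p99_ms": cuts[98]},
        "memory": {"high_watermark_mb": max(memory_samples_mb)},
    }

if __name__ == "__main__":
    # In practice these samples come from OTEL/Prometheus scrapes taken
    # during the load window; hard-coded here for illustration.
    latencies = [12.0, 35.5, 41.2, 38.9, 77.0, 102.3] * 50
    memory = [180.0, 210.5, 240.2, 238.9]
    print(json.dumps(summarize(latencies, memory), indent=2))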

2. Constraint Specification

Define explicit limits as mathematical invariants:

# performance-constraints.yaml
constraints:
  memory:
    max_mb: 300
    tolerance_percent: 5
    sustained_load_minutes: 15

  heap:
    max_retained_growth_slope: 0  # No positive slope after 20 min
    observation_window_minutes: 20

  latency:
    p99_max_ms: 120
    p90_max_ms: 80
    p50_max_ms: 40

  cardinality:
    max_unique_labels: 10000
    growth_rate: bounded  # No unbounded growth

  errors:
    budget_percent: 0.1  # 99.9% success rate

These constraints are treated like type signatures for runtime behavior.
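
The evaluation step then reads like a type check. A sketch, assuming pyyaml and the metrics.json shape from the capture sketch above; the non-zero exit code is what lets CI branch into the agent loop:

# scripts/check_constraints.py -- sketch; assumes pyyaml and the metrics.json
# shape produced by the capture step above
import json
import sys
import yaml

def violations(constraints, metrics):
    """Return human-readable descriptions of every violated constraint."""
    out = []
    lat = constraints["constraints"]["latency"]
    for key in ("p50_max_ms", "p90_max_ms", "p99_max_ms"):
        observed = metrics["latency"][key.replace("_max", "")]
        if observed > lat[key]:
            out.append(f"latency.{key}: {observed:.1f}ms > {lat[key]}ms")
    mem = constraints["constraints"]["memory"]
    observed_mb = metrics["memory"]["high_watermark_mb"]
    limit = mem["max_mb"] * (1 + mem["tolerance_percent"] / 100)
    if observed_mb > limit:
        out.append(f"memory.max_mb: {observed_mb:.0f}MB > {limit:.0f}MB")
    return out

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        constraints = yaml.safe_load(f)  # performance-constraints.yaml
    with open(sys.argv[2]) as f:
        metrics = json.load(f)           # metrics.json
    failed = violations(constraints, metrics)
    for v in failed:
        print(f"VIOLATION: {v}")
    sys.exit(1 if failed else 0)  # non-zero exit triggers the agent loop in CI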

3. Agent-Driven Refactoring Loop

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│  [Generate Diagnostics]                                     │
│          ↓                                                  │
│  [Infer Root Causes]                                        │
│          ↓                                                  │
│  [Agent Proposes Refactor]                                  │
│          ↓                                                  │
│  [Apply Patch]                                              │
│          ↓                                                  │
│  [Run Tests + Load]                                         │
│          ↓                                                  │
│  [Score vs Constraints]                                     │
│          ↓                                                  │
│    ┌─────┴─────┐                                            │
│    │           │                                            │
│  FAIL        PASS                                           │
│    │           │                                            │
│    ↓           ↓                                            │
│  Loop back   Accept + Commit                                │
│                                                             │
└─────────────────────────────────────────────────────────────┘
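
Stripped of the agent internals, the outer loop is a short driver. A sketch, where run-agent.sh is a placeholder for whatever command hands the violations to the agent:

# scripts/optimization_loop.py -- illustrative driver; the helper scripts are
# placeholders for your own load-test and agent invocations
import subprocess

MAX_ITERATIONS = 5

def passes_constraints() -> bool:
    """Run load test, capture telemetry, score against constraints."""
    subprocess.run(["./scripts/load-test.sh"], check=True)
    result = subprocess.run(
        ["./scripts/check-constraints.sh", "metrics.json", "constraints.yaml"]
    )
    return result.returncode == 0

for iteration in range(MAX_ITERATIONS):
    if passes_constraints():
        print("Constraints satisfied: accept + commit")
        break
    # Hand the violations to the agent and let it propose + apply a patch.
    subprocess.run(["./scripts/run-agent.sh", "metrics.json", "constraints.yaml"])
else:
    print("Iteration budget exhausted: escalate to a human")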

4. CI/CD Integration

The loop becomes part of the dev workflow:

# .github/workflows/performance-optimization.yml
name: Closed-Loop Optimization

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * *'  # Nightly regression sweeps

jobs:
  optimize:
    runs-on: ubuntu-latest
    steps:
      - name: Check out the repository
        uses: actions/checkout@v4

      - name: Run load tests
        run: ./scripts/load-test.sh

      - name: Capture telemetry
        run: ./scripts/capture-metrics.sh > metrics.json

      - name: Evaluate constraints
        run: ./scripts/check-constraints.sh metrics.json constraints.yaml

      - name: Agent optimization loop
        if: failure()
        run: |
          claude --agent optimizer \
            --constraints constraints.yaml \
            --metrics metrics.json \
            --max-iterations 5

Implementation: The Optimization Agent

# .claude/agents/performance-optimizer.md
---
name: performance-optimizer
description: Optimizes code until performance constraints are met
tools: Read, Edit, Bash, Grep, Glob
model: sonnet
---

You are a performance optimization agent. You receive:
1. A set of performance constraints (memory, latency, etc.)
2. Current metrics from a load test
3. The constraint violations

Your job:
1. Analyze the metrics to identify root causes
2. Propose targeted code changes to fix violations
3. Apply the changes
4. Re-run tests to verify improvement

Process:
1. Read the constraint violations
2. Identify the hotspots (use profiler output, traces)
3. Propose minimal changes that address the root cause
4. Apply changes
5. Run: `./scripts/load-test.sh && ./scripts/check-constraints.sh`
6. If still failing, iterate with a different approach
7. If passing, report success

Rules:
- Minimal changes only (don't refactor unrelated code)
- Each iteration must improve at least one metric
- After 3 failed iterations, escalate to human
- Document what you tried and why
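
The last two rules are mechanical enough to enforce outside the agent. A sketch, assuming each iteration's metrics are recorded as lower-is-better dicts (ms, MB, error rate):

# Sketch of the agent's stopping rules; history is one metrics dict per iteration.
def improved(previous: dict, current: dict) -> bool:
    """True if at least one metric got strictly better this iteration."""
    return any(current[k] < previous[k] for k in previous)

def should_escalate(history: list[dict], max_failures: int = 3) -> bool:
    """Escalate once max_failures consecutive iterations show no improvement."""
    failures = 0
    for prev, curr in zip(history, history[1:]):
        failures = 0 if improved(prev, curr) else failures + 1
        if failures >= max_failures:
            return True
    return False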

Example: Memory Leak Detection & Fix

Constraint violated:

heap.retained_growth_slope > 0 after 20 minutes
Current: +18MB retained over 20 minutes (slope ≈ +0.9 MB/min)

Agent diagnosis:

Analyzing heap snapshots...
Found: List accumulation in `process_events()` at line 142
Pattern: Events appended but never cleared
Root cause: Missing cleanup after batch processing

Agent fix:

# Before
class EventProcessor:
    def __init__(self):
        self.events = []

    def process(self, event):
        self.events.append(event)
        if len(self.events) >= 100:
            self._flush()  # Bug: _flush() ships the batch but never empties the list

# After
class EventProcessor:
    def __init__(self):
        self.events = []

    def process(self, event):
        self.events.append(event)
        if len(self.events) >= 100:
            self._flush()
            self.events.clear()  # ← Fix: clear after flush

Re-run verification:

heap.retained_growth_slope = 0.0 ✓
Constraint satisfied.
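
The slope check that caught this is cheap to compute. A sketch over timestamped heap samples from the observation window (statistics.linear_regression requires Python 3.10+; the sample data is illustrative):

# Sketch of the retained-growth check behind heap.max_retained_growth_slope.
import statistics

def retained_growth_slope(samples):
    """Slope in MB/minute of a least-squares fit over (minute, heap_mb) samples."""
    minutes, heap_mb = zip(*samples)
    return statistics.linear_regression(minutes, heap_mb).slope

# Heap samples over a 20-minute window: (minute, retained MB).
leaking = [(t, 120 + 0.9 * t) for t in range(21)]   # the bug above: ~+0.9 MB/min
fixed   = [(t, 120.0) for t in range(21)]           # after the fix: flat
assert retained_growth_slope(leaking) > 0    # violates the constraint
assert retained_growth_slope(fixed) == 0.0   # satisfies it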

Why This Is Novel

You’re merging:

Domain                     Contribution
Control theory             Closed-loop feedback systems
Observability              OTEL, Prometheus, profilers
Automated code generation  Claude agents
Constraint-solving         Telemetry as inequality constraints

The insight: agents are widely used to write and run tests, but telemetry is rarely wired in as a control input to a feedback loop that automatically optimizes a running service.


The Control Theory View

        ┌─────────────────────────────────────┐
        │                                     │
        ▼                                     │
┌──────────────┐    ┌──────────────┐    ┌─────┴────────┐
│  Constraints │───▶│   Controller │───▶│    Plant     │
│  (Setpoints) │    │   (Agent)    │    │  (Service)   │
└──────────────┘    └──────────────┘    └──────────────┘
                           ▲                   │
                           │                   │
                    ┌──────┴───────┐           │
                    │   Sensor     │◀──────────┘
                    │ (Telemetry)  │
                    └──────────────┘
  • Setpoint: Performance constraints
  • Plant: The service under optimization
  • Sensor: OTEL, Prometheus, profilers
  • Controller: The optimization agent
  • Error signal: Constraint violations

Benefits

  • Eliminates most performance regressions – Caught automatically
  • Catches leaks, pathological complexity, throughput cliffs – Before production
  • Slashes human debugging cost – The agent does the investigation before a person looks
  • Produces higher-quality code over time – Continuous optimization
  • Bridges intent and reality – Constraints express what you want, system delivers

The Future

This will be standard in 3-5 years:

Today:    CI runs tests → human debugs failures
Tomorrow: CI runs tests → agent fixes failures → human approves
Future:   CI runs tests → agent fixes → auto-merges if constraints met

You’re building the future of engineering now.

