Summary
When code changes, the observability surface should change with it. Instead of hand-writing monitors, an agent reads the PR diff on merge and generates monitors that instrument the new code. When a monitor fires, a second agent triages the alert, reproduces the issue in a sandbox, and proposes a fix. The result is a closed loop from code change to production fix with no human intervention until the PR review step.
The Pattern
PR merged
↓
Agent reads diff → generates monitors (Datadog, etc.)
↓
Monitor fires on production anomaly
↓
Agent triages: real issue or noise?
├── Noise → tune or delete the monitor
└── Real → reproduce in sandbox → push fix PR → notify team
The key insight: monitors are generated artifacts of code changes, not hand-written infrastructure. They track the exact shape of the code. When that shape drifts from expected behavior, the system detects it.
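A minimal sketch of the generation step. Everything here is illustrative, not Ramp's implementation: the toy diff heuristic stands in for an LLM call, and the payload shape mimics a Datadog-style monitor (type, query, name, message, tags).

```python
import json

def monitors_from_diff(diff: str) -> list[dict]:
    """Turn a merged PR diff into Datadog-style monitor payloads.

    A real implementation would prompt a model with the diff and the
    service's telemetry conventions, then validate the model's output.
    """
    monitors = []
    # Toy heuristic standing in for the LLM: one error-rate monitor per
    # new function the diff introduces.
    for line in diff.splitlines():
        if line.startswith("+def "):
            fn = line.removeprefix("+def ").split("(")[0]
            monitors.append({
                "type": "metric alert",
                "name": f"[auto] error rate: {fn}",
                "query": f"sum(last_10m):sum:app.errors{{function:{fn}}}.as_count() > 5",
                "message": f"Auto-generated from PR diff; instruments {fn}. @slack-oncall",
                "tags": ["generated:true", f"function:{fn}"],
            })
    return monitors

diff = "+def export_sheet(sheet_id):\n+    ...\n"
print(json.dumps(monitors_from_diff(diff), indent=2))
```

Because each payload is tagged `generated:true`, hand-written and generated monitors stay distinguishable, which the triage and dedup steps below rely on.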
Monitor Density as a Coverage Metric
Ramp Labs scaled to ~1,000 monitors for their Sheets product, roughly one monitor per 75 lines of code. Their hand-written monitors were broad (frontend crashes, API timeouts). The generated ones are granular, covering specific code paths and error conditions.
This reframes observability the same way teams think about test coverage: not “do we have monitoring?” but “what percentage of our code surface is monitored?”
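Framed as a number, the metric is straightforward. The cap at 1.0 and the counting method here are assumptions for illustration; only the ~1,000 monitors and one-per-75-LOC ratio come from the source.

```python
def monitor_coverage(monitor_count: int, lines_of_code: int,
                     target_loc_per_monitor: int = 75) -> float:
    """Fraction of the code surface monitored, treating one monitor per
    `target_loc_per_monitor` lines as full coverage (capped at 1.0)."""
    needed = max(1, lines_of_code // target_loc_per_monitor)
    return min(1.0, monitor_count / needed)

# Ramp's reported ratio implies roughly 75,000 LOC behind ~1,000 monitors.
print(monitor_coverage(1000, 75_000))  # → 1.0
print(monitor_coverage(200, 75_000))   # → 0.2
```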
Why Unfocused QA Agents Fail
Ramp’s first attempt was a nightly QA agent with no specific mission. It found bugs, but it always walked the same paths, read the same files, and investigated the same features. The failure mode:
Current models cannot synthesize a large codebase with a large observability surface and determine what needs attention. Prioritization at production scale requires focused, narrow signals, not broad exploration.
Monitor-driven maintenance solves this by giving the agent a specific alert with specific context. The agent does not need to figure out what to look at. The monitor tells it.
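The contrast with the unfocused QA agent can be made concrete: the triage agent's context is assembled entirely from the alert that fired. The `Alert` shape and `triage_prompt` helper below are hypothetical, a sketch of what "the monitor tells it" might look like in code.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    monitor_name: str
    query: str           # the exact signal that fired
    tags: list[str]      # e.g. ["function:export_sheet", "generated:true"]
    description: str     # monitor body, including any prior fix links

def triage_prompt(alert: Alert) -> str:
    """Build the focused context handed to the triage agent.

    The point of the pattern: the agent starts from one narrow signal,
    not from "explore the codebase and find problems".
    """
    scope = ", ".join(t for t in alert.tags if not t.startswith("generated"))
    return (
        f"Monitor '{alert.monitor_name}' fired on: {alert.query}\n"
        f"Scope: {scope}\n"
        f"Monitor notes: {alert.description}\n"
        "First decide: real issue or noise? If noise, propose a threshold "
        "change or deletion. If real, reproduce in the sandbox."
    )
```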
Noise Filtering
Auto-generated monitors often start with bad thresholds, so routine user activity triggers false positives. Two mechanisms handle this:
- Triage step. On every alert, the agent first assesses whether the issue is real before acting. If noise, it tunes or deletes the monitor. This means the monitor set improves over time.
- State on the monitor itself. When an agent pushes a fix, it appends the PR link to the monitor description. Subsequent agents see the link and stand down. Simple deduplication without a separate coordination layer.
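One way to implement the monitor-as-state dedup, assuming fix PRs live on GitHub; the regex and helper names are illustrative, not from the source.

```python
import re

# Matches a GitHub pull-request URL embedded in a monitor description.
PR_LINK = re.compile(r"https://github\.com/\S+/pull/\d+")

def should_stand_down(monitor_description: str) -> bool:
    """A fix PR link in the description means another agent already
    acted on this alert; skip it rather than duplicate the work."""
    return bool(PR_LINK.search(monitor_description))

def record_fix(monitor_description: str, pr_url: str) -> str:
    """Append the fix PR link so subsequent agents see it and stand down."""
    return f"{monitor_description}\n\nFix in flight: {pr_url}"
```

Because the state lives on the monitor itself, deduplication needs no queue, lock service, or shared database: any agent that reads the monitor sees the same signal.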
Relationship to Existing Patterns
This pattern is adjacent to but distinct from several other approaches:
- Closed-Loop Telemetry-Driven Optimization focuses on performance refactoring. This pattern focuses on incident detection and response.
- Error Registry for Agents is an agent-facing knowledge store for errors already encountered. This pattern generates the detection infrastructure itself.
- CI/CD Agent Patterns covers agent verification in CI. This pattern extends CI to generate production monitors on merge.
The combination is powerful: monitor generation catches new issues, the error registry prevents recurrence, and CI verification ensures the fix is sound.
Limitations
- Monitor quality depends on model understanding of the diff. Subtle invariants will be missed.
- Datadog (or equivalent) costs scale linearly with monitor count.
- Auto-generated monitors are opaque. Keep hand-written monitors for critical paths as a safety net.
- The “one monitor per 75 LOC” ratio is empirical from one team. The right density will vary by codebase.
Related
- Conversational Code Review applies the same “diff as seed for impact analysis” principle pre-merge, for interactive review of large PRs.
Source
Ramp Labs, “How we made Ramp Sheets self-maintaining” (March 2026). The article documents their progression from nightly QA agents to monitor-driven maintenance, including the noise filtering mechanisms and the empirical finding that Opus 4.6 outperformed GPT-5 at triage evaluation.