Histogram Metrics in Batch Workloads

James Phoenix

The Problem: Phantom Latency Spikes

When analyzing Prometheus histogram metrics (p50, p95, p99) for batch/nightly jobs, you’ll often see latency “spike” after the workload completes. This is a statistical artifact, not a real performance degradation.

Why This Happens

Percentiles derived from Prometheus histograms are computed over a sliding time window (typically 5-15 minutes), so they describe only the observations recorded inside that window. When throughput drops to zero:

  1. No new fast requests arrive to “dilute” the histogram
  2. The window becomes dominated by slow stragglers from the batch tail
  3. Percentiles drift upward toward the slowest observations

During high throughput:              After throughput drops:

100 samples in window:               4 samples remaining:
[30s, 45s, 50s, 55s, 60s,           [90s, 120s, 180s, 240s]
 65s, 70s, 90s, 120s, 180s]

p95 = ~2 min (reasonable)           p95 = ~4 min (!)
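
The effect in the diagram can be reproduced with a few lines of Python. This is a minimal sketch using a nearest-rank percentile over raw samples, not the bucket-based estimate Prometheus performs; the 96 fast durations are invented for illustration, and only the four stragglers come from the diagram.

```python
from math import ceil

def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n) in sorted order."""
    ordered = sorted(samples)
    rank = max(1, ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Durations in seconds. The stragglers match the diagram; the fast jobs are assumed.
stragglers = [90, 120, 180, 240]
fast_jobs = [30 + (i % 40) for i in range(96)]  # 96 hypothetical jobs, 30-69 s

busy_window = fast_jobs + stragglers   # 100 samples while the batch is running
drained_window = stragglers            # fast jobs have aged out of the window

p95_busy = percentile(busy_window, 95)
p95_drained = percentile(drained_window, 95)
print(f"busy p95: {p95_busy}s, drained p95: {p95_drained}s")
```

The slowest sample is identical in both windows. Only its rank changes: with 100 samples it sits above the 95th percentile, and with 4 samples it is the 95th percentile.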

Key Insight

High throughput masks tail latency. Low throughput exposes it.

Neither reading is wrong; this is how percentile math works with finite samples. The slow requests were always there, just hidden by the volume of faster ones.

Rules for Reading Batch Metrics

        ┌─────────────────┐
        │  TRUST THIS     │ ← High throughput, statistically significant
        │  (100+ jobs/min)│
────────┘                 └────────────────
   ↑                              ↑
Ramp-up                      Ramp-down
(noisy)                      (misleading)

  1. Read metrics during steady-state load – not ramp-up or ramp-down
  2. p50/p95/p99 need volume – they are only meaningful when throughput > ~10-20 samples/window
  3. Zero throughput = stale data – the histogram shows old observations until the window expires
  4. Post-batch “spikes” are cosmetic – they reveal the true tail, not new problems
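
Rules 2 and 3 can be enforced mechanically by refusing to report a percentile when the window is too thin. The sketch below is one possible guard, assuming raw durations are available; `MIN_SAMPLES` and `trusted_percentile` are hypothetical names, and the threshold of 20 is simply the upper end of the range quoted above.

```python
from math import ceil

MIN_SAMPLES = 20  # assumed cutoff, taken from "~10-20 samples/window"

def trusted_percentile(samples, pct, min_samples=MIN_SAMPLES):
    """Return the nearest-rank percentile, or None if the window has too few samples."""
    if len(samples) < min_samples:
        return None  # too little volume: treat the reading as unreliable
    ordered = sorted(samples)
    rank = max(1, ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

steady_state = list(range(1, 101))   # 100 hypothetical durations, 1-100 s
post_batch = [90, 120, 180, 240]     # only stragglers left in the window

print(trusted_percentile(steady_state, 95))
print(trusted_percentile(post_batch, 95))
```

Returning `None` rather than a number keeps a dashboard or alert from treating a four-sample p95 as comparable to a steady-state one.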

Practical Application

When evaluating pipeline health for a nightly batch job:

Phase          Throughput     Metrics Reliability
Ramp-up        Increasing     Noisy, ignore
Steady-state   Stable, high   Trust these
Ramp-down      Decreasing     Misleading
Post-batch     Zero           Stale/artifact
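
The phases can be labelled automatically from a throughput series. The following is a rough sketch rather than a tuned detector: `classify_phase` is a hypothetical helper, the 10% tolerance is an arbitrary assumption, and it does not check that a stable rate is also high.

```python
# Reliability labels as given for each phase.
RELIABILITY = {
    "ramp-up": "Noisy, ignore",
    "steady-state": "Trust these",
    "ramp-down": "Misleading",
    "post-batch": "Stale/artifact",
}

def classify_phase(prev_rate, curr_rate, tolerance=0.10):
    """Label a window by how throughput (jobs/min) changed since the previous window."""
    if curr_rate == 0:
        return "post-batch"  # also matches an idle period before the batch starts
    if prev_rate == 0:
        return "ramp-up"
    change = (curr_rate - prev_rate) / prev_rate
    if change > tolerance:
        return "ramp-up"
    if change < -tolerance:
        return "ramp-down"
    return "steady-state"

rates = [0, 40, 100, 102, 98, 30, 0]  # hypothetical jobs/min, one value per window
phases = [classify_phase(prev, curr) for prev, curr in zip(rates, rates[1:])]
for phase in phases:
    print(f"{phase:13s} {RELIABILITY[phase]}")
```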

Example

A batch job runs 22:24-22:26 with ~100 jobs/min, then throughput drops to zero.

  • 22:24-22:26: p95 = 2 min (real performance)
  • 22:30+: p95 = 4 min (histogram artifact from emptying window)

The pipeline was healthy. The “spike” at 22:30 is just the histogram exposing previously hidden tail latency from the batch.
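
The same two readings can be approximated from bucket counts, which is closer to how a histogram-based percentile is estimated. This is a simplified sketch of linear interpolation inside cumulative buckets, not a faithful reimplementation of Prometheus. The bucket bounds and counts are assumptions chosen to resemble the example, standing in for the per-window increases a real query would compute.

```python
INF = float("inf")

def estimate_quantile(q, buckets):
    """Estimate a quantile (0 < q <= 1) from cumulative (upper_bound_s, count) buckets.

    Simplified: find the bucket containing the target rank and interpolate
    linearly inside it, treating the lowest bucket as starting at 0.
    """
    total = buckets[-1][1]
    if total == 0:
        return None  # empty window: nothing to estimate
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == INF:
                return prev_bound  # rank falls past the last finite bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return None

# Hypothetical cumulative counts per bucket (upper bounds in seconds).
busy = [(60.0, 80), (120.0, 96), (240.0, 100), (INF, 100)]   # 22:24-22:26
drained = [(60.0, 0), (120.0, 2), (240.0, 4), (INF, 4)]      # 22:30+

p95_busy = estimate_quantile(0.95, busy)
p95_drained = estimate_quantile(0.95, drained)
print(f"busy p95 ~ {p95_busy:.0f}s, drained p95 ~ {p95_drained:.0f}s")
```

Under these assumed counts the estimate is about 116 s during the batch and about 228 s afterwards, in line with the roughly 2 minute and 4 minute readings above. The four observations in the 120-240 s bucket are present in both windows; only the total they are ranked against changes.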
