Histogram Metrics in Batch Workloads

James Phoenix

The Problem: Phantom Latency Spikes

When analyzing Prometheus histogram metrics (p50, p95, p99) for batch/nightly jobs, you’ll often see latency “spike” after the workload completes. This is a statistical artifact, not a real performance degradation.

Why This Happens

Percentiles from Prometheus histograms are computed over a sliding time window (typically 5-15 minutes), so each value reflects only the observations recorded inside that window. When throughput drops to zero:

  1. No new fast requests arrive to “dilute” the histogram
  2. The window becomes dominated by slow stragglers from the batch tail
  3. Percentiles drift upward toward the slowest observations

During high throughput:              After throughput drops:

100 samples in window:               4 samples remaining:
[30s, 45s, 50s, 55s, 60s,           [90s, 120s, 180s, 240s]
 65s, 70s, 90s, 120s, 180s]

p95 = ~2 min (reasonable)           p95 = ~4 min (!)
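
The comparison above can be reproduced with a few lines of Python. This is a minimal sketch, assuming exact sample values and a nearest-rank percentile; Prometheus itself estimates quantiles by interpolating within histogram buckets, so real numbers will differ. The latencies and the names `percentile`, `busy_window`, and `drained_window` are hypothetical, chosen only to mirror the diagram.

```python
def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, -(-q * len(ordered) // 100))  # integer ceiling of q/100 * n
    return ordered[rank - 1]

# Hypothetical latencies in seconds (illustrative, not measured data).
fast_jobs = [30 + (i % 41) for i in range(84)]      # 84 jobs between 30s and 70s
slower_jobs = [90] * 8 + [120] * 4                   # 12 slower but unremarkable jobs
stragglers = [90, 120, 180, 240]                     # the batch tail

busy_window = fast_jobs + slower_jobs + stragglers   # 100 samples during high throughput
drained_window = stragglers                          # 4 samples after throughput drops

p95_busy = percentile(busy_window, 95)        # 120 seconds, about 2 min
p95_drained = percentile(drained_window, 95)  # 240 seconds, about 4 min
```

The four stragglers are present in both windows. Only the number of faster samples around them changes, and that alone moves p95 from 2 minutes to 4.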

Key Insight

High throughput masks tail latency. Low throughput exposes it.

Neither reading is wrong – it’s how percentile math works with finite samples. The slow requests were always there, just hidden by the volume of faster ones.

Rules for Reading Batch Metrics

        ┌─────────────────┐
        │  TRUST THIS     │ ← High throughput, statistically significant
        │  (100+ jobs/min)│
────────┘                 └────────────────
   ↑                              ↑
Ramp-up                      Ramp-down
(noisy)                      (misleading)

  1. Read metrics during steady-state load – not ramp-up or ramp-down
  2. p50/p95/p99 need volume – only meaningful with more than roughly 10-20 samples per window
  3. Zero throughput = stale data – histogram shows old observations until window expires
  4. Post-batch “spikes” are cosmetic – they reveal true tail, not new problems
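
Rule 2 can be enforced mechanically by refusing to report a percentile when the window holds too few samples. The sketch below is one possible way to do that, not an established API: `gated_percentile` and `MIN_SAMPLES` are hypothetical names, and the threshold of 20 is an assumption taken from the upper end of the 10-20 guideline.

```python
from typing import Optional, Sequence

MIN_SAMPLES = 20  # assumed threshold; tune it for your own workload

def gated_percentile(samples: Sequence[float], q: int,
                     min_samples: int = MIN_SAMPLES) -> Optional[float]:
    """Return the nearest-rank q-th percentile, or None if the window is too thin to trust."""
    if len(samples) < min_samples:
        return None  # too few observations: report nothing rather than a misleading number
    ordered = sorted(samples)
    rank = max(1, -(-q * len(ordered) // 100))  # integer ceiling of q/100 * n
    return ordered[rank - 1]

thin_window = [90, 120, 180, 240]        # 4 leftover stragglers
healthy_window = list(range(1, 101))     # 100 hypothetical samples: 1..100 seconds

thin_p95 = gated_percentile(thin_window, 95)        # None
healthy_p95 = gated_percentile(healthy_window, 95)  # 95
```

Returning `None` instead of a number keeps the post-batch artifact off dashboards and out of alerts, at the cost of a gap in the series.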

Practical Application

When evaluating pipeline health for a nightly batch job:

Phase          Throughput     Metrics Reliability
Ramp-up        Increasing     Noisy, ignore
Steady-state   Stable, high   Trust these
Ramp-down      Decreasing     Misleading
Post-batch     Zero           Stale/artifact
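
The table can be turned into a rough classifier that labels each sampling interval from two consecutive throughput readings. This is only a sketch under stated assumptions: `classify_phase`, the 100 jobs/min `steady_rate`, and the 10% `tolerance` are all hypothetical, and a real pipeline would likely need smoothing over more than two points.

```python
RELIABILITY = {
    "ramp-up": "noisy, ignore",
    "steady-state": "trust these",
    "ramp-down": "misleading",
    "post-batch": "stale/artifact",
}

def classify_phase(previous_rate: float, current_rate: float,
                   steady_rate: float = 100.0, tolerance: float = 0.1) -> str:
    """Label a sampling interval using throughput (jobs/min) at its start and end."""
    if current_rate == 0:
        return "post-batch"  # zero throughput: whatever the histogram shows is stale
    change = current_rate - previous_rate
    if current_rate >= steady_rate and abs(change) <= tolerance * steady_rate:
        return "steady-state"
    # A flat rate below steady_rate also lands on "ramp-down", the conservative label.
    return "ramp-up" if change > 0 else "ramp-down"

# Hypothetical throughput readings across one nightly run, in jobs/min.
rates = [0, 40, 100, 104, 30, 0]
phases = [classify_phase(prev, curr) for prev, curr in zip(rates, rates[1:])]
```

Looking up each label in `RELIABILITY` gives the verdict from the table, so percentile readings can be annotated or filtered by phase before anyone interprets them.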

Example

A batch job runs 22:24-22:26 with ~100 jobs/min, then throughput drops to zero.

  • 22:24-22:26: p95 = 2 min (real performance)
  • 22:30+: p95 = 4 min (histogram artifact from emptying window)

The pipeline was healthy. The “spike” at 22:30 is just the histogram exposing previously-hidden tail latency from the batch.
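
To see the window emptying over time, the sketch below replays a made-up timeline through a 5-minute sliding window. It assumes each latency is recorded when its job completes and uses a nearest-rank p95 over exact samples. The events, `WINDOW_S`, and `p95_at` are hypothetical and do not reproduce the 22:24 numbers above exactly.

```python
from typing import List, Optional, Tuple

WINDOW_S = 300  # assumed 5-minute query window, in seconds

def p95_at(events: List[Tuple[int, int]], now_s: int,
           window_s: int = WINDOW_S) -> Optional[int]:
    """Nearest-rank p95 of latencies whose completion time falls in (now - window, now]."""
    in_window = sorted(lat for done, lat in events if now_s - window_s < done <= now_s)
    if not in_window:
        return None  # empty window: nothing left to report
    rank = -(-95 * len(in_window) // 100)  # integer ceiling of 0.95 * n
    return in_window[rank - 1]

# Hypothetical events as (completion time, latency), both in seconds since batch start.
fast = [(t, 30 + t % 41) for t in range(120)]   # one 30-70s job completing per second
stragglers = [(150, 90), (200, 120), (280, 180), (340, 240)]
events = fast + stragglers

# p95 at the end of the batch, while fast samples age out, after they are gone, and once empty.
timeline = {now: p95_at(events, now) for now in (120, 345, 420, 700)}
# timeline == {120: 67, 345: 90, 420: 240, 700: None}
```

No job got slower between the second and third reading. The p95 climbs from 90s to 240s purely because the fast samples left the window while the stragglers stayed.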

Tags

#observability #prometheus #histograms #batch-processing #metrics
