Histogram Metrics in Batch Workloads

James Phoenix

The Problem: Phantom Latency Spikes

When analyzing Prometheus histogram metrics (p50, p95, p99) for batch/nightly jobs, you’ll often see latency “spike” after the workload completes. This is a statistical artifact, not a real performance degradation.

Why This Happens

Prometheus histograms don't store percentiles directly; histogram_quantile() computes them from observations inside a sliding rate() window (typically 5-15 minutes). When throughput drops to zero:

  1. No new fast requests arrive to “dilute” the histogram
  2. The window becomes dominated by slow stragglers from the batch tail
  3. Percentiles drift upward toward the slowest observations

```
During high throughput:              After throughput drops:

100 samples in window:               4 samples remaining:
[30s, 45s, 50s, 55s, 60s,            [90s, 120s, 180s, 240s]
 65s, 70s, 90s, 120s, 180s]

p95 = ~2 min (reasonable)            p95 = ~4 min (!)
```
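
A few lines of Python make the arithmetic concrete. The durations are invented to mirror the diagram, and statistics.quantiles stands in for the histogram math:

```python
import statistics

# Invented durations (seconds): 98 fast requests plus two slow stragglers.
fast_bulk = [30] * 40 + [45] * 25 + [60] * 20 + [90] * 10 + [120] * 3
stragglers = [180, 240]

def p95(samples):
    return statistics.quantiles(samples, n=20)[18]  # 19 cut points; index 18 is p95

print(p95(fast_bulk + stragglers))   # ~118s: tail diluted by the fast bulk
print(p95([90, 120] + stragglers))   # ~225s: same tail, nearly empty window
```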

Key Insight

High throughput masks tail latency. Low throughput exposes it.


Neither number is wrong – it's how percentile math works with finite samples. The slow requests were always there, just hidden by the volume of faster ones.

Rules for Reading Batch Metrics

```
        ┌─────────────────┐
        │  TRUST THIS     │ ← High throughput, statistically significant
        │  (100+ jobs/min)│
────────┘                 └────────────────
   ↑                              ↑
Ramp-up                      Ramp-down
(noisy)                      (misleading)
```
  1. Read metrics during steady-state load – not ramp-up or ramp-down
  2. p50/p95/p99 need volume – only meaningful when the window holds at least ~10-20 samples (see the gating sketch after this list)
  3. Zero throughput = stale data – histogram shows old observations until window expires
  4. Post-batch “spikes” are cosmetic – they reveal true tail, not new problems
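
Rule 2 can be automated: check sample volume before reading the quantile at all. Here's a minimal sketch against the Prometheus HTTP API, assuming a hypothetical job_duration_seconds histogram on a local server:

```python
import requests

PROM_URL = "http://localhost:9090/api/v1/query"  # assumed local Prometheus
METRIC = "job_duration_seconds"                  # hypothetical histogram metric
MIN_SAMPLES = 20                                 # rule-of-thumb volume floor

def query(expr: str) -> float | None:
    """Run an instant PromQL query; return the first value or None."""
    resp = requests.get(PROM_URL, params={"query": expr}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else None

def trusted_p95(window: str = "5m") -> float | None:
    """Read p95 only when the window holds enough samples to mean anything."""
    samples = query(f"sum(increase({METRIC}_count[{window}]))")
    if samples is None or samples < MIN_SAMPLES:
        return None  # ramp-up, ramp-down, or post-batch: skip the read
    return query(
        f"histogram_quantile(0.95, sum(rate({METRIC}_bucket[{window}])) by (le))"
    )

p95 = trusted_p95()
print(f"p95 = {p95:.1f}s" if p95 is not None else "insufficient volume; skipping")
```

Gating on increase(..._count[window]) counts the actual observations in the window, which is exactly the volume check rule 2 asks for.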

Practical Application

When evaluating pipeline health for a nightly batch job:

| Phase | Throughput | Metrics Reliability |
| --- | --- | --- |
| Ramp-up | Increasing | Noisy, ignore |
| Steady-state | Stable, high | Trust these |
| Ramp-down | Decreasing | Misleading |
| Post-batch | Zero | Stale/artifact |
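
To apply the table programmatically (say, for dashboard annotations or alert inhibition), a toy classifier over recent throughput samples is enough. The thresholds here are invented for illustration:

```python
def classify_phase(throughput: list[float], tolerance: float = 0.1) -> str:
    """Label a short window of throughput samples (jobs/min) with a batch phase."""
    first, last = throughput[0], throughput[-1]
    if last == 0:
        return "post-batch"    # zero throughput: metrics are stale artifacts
    change = (last - first) / max(first, 1e-9)
    if change > tolerance:
        return "ramp-up"       # increasing: noisy, ignore
    if change < -tolerance:
        return "ramp-down"     # decreasing: misleading
    return "steady-state"      # stable: trust these

print(classify_phase([0, 40, 90]))      # ramp-up
print(classify_phase([100, 98, 101]))   # steady-state
print(classify_phase([90, 30, 0]))      # post-batch
```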

Example

A batch job runs 22:24-22:26 with ~100 jobs/min, then throughput drops to zero.

  • 22:24-22:26: p95 = 2 min (real performance)
  • 22:30+: p95 = 4 min (histogram artifact from emptying window)

The pipeline was healthy. The “spike” at 22:30 is just the histogram exposing previously-hidden tail latency from the batch.
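
Replaying that timeline with a toy sliding window reproduces both readings. Timestamps and durations are fabricated, and a hand-rolled 5-minute window stands in for the histogram's rate window:

```python
from datetime import datetime, timedelta
import statistics

t0 = datetime(2024, 1, 1, 22, 24)
# Fabricated (completion time, duration-in-seconds) pairs: two minutes of
# steady load at ~1 job/sec, then two stragglers finishing after the main wave.
obs = [(t0 + timedelta(seconds=i), 60 + (i % 7) * 10) for i in range(120)]
obs += [(t0 + timedelta(seconds=150), 180), (t0 + timedelta(seconds=210), 240)]

def windowed_p95(now, window=timedelta(minutes=5)):
    durations = [d for t, d in obs if now - window <= t <= now]
    if len(durations) < 2:
        return None  # too few samples to compute a quantile at all
    return statistics.quantiles(durations, n=20)[18]

print(windowed_p95(t0 + timedelta(minutes=2)))  # ~120s: steady state, tail diluted
print(windowed_p95(t0 + timedelta(minutes=7)))  # ~231s: only stragglers remain
```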

Tags

#observability #prometheus #histograms #batch-processing #metrics

