The Problem: Phantom Latency Spikes
When analyzing Prometheus histogram metrics (p50, p95, p99) for batch/nightly jobs, you’ll often see latency “spike” after the workload completes. This is a statistical artifact, not a real performance degradation.
Why This Happens
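Prometheus histograms store cumulative bucket counters; percentiles are estimated at query time, usually with histogram_quantile() over rate() of those buckets across a sliding time window (typically 5-15 minutes). When throughput drops to zero: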
- No new fast requests arrive to “dilute” the histogram
- The window becomes dominated by slow stragglers from the batch tail
- Percentiles drift upward toward the slowest observations
During high throughput (100 samples in window):
- Representative samples: [30s, 45s, 50s, 55s, 60s, 65s, 70s, 90s, 120s, 180s]
- p95 = ~2 min (reasonable)
After throughput drops (4 samples remaining):
- Samples: [90s, 120s, 180s, 240s]
- p95 = ~4 min (!)
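The drift can be reproduced with a small simulation. This is a minimal sketch, not how Prometheus computes quantiles: it takes percentiles over raw samples with linear interpolation, whereas Prometheus interpolates within histogram buckets. The batch shape, the latency values, the 300 s window, and the names `percentile` and `window_p95` are all assumptions made for illustration. The one modeling choice that matters is that each job is recorded when it finishes, so slow jobs land in later windows.

```python
import random


def percentile(values, q):
    """Linear-interpolation percentile of raw values, q in [0, 1]."""
    ordered = sorted(values)
    if not ordered:
        return float("nan")  # empty window: nothing to report
    rank = (len(ordered) - 1) * q
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (rank - lo) * (ordered[hi] - ordered[lo])


def window_p95(observations, now, window_s=300):
    """Sample count and p95 of latencies recorded in (now - window_s, now]."""
    in_window = [lat for done, lat in observations if now - window_s < done <= now]
    return len(in_window), percentile(in_window, 0.95)


# Hypothetical batch: 200 jobs start within the first 120 s.
# Each observation is (completion time, latency), both in seconds.
rng = random.Random(0)
observations = []
for _ in range(200):
    start = rng.uniform(0, 120)
    latency = rng.choice([30, 45, 50, 55, 60, 65, 70, 90, 120, 180, 240])
    observations.append((start + latency, latency))

# Query the same data at increasingly late times.
for now in (150, 300, 450, 520):
    count, p95 = window_p95(observations, now)
    print(f"t={now:>3}s  samples={count:>3}  p95={p95:.0f}s")

# The four stragglers from the illustration above.
print(round(percentile([90, 120, 180, 240], 0.95), 1))
```

The last line gives roughly 231 s for the four stragglers, which matches the ~4 min figure. In the loop, the sample count shrinks as the query time moves past the batch while the p95 climbs, even though no new slow work was done.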
Key Insight
High throughput masks tail latency. Low throughput exposes it.
Neither is wrong – it’s how percentile math works with finite samples. The slow requests were always there, just hidden by the volume of faster ones.
Rules for Reading Batch Metrics
        ┌─────────────────┐
        │   TRUST THIS    │ ← High throughput, statistically significant
        │ (100+ jobs/min) │
────────┘                 └────────────────
        ↑                 ↑
     Ramp-up          Ramp-down
     (noisy)         (misleading)
- Read metrics during steady-state load – not ramp-up or ramp-down
- p50/p95/p99 need volume – only meaningful with at least ~10-20 samples in the window; with fewer, the high percentiles collapse toward the single slowest observation
- Zero throughput = stale data – histogram shows old observations until window expires
- Post-batch “spikes” are cosmetic – they reveal the true tail, not new problems
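The volume rule can be turned into a simple gate before a percentile is read or alerted on. The sketch below is one possible heuristic, not an established standard: the `floor=20` default comes from the upper end of the ~10-20 samples guideline above, while the `1 / (1 - q)` bound is an added assumption (with n samples, the slowest one alone covers the top 1/n of the distribution). The function names are hypothetical.

```python
import math


def min_samples_for(q):
    """Fewest samples at which quantile q is not simply the maximum.

    Heuristic assumption: 1 / (1 - q), i.e. 20 for p95 and 100 for p99.
    """
    # round() guards against floating-point noise such as 19.999999999999982
    return math.ceil(round(1 / (1 - q), 9))


def is_trustworthy(samples_in_window, q, floor=20):
    """True if the window holds enough samples to read quantile q."""
    return samples_in_window >= max(floor, min_samples_for(q))


for samples in (4, 20, 100):
    verdicts = {f"p{int(q * 100)}": is_trustworthy(samples, q) for q in (0.5, 0.95, 0.99)}
    print(samples, verdicts)
```

Under these assumptions, the 4-sample post-batch window fails for every percentile, and p99 is only accepted once the window holds 100 samples.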
Practical Application
When evaluating pipeline health for a nightly batch job:
| Phase | Throughput | Metrics Reliability |
|---|---|---|
| Ramp-up | Increasing | Noisy, ignore |
| Steady-state | Stable, high | Trust these |
| Ramp-down | Decreasing | Misleading |
| Post-batch | Zero | Stale/artifact |
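The table can be applied mechanically to a throughput series. The following is a rough sketch under stated assumptions: `steady_min=100.0` borrows the 100+ jobs/min figure from the diagram above, `tolerance=0.2` (a 20% change between readings still counts as stable) is an arbitrary illustrative choice, and `classify_phase` is a hypothetical helper, not part of any Prometheus tooling.

```python
def classify_phase(prev_rate, curr_rate, steady_min=100.0, tolerance=0.2):
    """Map two consecutive throughput readings (jobs/min) to (phase, reliability)."""
    if curr_rate == 0:
        return "Post-batch", "Stale/artifact"
    stable = abs(curr_rate - prev_rate) <= tolerance * curr_rate
    if curr_rate >= steady_min and stable:
        return "Steady-state", "Trust these"
    if curr_rate > prev_rate:
        return "Ramp-up", "Noisy, ignore"
    # Falling throughput, or flat but below steady_min: treat as unreliable.
    return "Ramp-down", "Misleading"


# Hypothetical per-minute throughput readings for one nightly run.
rates = [0, 40, 100, 105, 100, 30, 0]
for prev, curr in zip(rates, rates[1:]):
    phase, reliability = classify_phase(prev, curr)
    print(f"{prev:>3} -> {curr:>3} jobs/min: {phase} ({reliability})")
```

Only the readings labeled Steady-state would be used to judge pipeline health; everything else is filtered out before looking at percentiles.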
Example
A batch job runs 22:24-22:26 with ~100 jobs/min, then throughput drops to zero.
- 22:24-22:26: p95 = 2 min (real performance)
- 22:30+: p95 = 4 min (histogram artifact from emptying window)
The pipeline was healthy. The “spike” at 22:30 is just the histogram exposing previously-hidden tail latency from the batch.
Tags
#observability #prometheus #histograms #batch-processing #metrics

