Why P95 Latency Is the Only Metric That Matters at 3 AM

#backend #monitoring #performance #sre

Lenard Francis

If your checkout endpoint serves 10,000 requests per minute and 5% of those requests are slow, that is 500 bad experiences every minute.

Averages compress that pain into a single comfortable number.
P95 latency — the latency at the 95th percentile — tells you what your slowest users are actually experiencing.

It's the metric that catches the spike the average hides.
This is why I track P95 as the primary health signal, not averages.
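
To make the gap concrete, here is a minimal sketch with made-up numbers: 1,000 requests, of which 920 take 120 ms and 80 take 2,500 ms. The sample data and the helper name `percentile_nearest_rank` are assumptions for illustration, and nearest-rank is only one of several common percentile definitions.

```python
import math
from statistics import mean


def percentile_nearest_rank(samples, pct):
    """Return the nearest-rank percentile (pct is an integer from 1 to 100)."""
    ordered = sorted(samples)
    # 1-based rank of the sample at or above the requested percentile.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical minute of traffic: 92% fast requests, 8% slow ones.
latencies_ms = [120] * 920 + [2500] * 80

avg_ms = mean(latencies_ms)
p95_ms = percentile_nearest_rank(latencies_ms, 95)

print(f"average: {avg_ms:.1f} ms, p95: {p95_ms} ms")
# average: 310.4 ms, p95: 2500 ms
```

The average suggests a mildly slow service, while P95 shows that the slow tail is waiting 2.5 seconds. Note that P95 only moves once more than 5% of requests are affected; a smaller tail needs P99 to show up.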

How Latency Spikes Actually Propagate
A latency spike rarely starts in your application. It usually starts somewhere else and cascades inward.

The typical pattern looks like this:

Slow upstream dependency
↓
Connection pool saturation
↓
Request queue growth
↓
Latency spike propagation
↓
Timeouts and failures

The Cascade Pattern
1. An upstream dependency (database, payment gateway, third-party API) slows down.
2. Your FastAPI app keeps accepting requests while waiting for responses.
3. Your connection pool fills up, and new requests queue behind existing ones.
4. Queue depth grows and memory pressure builds.
5. Response times climb across all endpoints, not just the affected one. Eventually requests start timing out or failing entirely.

By stage 3, you have a problem. By stage 5, your customers know about it before you do.
The cascade failure pattern is particularly nasty. A slow database query holds a connection.

That held connection blocks another request. That blocked request ties up execution capacity. Multiply that by concurrent users and you get full service degradation from a single slow dependency.

Under async workloads, the failure mode becomes especially deceptive because the application continues accepting requests while pending awaits on the upstream accumulate in the background.
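
The effect can be reproduced in a few lines of plain asyncio. This is a toy simulation, not production code: `POOL_SIZE`, `UPSTREAM_DELAY`, and the function names are hypothetical, a semaphore stands in for the connection pool, and `asyncio.sleep` stands in for the slow dependency.

```python
import asyncio
import time

POOL_SIZE = 2          # hypothetical connection pool size
UPSTREAM_DELAY = 0.05  # seconds each simulated upstream call holds a connection


async def handle_request(pool: asyncio.Semaphore) -> float:
    """Simulate one request and return its total latency in seconds."""
    started = time.perf_counter()
    async with pool:                         # requests queue here once the pool is full
        await asyncio.sleep(UPSTREAM_DELAY)  # stand-in for the slow dependency
    return time.perf_counter() - started


async def run_burst(n_requests: int) -> list:
    pool = asyncio.Semaphore(POOL_SIZE)
    # Every request is accepted immediately; nothing is rejected at the door.
    return await asyncio.gather(*(handle_request(pool) for _ in range(n_requests)))


latencies = asyncio.run(run_burst(6))
print([round(seconds, 2) for seconds in latencies])
```

With a pool of two, the six requests complete in three waves, so the last requests wait roughly three times as long as the first even though the upstream call itself never got slower during the run.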

High Traffic Spikes Make This Worse
Under normal load, a slow upstream dependency is annoying.
Under a traffic spike, it's catastrophic.

Here's why:

Connection pool saturation happens faster. If you have 20 database connections and traffic doubles, you hit the ceiling twice as fast.
Queue depth explodes. Requests piling up behind a slow dependency compound each other's wait time.
Memory pressure builds. Each queued request holds state. Enough of them and you drift toward OOM territory.
Recovery is non-linear. Once a connection pool is saturated, it often stays saturated even after the upstream issue resolves — because the backlog keeps it full.
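
The non-linear recovery is easy to see in a discrete-time sketch. All numbers below are invented for illustration: 10 requests arrive per tick, the service normally completes 12 per tick, and it completes only 4 per tick while the upstream is slow (ticks 10 to 19).

```python
def simulate_backlog(ticks, arrivals_per_tick, normal_capacity,
                     degraded_capacity, slow_from, slow_until):
    """Return the queued-request count at the end of each tick."""
    backlog = 0
    history = []
    for tick in range(ticks):
        slow = slow_from <= tick < slow_until
        capacity = degraded_capacity if slow else normal_capacity
        # New arrivals join the queue; the service drains what it can.
        backlog = max(0, backlog + arrivals_per_tick - capacity)
        history.append(backlog)
    return history


SLOW_FROM, SLOW_UNTIL = 10, 20  # upstream is slow for 10 ticks
history = simulate_backlog(60, 10, 12, 4, SLOW_FROM, SLOW_UNTIL)

peak = max(history)
cleared_at = next(t for t in range(SLOW_UNTIL, len(history)) if history[t] == 0)
drain_ticks = cleared_at - SLOW_UNTIL + 1

print(f"peak backlog: {peak}, ticks to drain after recovery: {drain_ticks}")
# peak backlog: 60, ticks to drain after recovery: 30
```

A 10-tick slowdown leaves a backlog that takes 30 ticks to clear, because the only thing draining it is the small headroom between normal capacity and incoming traffic. If traffic rises during the incident, that headroom shrinks further.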

The cruel irony is that traffic spikes happen when your service matters most.

A flash sale. A viral moment. A major announcement.
Exactly the wrong time to be debugging latency from a dashboard.

What Didn't Work For Me

Monitoring sounds easy in theory. In practice, most setups failed me in one of four ways.

Prometheus + Grafana

Powerful, but operationally heavy.

Setting up exporters, configuring dashboards, and maintaining the stack all come before writing a single alert rule.

And when the alert fires at 3 AM, you still have to log in and interpret charts under pressure.

Simple Health Checks

GET /health → 200 OK tells you the service is alive.
It doesn't tell you it's running at 8x normal latency while technically responding.

Average Latency Monitoring

Averages mask the spikes that actually hurt users.

In one case, a payment provider slowdown pushed P95 latency from roughly 180 ms to over 2 seconds within minutes — while average latency still looked acceptable.

By the time averages reflected the issue, checkout failures had already started.

Alert Fatigue

I added more monitors to catch more things, which meant more alerts. Most of them were noise. When everything is urgent, nothing is.

Monitoring systems usually optimise for data collection. Operators actually need decision compression.

What I Built Instead

I wanted something that:
Tracked P95, not averages
Produced a single health score instead of 15 metrics to interpret
Caught degradation trends early, before full failure
Required zero config to add to an existing FastAPI app

The result is a FastAPI middleware that continuously computes degradation signals directly from live request traffic.

from fastapi import FastAPI
from fastapi_alertengine import instrument

app = FastAPI()
instrument(app)

The middleware exposes a structured /health/alerts endpoint:

{
  "status": "warning",
  "health_score": {
    "score": 61,
    "trend": "degrading"
  },
  "metrics": {
    "overall_p95_ms": 1847.3,
    "error_rate": 0.08,
    "anomaly_score": 0.9
  }
}

One status. One score. One trend direction. No dashboards to configure. No agents to run. No Prometheus exporters.
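
For intuition about how a P95 signal can be derived from live traffic, here is a minimal rolling-window sketch. It is not the fastapi-alertengine implementation, which may work quite differently; the `RollingP95` class and its window size are assumptions for illustration, and the latency values reuse the 180 ms and 2 second figures from the payment provider incident above.

```python
import math
from collections import deque


class RollingP95:
    """Track P95 over the most recent `window_size` request latencies."""

    def __init__(self, window_size=1000):
        # deque with maxlen silently drops the oldest sample when full.
        self._samples = deque(maxlen=window_size)

    def record(self, latency_ms):
        self._samples.append(latency_ms)

    def p95(self):
        if not self._samples:
            return None
        ordered = sorted(self._samples)
        rank = max(1, math.ceil(95 * len(ordered) / 100))  # nearest-rank
        return ordered[rank - 1]


tracker = RollingP95(window_size=100)

for _ in range(100):          # steady state: every request takes 180 ms
    tracker.record(180.0)
baseline = tracker.p95()

for _ in range(10):           # the next 10 requests hit a slow dependency
    tracker.record(2000.0)
degraded = tracker.p95()

print(baseline, degraded)
# 180.0 2000.0
```

Sorting on every read is fine for a sketch; a real middleware under load would more likely use a histogram or a streaming quantile estimate to keep the cost constant.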

The Human-in-the-Loop Layer

Once I had a reliable health signal, the next question was:
What do I do with it?

I built a managed orchestration layer that polls /health/alerts every 5 seconds. When the score drops below the threshold, it:

Runs an AI diagnosis (Claude) on the metric context
Sends a WhatsApp, Telegram, or Slack message with a plain-English summary
Generates a single-use recovery link
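
The decision step of such a poller might look like the sketch below. It only covers turning a /health/alerts payload into an alert decision; the HTTP polling, AI diagnosis, and message delivery are left out. The threshold of 70 and the function name `evaluate` are assumptions, since the actual threshold is not stated here.

```python
import json

SCORE_THRESHOLD = 70  # hypothetical; the real threshold is not documented here


def evaluate(payload_text, threshold=SCORE_THRESHOLD):
    """Return a one-line alert summary, or None if the service looks healthy."""
    payload = json.loads(payload_text)
    health = payload["health_score"]
    if health["score"] >= threshold:
        return None
    metrics = payload["metrics"]
    return (
        f"Health score {health['score']} ({health['trend']}), "
        f"P95 {metrics['overall_p95_ms']:.0f} ms, "
        f"error rate {metrics['error_rate']:.0%}"
    )


# The sample payload shown earlier in this post.
sample = """
{
  "status": "warning",
  "health_score": {"score": 61, "trend": "degrading"},
  "metrics": {"overall_p95_ms": 1847.3, "error_rate": 0.08, "anomaly_score": 0.9}
}
"""

print(evaluate(sample))
# Health score 61 (degrading), P95 1847 ms, error rate 8%
```

Returning `None` for the healthy case keeps the polling loop trivial: anything that is not `None` gets diagnosed and sent.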

Most AI incident tooling jumps straight to autonomous remediation. I intentionally didn't.

Production systems deserve human authorisation before recovery actions execute. I read the diagnosis, preview the recovery action, and tap approve – all from my phone.

Nothing executes automatically. Every action is logged immutably.
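
One way to get the single-use approval behaviour is sketched below. This is an illustration of the idea only, not the commercial service's implementation: the `ApprovalTokens` class, the example action string, and the in-memory storage are all assumptions. A real system would persist tokens with an expiry and write the audit trail to append-only storage rather than a Python list.

```python
import secrets


class ApprovalTokens:
    """Issue unguessable tokens that authorise exactly one recovery action."""

    def __init__(self):
        self._pending = {}    # token -> description of the action awaiting approval
        self.audit_log = []   # list of (event, action) tuples, appended in order

    def issue(self, action):
        token = secrets.token_urlsafe(32)
        self._pending[token] = action
        self.audit_log.append(("issued", action))
        return token

    def redeem(self, token):
        # pop() removes the token, so a second redemption finds nothing.
        action = self._pending.pop(token, None)
        if action is None:
            self.audit_log.append(("rejected", None))
            return None
        self.audit_log.append(("approved", action))
        return action


tokens = ApprovalTokens()
token = tokens.issue("restart checkout workers")  # hypothetical recovery action

first = tokens.redeem(token)    # the operator taps approve
second = tokens.redeem(token)   # the same link is opened again

print(first, second)
# restart checkout workers None
```

The recovery action would run only when `redeem` returns it, which keeps the human approval as the single gate between diagnosis and execution.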

I built the mobile-first delivery because I work in Zimbabwe, where engineers aren't always at laptops when things break.

WhatsApp is the operational control plane here.

That constraint produced something better than I expected:

Alerts that find you, rather than dashboards you have to find.

The Open Source Core
The telemetry middleware is free and MIT licensed.
pip install fastapi-alertengine

The managed orchestration layer (AI diagnosis, WhatsApp/Telegram alerts, and human-authorised recovery) is a commercial service.

GitHub: https://github.com/Tandem-Media/fastapi-alertengine
Docs: https://tandem-media.github.io/fastapi-alertengine/
YouTube: https://youtu.be/vKLqcVdSMO8?si=eMU3Fm_WPmJTQi2Y

Most monitoring stacks are good at detecting incidents.
Very few are good at reducing operator uncertainty during one.
How are you handling that gap today?