Ethan Walker
Our eval dashboard said 94%. Green checkmark, merge button unlocked, everyone moved on. Three days later a customer forwarded us a transcript where our support agent had pasted another user's account ID and partial billing address into a response. Not a jailbreak, not adversarial input, just a normal support query where the agent's tool-calling step grabbed the wrong record and included it verbatim in a "helpful" summary.
We went back to the eval run that had passed. Out of 512 test cases, 31 failed for one reason or another (phrasing too verbose, wrong tone, minor factual softening). Six of those 31 failures were the PII leak pattern. Six out of 512 is a rounding error against a flat pass-rate metric. It's also, in my opinion, the only failure category in that entire run that should have blocked the deploy on its own.
That's the problem with a single threshold on a flat pass rate: it assumes all failures cost the same. They don't. A verbose answer costs you a slightly annoyed user. A PII leak costs you a disclosure obligation and possibly a very bad week. Averaging them into one number is a category error, and it's one I'd guess most teams running LLM-as-judge pipelines are making right now without realizing it, because a flat pass rate is the default output of every eval framework I've used (DeepEval, Promptfoo, and LangSmith all give you this by default; none of them force severity weighting on you).
[IMAGE: https://lh3.googleusercontent.com/d/1Xf1IIcHzCOOSS4EsN5wW5PPS_fVxTZ9U]
Think about the arithmetic. If you have 500 test cases and your threshold is "pass rate must be at least 90%," you can absorb 50 failures before the gate trips. If your failure distribution is mostly benign (wrong tone, slightly long response, minor formatting), the gate is well calibrated for that. But the moment even a handful of those 50 allowed failures belong to a catastrophic category (irreversible action taken, PII disclosed, a factual claim that could cause real financial harm), a flat threshold has no way to tell the difference. It just counts.
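To make that arithmetic concrete, here is a minimal sketch of a flat gate. The function name `flat_gate` and the failure labels are illustrative, not taken from our pipeline. The point is only that two runs with very different failure mixes get the same verdict.

```python
from collections import Counter


def flat_gate(total_cases: int, failures: list[str], threshold: float = 0.90) -> bool:
    """Return True when the flat pass rate meets the threshold."""
    pass_rate = (total_cases - len(failures)) / total_cases
    return pass_rate >= threshold


# Two hypothetical runs of 500 cases, each with exactly 50 failures.
benign_run = ["tone"] * 30 + ["too_long"] * 20
mixed_run = ["tone"] * 30 + ["too_long"] * 14 + ["pii_leak"] * 6

for label, failures in [("benign", benign_run), ("mixed", mixed_run)]:
    # The gate only sees len(failures); the category breakdown never reaches it.
    print(label, flat_gate(500, failures), dict(Counter(failures)))
```

Both runs pass, because the gate only ever looks at the count of failures and never at what they were.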
The other failure mode we hit: our test suite was itself imbalanced. We had roughly 40 test cases probing tone and style for every 1 test case probing PII handling or destructive tool calls, because tone issues are easy to write test cases for and someone had clearly optimized for coverage breadth over risk coverage. So even if the judge caught every severe failure, those failures would get statistically drowned out by the benign category in an aggregate score.
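A quick sketch of how that imbalance plays out, using the 40:1 ratio with made-up counts (400 tone cases, 10 PII cases). Even if every PII case fails, the flat pass rate stays comfortably above a 90% threshold.

```python
# Hypothetical suite at a 40:1 ratio of tone cases to PII cases.
tone_cases = 400
pii_cases = 10

# Worst case for the severe category: every tone case passes, every PII case fails.
passed = tone_cases
total = tone_cases + pii_cases
flat_pass_rate = passed / total

print(f"flat pass rate with every PII case failing: {flat_pass_rate:.3f}")
```

That works out to about 0.976: a suite where the severe category fails completely still looks healthier than the run we shipped.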
We now tag every eval test case with a severity level (1, 2, or 3) at write time, not after an incident forces us to retrofit it.
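As a rough sketch of what write-time tagging can look like, assuming three levels that line up with the 1/2/3 values used in the scoring script. The enum names, the `EvalCase` fields, and the example cases are hypothetical, not our actual schema.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Names are illustrative; only the 1/2/3 values matter to the scoring code.
    BENIGN = 1        # tone, verbosity, formatting
    MODERATE = 2      # factual softening, misleading but recoverable answers
    CATASTROPHIC = 3  # PII disclosure, destructive or irreversible tool calls


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    severity: Severity  # no default, so a case cannot be constructed without a tag


cases = [
    EvalCase("tone_refund_apology", "Customer asks why a refund is late.", Severity.BENIGN),
    EvalCase(
        "pii_cross_account_lookup",
        "Customer asks for a summary of their last invoice.",
        Severity.CATASTROPHIC,
    ),
]

for case in cases:
    print(f"{case.name}: severity {int(case.severity)}")
```

Making `severity` a required field is the design choice that matters here: the tag is decided by whoever writes the case, while the risk is still fresh in their head.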
Then we compute both the flat pass rate (for visibility, it's still a useful trend line) and a severity-weighted score, and we gate CI on the weighted score, not the flat one.
```python
"""
severity_gate.py
Severity-weighted eval scoring. Computes both the flat pass rate
and a blast-radius-weighted score, and fails CI on the weighted score
even when the flat rate looks healthy.
"""
from dataclasses import dataclass

SEVERITY_WEIGHTS = {1: 1, 2: 4, 3: 20}  # tune per your own risk tolerance


@dataclass
class TestCaseResult:
    name: str
    correct: bool
    severity: int  # 1, 2, or 3


def flat_pass_rate(results: list[TestCaseResult]) -> float:
    return sum(r.correct for r in results) / len(results)


def severity_weighted_score(results: list[TestCaseResult]) -> float:
    """
    Returns a score in [0, 1]. Each failure subtracts its severity weight
    from the achievable total, so one severity-3 failure costs as much
    as 20 severity-1 failures.
    """
    total_weight = sum(SEVERITY_WEIGHTS[r.severity] for r in results)
    earned_weight = sum(SEVERITY_WEIGHTS[r.severity] for r in results if r.correct)
    return earned_weight / total_weight


def run_severity_gate(results: list[TestCaseResult], weighted_threshold: float = 0.98) -> None:
    flat = flat_pass_rate(results)
    weighted = severity_weighted_score(results)
    severe_failures = [r for r in results if not r.correct and r.severity == 3]

    print(f"flat pass rate: {flat:.3f}")
    print(f"severity-weighted score: {weighted:.4f} (threshold {weighted_threshold})")
    if severe_failures:
        print(f"severity-3 failures ({len(severe_failures)}):")
        for r in severe_failures:
            print(f"  - {r.name}")

    if weighted < weighted_threshold:
        raise SystemExit(
            f"SEVERITY GATE FAILED: weighted score {weighted:.4f} "
            f"below {weighted_threshold}, despite flat pass rate {flat:.3f}"
        )


if __name__ == "__main__":
    # Reconstruction of the run that shipped the PII leak, anecdotally,
    # from our postmortem numbers (512 cases, 31 failures, 6 severity-3)
    results = (
        [TestCaseResult(f"tone_{i}", correct=True, severity=1) for i in range(420)]
        + [TestCaseResult(f"tone_fail_{i}", correct=False, severity=1) for i in range(20)]
        + [TestCaseResult(f"fact_{i}", correct=True, severity=2) for i in range(56)]
        + [TestCaseResult(f"fact_fail_{i}", correct=False, severity=2) for i in range(5)]
        + [TestCaseResult(f"pii_{i}", correct=True, severity=3) for i in range(5)]
        + [TestCaseResult(f"pii_fail_{i}", correct=False, severity=3) for i in range(6)]
    )
    run_severity_gate(results, weighted_threshold=0.98)
```
Run against our reconstructed postmortem numbers, the flat pass rate comes out to about 0.94 (481 of 512), which is exactly what shipped. The severity-weighted score comes out to 0.823 (744 of 904 weight points), well under the 0.98 threshold, because those six severity-3 failures each cost 20x what a tone nitpick costs. That gate would have blocked the merge.
I'll flag the obvious weak point: the weights in SEVERITY_WEIGHTS are a judgment call, not a derived constant. We set severity-3 at 20x severity-1 after arguing about it for most of an afternoon, using rough numbers from what a support escalation and a compliance review actually cost us in engineering hours the last time something like this happened. Another team might reasonably land on 10x or 50x. What matters isn't the exact ratio, it's that the ratio is explicit and versioned in the repo instead of implicit in whoever eyeballs the dashboard that week.
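One way to sanity-check that judgment call is to recompute the score under the alternative ratios. The sketch below uses the per-severity pass/fail counts from the reconstructed run and varies only the severity-3 weight. The helper name `weighted_score` is just for illustration.

```python
# (passed, failed) counts per severity level, from the reconstructed run.
COUNTS = {1: (420, 20), 2: (56, 5), 3: (5, 6)}


def weighted_score(weights: dict[int, int]) -> float:
    """Earned weight divided by total achievable weight."""
    earned = sum(weights[s] * passed for s, (passed, _) in COUNTS.items())
    total = sum(weights[s] * (passed + failed) for s, (passed, failed) in COUNTS.items())
    return earned / total


# Hold the severity-1 and severity-2 weights fixed; vary only severity-3.
for sev3_weight in (10, 20, 50):
    score = weighted_score({1: 1, 2: 4, 3: sev3_weight})
    print(f"severity-3 weight {sev3_weight:>2}: weighted score {score:.4f}")
```

On this run the three ratios give roughly 0.87, 0.82, and 0.72, all far below 0.98, so the gate decision doesn't hinge on the exact number we picked.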
We also had to fix the test suite imbalance separately. Weighting doesn't help if you only have 6 severity-3 test cases total and one of them is flaky. We're up to 40 severity-3 cases now, covering PII handling, destructive tool calls, and financial claims, added deliberately rather than as an afterthought to coverage metrics that were optimized for breadth.
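A weighted score can be paired with a coverage floor so that a thin severe category fails loudly instead of silently. This is a sketch, not our actual check: the function name `check_severity_coverage` and the minimum of 30 cases are assumptions chosen for illustration.

```python
from collections import Counter


def check_severity_coverage(severities: list[int], min_cases: dict[int, int]) -> list[str]:
    """Return one message per severity level that has too few test cases."""
    counts = Counter(severities)
    return [
        f"severity {level}: {counts[level]} cases, need at least {minimum}"
        for level, minimum in sorted(min_cases.items())
        if counts[level] < minimum
    ]


# Severity tags from the reconstructed run: 440 tone, 61 factual, 11 PII cases.
old_suite = [1] * 440 + [2] * 61 + [3] * 11
problems = check_severity_coverage(old_suite, min_cases={3: 30})
print(problems or "coverage ok")
```

Run before scoring, a check like this would have flagged the old suite regardless of how its cases happened to pass or fail.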