
Ethan Walker
The gate was a fixed 90% threshold on an intent-classification eval. The change came in at 91%, cleared the bar, went out. A fixed pass-rate gate catches collapses, not drift. This was drift, and it walked right through.
The eval had sat at 96-97% for weeks. A retrieval change knocked one slice (ambiguous refund requests) from 98% to 74%. That slice is 4% of traffic, so the aggregate only fell to 91%. Above 90, so the gate stayed green. The aggregate did exactly what aggregates do: it averaged a real failure into noise.
The users hitting that slice did not experience a 91%. They experienced a 74%.
A static threshold answers one question: did the whole thing fall off a cliff? It says nothing about whether a specific slice quietly got worse while everything else held it up. If 96 of your slices are fine and one craters, a healthy aggregate hides the crater. You find out from a support ticket, not from CI.
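We stopped gating on an absolute number and started gating against the last passing run. Two rules, and both have to hold. Rule 1: no slice drops more than 3 points below its baseline score. Rule 2: the aggregate drops no more than 1.5 points below the baseline aggregate.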
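    def gate(current, baseline):
        """Compare a run against the last passing run; an empty list means pass."""
        failures = []
        # Rule 1: no slice may drop more than 3 points below its baseline score.
        for slice_name, score in current.slices.items():
            prev = baseline.slices.get(slice_name)
            # Slices with no baseline entry (new slices) are skipped.
            if prev is not None and prev - score > 3.0:
                failures.append((slice_name, prev, score))
        # Rule 2: the aggregate may not drop more than 1.5 points.
        if baseline.aggregate - current.aggregate > 1.5:
            failures.append(("AGGREGATE", baseline.aggregate, current.aggregate))
        return failures  # empty == pass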
The refund slice dropping 24 points would have failed rule 1 on the first run, regardless of where the aggregate landed.
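To make the contrast concrete, here is a minimal sketch that runs the old fixed gate and the delta gate on the numbers from this incident. The slice names (`ambiguous_refund`, `order_status`), the `fixed_gate` helper, and the 96.5 baseline aggregate (picked from the 96-97 range) are assumptions for illustration, and the aggregates are entered directly rather than recomputed from the slices.

```python
from types import SimpleNamespace


def fixed_gate(current, threshold=90.0):
    """The old gate: pass if the aggregate clears a fixed bar (hypothetical helper)."""
    return current.aggregate >= threshold


def gate(current, baseline):
    """The delta gate from the post: empty list means pass."""
    failures = []
    for slice_name, score in current.slices.items():
        prev = baseline.slices.get(slice_name)
        if prev is not None and prev - score > 3.0:
            failures.append((slice_name, prev, score))
    if baseline.aggregate - current.aggregate > 1.5:
        failures.append(("AGGREGATE", baseline.aggregate, current.aggregate))
    return failures


# Last passing run (assumed values inside the 96-97 range from the post).
baseline = SimpleNamespace(
    aggregate=96.5,
    slices={"ambiguous_refund": 98.0, "order_status": 96.0},
)
# The run that shipped: refund slice at 74, aggregate at 91.
current = SimpleNamespace(
    aggregate=91.0,
    slices={"ambiguous_refund": 74.0, "order_status": 96.0},
)

fixed_ok = fixed_gate(current)       # True: 91 clears the 90 bar
failures = gate(current, baseline)   # refund slice and aggregate both flagged
for name, prev, score in failures:
    print(f"{name}: {prev} -> {score}")
```

On this input the fixed gate stays green while the delta gate reports two failures, the slice and the aggregate.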
Delta gating breaks the moment your baseline drifts down with you. If the baseline updates on every run, a 0.5-point slide each day passes every single time, and over two weeks you ratchet straight into a 7-point regression. Slow drift is invisible to a gate that keeps moving its own goalposts.
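A small simulation shows the failure mode. This is a sketch under assumptions: it checks only the aggregate rule, starts from a hypothetical 96.5, and the function names are made up for the example.

```python
def aggregate_gate_passes(baseline, current, max_drop=1.5):
    """Aggregate rule only: pass unless the drop exceeds max_drop points."""
    return baseline - current <= max_drop


def simulate_rolling_baseline(start=96.5, daily_slide=0.5, days=14):
    """Slide the score a little each day while the baseline follows every run."""
    baseline = start
    score = start
    results = []
    for _ in range(days):
        score -= daily_slide
        results.append(aggregate_gate_passes(baseline, score))
        baseline = score  # the flaw: the baseline moves even though the score fell
    return results, start - score


results, total_loss = simulate_rolling_baseline()
print(all(results), total_loss)
```

Every one of the 14 runs passes, because each is only 0.5 points below the run before it, yet the score ends 7 points below where it started.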
So the baseline updates only when main is green, and any intentional drop needs a human to approve it before it becomes the new floor. The baseline is a record of verified-good, not a record of most-recent.
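One way to encode that policy, as a sketch rather than our exact implementation: a run replaces the baseline only if main is green and it contains no drop, unless a human has approved the drop. The `Run` type, `has_drop`, `next_baseline`, and the `drop_approved` flag are hypothetical names, and "main is green" is reduced here to the aggregate rule.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    aggregate: float
    slices: dict = field(default_factory=dict)


def has_drop(current, baseline):
    """True if the aggregate or any shared slice scores below the baseline."""
    if current.aggregate < baseline.aggregate:
        return True
    return any(
        name in baseline.slices and score < baseline.slices[name]
        for name, score in current.slices.items()
    )


def next_baseline(baseline, current, main_green, drop_approved=False):
    """Return the baseline to use for the next run."""
    if not main_green:
        return baseline  # a red run never becomes the floor
    if has_drop(current, baseline) and not drop_approved:
        return baseline  # a drop needs human sign-off before it lowers the floor
    return current


# Replay the 0.5-point daily slide against a baseline that only moves when allowed.
baseline = Run(aggregate=96.5)
first_blocked_day = None
for day in range(1, 15):
    current = Run(aggregate=96.5 - 0.5 * day)
    main_green = baseline.aggregate - current.aggregate <= 1.5
    if not main_green and first_blocked_day is None:
        first_blocked_day = day
    baseline = next_baseline(baseline, current, main_green)

# An intentional, approved drop does become the new floor.
approved = next_baseline(Run(96.5), Run(96.0), main_green=True, drop_approved=True)
print(first_blocked_day, baseline.aggregate, approved.aggregate)
```

With the floor held in place, the slide that passed for two weeks under a rolling baseline is blocked on day 4, once the cumulative drop reaches 2.0 points.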