Tuomo Nikulainen

We compared heuristic failure detectors against LLM-as-judge on 7,212 agent traces. Heuristics scored 60.1% on TRAIL at $0 cost vs 11% for the best LLM.
The default approach to evaluating AI agents is to use another AI. LLM-as-judge. Feed the trace to a frontier model and ask "what went wrong?" It's intuitive, flexible, and expensive. It also underperforms purpose-built heuristics on most failure categories.
We know this because we tested both approaches systematically. Pisama has 18 production-grade heuristic detectors calibrated on 7,212 labeled entries from 13 external data sources. We benchmarked them against LLM judges on two public agent failure benchmarks. The results challenged our assumptions about when you need semantic reasoning and when simple pattern matching is enough.
This article presents the data, explains why heuristics outperform LLMs on structural failures, identifies the categories where LLMs are still essential, and describes the tiered architecture we settled on.
TRAIL, released by Patronus AI, contains 148 real agent execution traces with 841 human-labeled errors spanning 21 failure categories. It's designed to test whether systems can identify all failures in a given trace — not just one, but every issue present. This makes it harder than typical binary classification benchmarks.
The best published result from a frontier LLM is 11.0% joint accuracy (Gemini 2.5 Pro). Claude 3.7 Sonnet achieves 4.7%. OpenAI o3 achieves 9.2%. These are capable models performing poorly because the task requires systematic structural analysis, not the open-ended reasoning LLMs are optimized for.
Who&When, an ICML 2025 spotlight paper, tests a harder question: given a multi-agent conversation that failed, which agent caused the failure and at which step? This combines detection (something went wrong) with attribution (who's responsible and when did it happen).
Separately from these benchmarks, we maintain a golden dataset of 7,212 labeled entries across all 18 production detector categories. These entries come from 13 external sources including MAST-Data (NeurIPS 2025), AgentErrorBench, SWE-bench traces, GAIA traces, and real n8n workflow failures. We use this dataset for cross-validated calibration with per-difficulty stratification.
| Method | Joint Accuracy | Precision | Cost per Trace |
|---|---|---|---|
| Gemini 2.5 Pro | 11.0% | not reported | ~$0.05-0.15 |
| OpenAI o3 | 9.2% | not reported | ~$0.10-0.30 |
| Claude 3.7 Sonnet | 4.7% | not reported | ~$0.05-0.10 |
| Pisama heuristic | 60.1% | 100% | $0.00 |
The headline number on TRAIL: 60.1% joint accuracy, roughly 5.5x the best LLM's 11.0%, at zero marginal cost.
But the precision number matters more than the accuracy. On TRAIL, every failure Pisama flagged was a real one (100% precision). The roughly 40% of failures it misses are genuine misses: cases where the heuristic detectors have no matching pattern. These are the cases where LLM escalation adds value.
The per-category breakdown reveals why heuristics dominate:
| Failure Category | Pisama F1 | Best LLM F1 |
|---|---|---|
| Context Handling | 0.978 | 0.00 |
| Specification Compliance | 1.000 | N/A |
| Loop / Resource Abuse | 1.000 | ~0.30 |
| Tool Selection Errors | 1.000 | ~0.57 |
| Hallucination (language) | 0.884 | 0.59 |
| Goal Deviation | 0.829 | 0.70 |
Context handling — where LLMs score literally zero — is where heuristic detectors achieve near-perfect detection. The same pattern holds for loops, specification compliance, and tool errors. These categories have strong structural signals that pattern matchers extract reliably.
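For readers less familiar with the metric, the F1 scores in the table combine precision and recall into one number. The sketch below shows the standard calculation; the function name and the counts in the usage note are made up for illustration and are not taken from either benchmark.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall, and F1 from raw detection counts.

    tp: failures correctly flagged
    fp: flags raised where no failure was labeled
    fn: labeled failures the detector missed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Equivalent to the harmonic mean of precision and recall.
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1
```

With hypothetical counts such as `precision_recall_f1(60, 0, 40)`, precision is perfect while F1 is still pulled down by the misses, which is why the tables report both views.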
| Method | Agent Accuracy | Step Accuracy | Cost per Case |
|---|---|---|---|
| GPT-4o | 44.9% | 8.7% | ~$0.05 |
| o1 | 53.5% | 14.2% | ~$0.15 |
| Pisama heuristic-only | 31.0% | 16.8% | $0.000 |
| Pisama + Haiku 4.5 | 39.7% | 15.5% | $0.004 |
| Pisama + Sonnet 4 | 60.3% | 24.1% | $0.021 |
Who&When tells a more nuanced story. Heuristic-only detection beats o1 on step localization (16.8% vs 14.2%), because finding when the failure happened is a structural question. But it trails on agent identification (31.0% vs 53.5%), because figuring out who is to blame requires reading comprehension and causal reasoning.
The hybrid approach (heuristics for detection, a single Sonnet call for attribution) beats every baseline on both metrics, at 60.3% agent accuracy and 24.1% step accuracy, for about $0.02 per case.
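One way to picture the hybrid split is sketched below. This is an assumption-laden illustration, not Pisama's implementation: `attribute_blame`, the step dictionary layout (`agent`, `content`), and the `ask_llm` callable are all hypothetical. The heuristic side supplies the flagged step; the single model call only has to name the responsible agent from a short excerpt.

```python
from typing import Callable, Optional


def attribute_blame(
    steps: list[dict],
    flagged_step: int,
    ask_llm: Callable[[str], str],
    window: int = 3,
) -> dict:
    """Ask one LLM call to name the responsible agent near a flagged step.

    `steps` is assumed to be a list of {"agent": str, "content": str} dicts.
    `ask_llm` is any caller-supplied function mapping a prompt to a reply.
    """
    lo = max(0, flagged_step - window)
    hi = min(len(steps), flagged_step + window + 1)
    excerpt = "\n".join(
        f"[{i}] {steps[i]['agent']}: {steps[i]['content']}" for i in range(lo, hi)
    )
    agents = sorted({s["agent"] for s in steps})
    prompt = (
        f"A failure was detected at step {flagged_step}.\n"
        f"Conversation excerpt:\n{excerpt}\n\n"
        f"Which agent is the root cause? Answer with exactly one of: {', '.join(agents)}"
    )
    answer = ask_llm(prompt).strip()
    # Reject replies that do not name a known agent instead of guessing.
    agent: Optional[str] = answer if answer in agents else None
    return {"step": flagged_step, "agent": agent}
```

Keeping the excerpt to a small window around the flagged step is what holds the call cheap: the model never sees the full trace.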
Across our 7,212-entry golden dataset, mean F1 across 18 production detectors is 0.701 with cross-validation. The distribution:
Production tier (F1 >= 0.70):
Beta tier (F1 0.40-0.70):
These numbers represent heuristic-only performance on diverse, real-world data from external sources. No cherry-picking, no synthetic test cases. The variance across detector types is informative: structural failures (decomposition, corruption, coordination) are easier to catch with rules than semantic failures (communication, derailment).
Agent failures leave measurable traces that don't require language understanding to detect:
Loops are repeated states. If the same sequence of node visits or tool calls appears three times, that's a loop. A hash comparison catches exact repetition. Subsequence matching catches cycles. You don't need to "understand" that the agent is stuck — you need to measure state repetition. Pisama's loop detector achieves F1 1.000 on TRAIL's loop/resource abuse category.
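A minimal sketch of both checks, assuming states are JSON-serializable dictionaries and tool calls are plain strings. The function names and the default threshold of three repeats are illustrative choices, not Pisama's actual detector.

```python
import hashlib
import json
from collections import Counter
from typing import Optional


def state_hash(state: dict) -> str:
    """Stable hash of a state snapshot (keys sorted so ordering does not matter)."""
    payload = json.dumps(state, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def has_exact_repetition(states: list[dict], threshold: int = 3) -> bool:
    """True if any identical snapshot appears `threshold` or more times."""
    counts = Counter(state_hash(s) for s in states)
    return any(n >= threshold for n in counts.values())


def find_cycle(calls: list[str], repeats: int = 3) -> Optional[list[str]]:
    """Return the shortest call sequence repeated `repeats` times back to back."""
    n = len(calls)
    for length in range(1, n // repeats + 1):
        for start in range(0, n - length * repeats + 1):
            window = calls[start:start + length]
            if all(
                calls[start + k * length:start + (k + 1) * length] == window
                for k in range(1, repeats)
            ):
                return window
    return None
```

For example, `find_cycle(["plan", "search", "read", "search", "read", "search", "read", "answer"])` picks out the repeating `search`, `read` pair without any knowledge of what those tools do.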
Context neglect is missing coverage. If upstream context contains twelve specific data points — numbers, dates, proper nouns, URLs — and the downstream output references zero of them, context was ignored. This is an element extraction and coverage measurement, not a judgment call. F1: 0.978 on TRAIL's context handling category.
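The coverage idea can be sketched in a few lines. The regular expression, the function names, and the 0.2 cutoff below are assumptions chosen for illustration; a production detector would need broader patterns and normalization.

```python
import re

# Hypothetical patterns for "specific data points". Alternation order matters:
# URLs and dates are tried before bare numbers so they are not split apart.
ELEMENT_RE = re.compile(
    r"https?://[^\s,;]+"                    # URLs
    r"|\b\d{4}-\d{2}-\d{2}\b"               # ISO dates
    r"|\b\d+(?:\.\d+)?%?"                   # numbers, optionally with a percent sign
    r"|\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)+\b"   # multi-word capitalized names
)


def extract_elements(text: str) -> set[str]:
    """Collect the distinct concrete elements mentioned in a text."""
    return {m.group(0) for m in ELEMENT_RE.finditer(text)}


def context_coverage(context: str, output: str) -> float:
    """Fraction of context elements that reappear verbatim in the output."""
    elements = extract_elements(context)
    if not elements:
        return 1.0  # nothing specific to cover
    # Plain substring matching: crude, but enough to illustrate the measurement.
    found = sum(1 for e in elements if e in output)
    return found / len(elements)


def neglects_context(context: str, output: str, min_coverage: float = 0.2) -> bool:
    """Flag outputs that reference too few of the upstream elements."""
    return context_coverage(context, output) < min_coverage
```

The score is a ratio rather than a yes/no answer, so the threshold can be tuned per workflow.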
State corruption is type drift. If a field that was a float is now a string, or a non-null field just became null, or a value changed direction five times in two seconds, the state is corrupted. These are delta comparisons on structured data. F1: 0.909 on our calibration dataset.
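As a rough illustration of those delta comparisons, the sketch below compares two state snapshots and counts direction changes in a numeric series. The names are hypothetical, the time window from the example above is left out for brevity, and the strict type comparison would also flag an int becoming a float, which a real detector might want to allow.

```python
def state_deltas(before: dict, after: dict) -> list[str]:
    """Report null transitions and type drift for keys present in both snapshots."""
    findings = []
    for key, old in before.items():
        if key not in after or old is None:
            continue  # only compare fields that exist and had a value
        new = after[key]
        if new is None:
            findings.append(f"{key}: became null")
        elif type(old) is not type(new):
            findings.append(
                f"{key}: type drift {type(old).__name__} -> {type(new).__name__}"
            )
    return findings


def direction_changes(values: list[float]) -> int:
    """Count how often a numeric series flips between rising and falling."""
    changes, last_sign = 0, 0
    for prev, cur in zip(values, values[1:]):
        sign = (cur > prev) - (cur < prev)  # +1 rising, -1 falling, 0 flat
        if sign and last_sign and sign != last_sign:
            changes += 1
        if sign:
            last_sign = sign
    return changes
```

A caller would pair `direction_changes` with timestamps to express a rule like "five flips within two seconds".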
Specification compliance is requirement coverage. Extract the requirements from the spec ("REST API", "JWT authentication", "PostgreSQL"). Check whether the output addresses each one. Stem matching and synonym expansion handle paraphrasing. This is information retrieval, not language understanding. F1: 1.000 on TRAIL.
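A toy version of requirement coverage might look like the following. Everything here is an assumption for illustration: the synonym table is tiny and hand-written, and `crude_stem` is a deliberately rough stand-in for a real stemmer.

```python
import re

# Hypothetical synonym table; a real detector would use a much larger one.
SYNONYMS = {
    "postgresql": {"postgres"},
    "authentication": {"auth", "login"},
    "jwt": {"json web token"},
}


def crude_stem(word: str) -> str:
    """Very rough suffix stripping, standing in for a proper stemmer."""
    for suffix in ("ication", "ation", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def stems(text: str) -> set[str]:
    return {crude_stem(tok) for tok in tokens(text)}


def term_covered(term: str, output_stems: set[str]) -> bool:
    """A term counts as covered if it, or any listed synonym, appears in the output."""
    candidates = [term, *SYNONYMS.get(term, ())]
    return any(stems(c) <= output_stems for c in candidates)


def missing_requirements(requirements: list[str], output: str) -> list[str]:
    """Return the requirements whose terms are not all covered by the output."""
    output_stems = stems(output)
    return [
        req for req in requirements
        if not all(term_covered(t, output_stems) for t in tokens(req))
    ]
```

The return value is the list of unmet requirements, so an empty list means the output addressed every item in the spec.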
The underlying principle comes from Gerd Gigerenzer's research on decision-making: in uncertain environments with high-dimensional inputs, simple rules that focus on the most diagnostic cue often outperform complex models that try to weight all available information. Agent failure detection is exactly this kind of problem. The traces are long and complex, but the failure signal is usually concentrated in one diagnostic feature — state repetition for loops, element coverage for context neglect, type changes for corruption.
A purpose-built heuristic that knows exactly which signal to extract will beat a general-purpose LLM that has to figure out what to look for in a 50,000-token trace.
Heuristics have clear limits. Three tasks consistently require LLM-level reasoning: blame attribution, novel failure detection, and subjective quality assessment.
When three agents collaborate and the output is wrong, determining which agent caused the failure requires causal reasoning. "The WebSurfer clicked an irrelevant link" vs. "The Orchestrator gave unclear instructions" — distinguishing root cause from downstream consequence requires reading comprehension that heuristics can't provide.
This is exactly what the Who&When results show: heuristics match or slightly beat LLMs on step localization (a structural question) but trail on agent identification (a semantic question).
Heuristic detectors match known failure patterns. If an agent fails in a way that doesn't match any of the 18 defined patterns — a genuinely new failure mode — heuristics will miss it entirely. An LLM judge serves as a catch-all for out-of-distribution failures, trading cost for coverage.
"Is this summary good enough?" is not a question heuristics can answer. Detecting that a summary is incomplete (missing 4 of 10 required points) is a heuristic problem. Judging whether the summary is well-written is a semantic one.
The right approach isn't heuristics or LLMs. It's heuristics then LLMs, with escalation based on confidence.
Pisama uses five detection tiers:
| Tier | Method | Cost | When It Runs |
|---|---|---|---|
| 1 | Hash comparison | ~$0.00 | Always — every trace |
| 2 | State delta analysis | ~$0.00 | Always — every trace |
| 3 | Embedding similarity | $0.01-0.02 | When tiers 1-2 are inconclusive |
| 4 | LLM judge | $0.02-0.10 | Gray-zone cases only |
| 5 | Human review | Variable | High-stakes decisions |
Tiers 1 and 2 are pure heuristics: hash collisions, type changes, pattern matching, coverage counting. They run on every trace and catch the majority of failures at zero marginal cost.
Tier 3 uses embeddings for cases that require fuzzy matching: semantic loop detection (same meaning, different words), persona drift measurement, and grounding verification. This costs one to two cents per trace.
Tier 4 invokes an LLM only for cases where the lower tiers produced low-confidence results. On TRAIL, approximately 40% of failures require escalation beyond heuristics. But the remaining 60% are caught for free.
The average cost per trace across our production workload is under $0.05. Compare that to running every trace through a frontier LLM at $0.10-0.30 per trace — a 2-6x cost reduction with better accuracy on structural failures.
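The escalation logic can be sketched as a loop over tiers that stops at the first confident verdict. This is a simplified illustration under assumed names (`Verdict`, `Tier`, `run_tiers`) and an assumed confidence threshold of 0.8; it does not reflect Pisama's internal API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    failed: bool
    confidence: float  # 0.0 (no idea) to 1.0 (certain)


@dataclass
class Tier:
    name: str
    cost: float  # assumed dollars per trace for running this tier
    check: Callable[[dict], Verdict]


def run_tiers(
    trace: dict, tiers: list[Tier], min_confidence: float = 0.8
) -> tuple[str, Optional[Verdict], float]:
    """Run tiers cheapest-first and stop at the first confident verdict.

    Returns the deciding tier's name, its verdict, and the total cost spent.
    If no tier is confident, the trace is routed to human review.
    """
    spent = 0.0
    verdict: Optional[Verdict] = None
    for tier in tiers:
        spent += tier.cost
        verdict = tier.check(trace)
        if verdict.confidence >= min_confidence:
            return tier.name, verdict, spent
    return "human_review", verdict, spent
```

Because the tiers are ordered by cost, a trace only incurs the price of an LLM call when every cheaper check has come back inconclusive.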
If you're building evaluation pipelines for AI agents, three takeaways from our data:
1. Don't default to LLM-as-judge for everything. It's the most expensive option and underperforms on structural failure categories. Use it where it adds unique value: blame attribution, novel failure detection, subjective quality.
2. Invest in heuristic detectors for known failure patterns. Loops, state corruption, context neglect, specification compliance — these have strong structural signals. A well-calibrated heuristic detector will be faster, cheaper, and more accurate than an LLM judge for these categories.
3. Tier your detection pipeline. Run cheap checks first. Escalate to expensive checks only when needed. This isn't just a cost optimization — it's an accuracy optimization. Heuristics have higher precision on structural failures because they're measuring the exact signal rather than reasoning about it.
The 60.1% vs 11% gap on TRAIL isn't because frontier LLMs are bad at reasoning. It's because systematic structural analysis is a different skill than open-ended language understanding, and purpose-built tools outperform general-purpose tools on well-defined tasks. This has been true in software engineering for decades. It's equally true for agent evaluation.
```shell
pip install pisama
```

```python
from pisama import analyze

# Analyze a saved trace and print each detected issue.
result = analyze("trace.json")
for issue in result.issues:
    print(f"[{issue.type}] {issue.summary}")
    print(f" Severity: {issue.severity}/100")
```

CLI and MCP server for IDE integration:

```shell
pisama analyze trace.json
pisama watch python my_agent.py
pisama detectors
```
Full documentation at docs.pisama.ai. Source and benchmark reproduction instructions at github.com/tn-pisama/mao-testing-research.
All calibration data, benchmark scripts, and detector source code are open. We'd rather have the approach scrutinized and improved than accepted on authority.