Tuomo NikulainenMulti-agent AI systems fail in ways that don't throw errors. Learn the 5 most common silent failure modes and how heuristic detection catches them without expensive LLM calls.
Your multi-agent system is broken right now. Not in the obvious way — no stack traces, no 500 errors, no crashes. The agents are running. They're producing output. Your dashboard shows green. But the output is wrong, the costs are climbing, and nobody knows.
This is the defining problem of multi-agent AI systems in production: they fail silently. Traditional monitoring watches for exceptions and timeouts. Multi-agent failures are different. The system keeps running. It just stops doing what you intended.
After analyzing over 7,000 agent execution traces from 13 external sources, we identified five failure modes that account for the majority of silent production failures. Here's what each looks like in practice, and how to catch them before your users do.
What happens: An agent gets stuck repeating the same sequence of actions indefinitely. No error is thrown because each individual step succeeds. The loop looks like productive work from the outside.
What it looks like in production:
A customer support agent system classifies incoming messages and, when uncertain, asks the user for clarification before re-classifying. A user sends a genuinely ambiguous message. The classifier says "unclear," the system asks a clarifying question, the user's response is still ambiguous, the classifier says "unclear" again. This cycle continues for hours.
Each iteration is a valid API call. Each response is grammatically correct. The system is "working." But it's been asking variations of the same question for six hours and has burned through thousands of dollars in LLM API calls.
Another common variant: a planner agent delegates to a researcher, the researcher says it needs more context, the planner re-delegates with slightly different wording. The state changes on every iteration — different wording, different timestamps — so naive deduplication misses it.
Why traditional monitoring misses it: Each API call returns 200. Latency is normal. Error rate is zero. The only signal is the pattern of repeated behavior over time, which requires tracking execution history across steps.
How heuristic detection catches it: Loop detection doesn't need to understand what agents are saying. It needs to recognize structural repetition. Hash-based comparison catches exact state repetition instantly. Subsequence matching catches cycles where the same sequence of node visits repeats (planner -> researcher -> planner -> researcher). Semantic clustering groups paraphrased messages that say the same thing in different words. These methods cost nothing to run and catch loops within seconds, not hours.
What happens: Shared state that agents read and write becomes inconsistent. A field that should contain a number now contains a string. A critical value silently becomes null. Two agents overwrite each other's changes.
What it looks like in production:
A multi-agent pipeline processes customer orders. Agent A reads the order amount as 149.99 and writes a shipping calculation to shared state. Agent B, running concurrently, writes a discount calculation that overwrites the shipping field with a string: "10% off". Agent C reads the shipping field, expecting a float, and silently converts it to 0.0. The order ships for free. Nobody notices until the monthly reconciliation.
Another pattern: a workflow state dictionary has a status field tracking progress. Due to a race condition between the planner and executor agents, the status oscillates between "in_progress" and "complete" five times in two seconds. Each transition looks valid individually. But the rapid oscillation indicates a fundamental coordination problem.
Why traditional monitoring misses it: The state is always a valid Python dictionary. No type errors are thrown at runtime because Python is dynamically typed. The values are wrong, but they're the right type of wrong — they look plausible.
How heuristic detection catches it: State corruption detection compares consecutive state snapshots. It checks for type changes (a field that was a number is now a string), null transitions (a non-null field becomes null), and velocity anomalies (a field changing value more than five times in rapid succession). It also validates domain bounds — a price field should be non-negative, an age field shouldn't exceed 150. None of this requires an LLM. It's delta analysis on structured data, and it catches corruption the moment it happens.
What happens: An agent gradually deviates from its assigned role. A strict data validator starts writing marketing copy. A formal analyst adopts a casual tone. A specialist agent answers questions outside its domain.
What it looks like in production:
You have a multi-agent system where a "Security Reviewer" agent audits code changes. Its system prompt says: "You are a strict security reviewer. Only approve changes that pass all security checks. Flag any potential vulnerabilities." After 30 turns of conversation, the agent starts saying things like "Sure, that looks fine! Happy to approve." It's no longer reviewing security — it's being agreeable. The persona defined in the system prompt has been diluted by the conversational context.
This is especially insidious in long-running sessions. The system prompt is at the top of the context. As the conversation grows, its influence weakens relative to the accumulated conversational patterns. The agent picks up tone and behavior from user messages and other agents' outputs.
Why traditional monitoring misses it: The agent's responses are well-formed. They're contextually appropriate to the immediate message. The drift is gradual — no single response is obviously wrong. You'd have to compare the agent's behavior at turn 50 against its behavior at turn 1 to see the change, and traditional monitoring doesn't track behavioral consistency over time.
How heuristic detection catches it: Persona drift detection works by comparing the agent's output against its role definition. It checks whether the agent is using vocabulary consistent with its role, staying within its defined action boundaries, and maintaining a consistent communication style. If a "strict security reviewer" starts using approval language without citing specific security checks, the behavioral embedding drifts from the role definition embedding. The detector uses role-aware thresholds — an analytical agent has tighter behavioral bounds than a creative writing agent — because some roles naturally require more flexibility.
What happens: An agent ignores relevant information that was explicitly provided in its context. Previous agents' findings are discarded. Critical constraints are overlooked. The agent starts from scratch instead of building on upstream work.
What it looks like in production:
A research pipeline has three agents: Researcher, Analyst, and Writer. The Researcher spends 20 API calls gathering detailed competitive data and hands a structured analysis to the Analyst. The Analyst produces a thorough summary with key findings marked as CRITICAL. The Writer agent receives this analysis but produces a generic blog post that references none of the specific data, competitors, or findings from the upstream analysis. It says "based on our research" without using any actual research.
The output reads well. It's grammatically correct, topically relevant, and would fool a casual reader. But the entire point of the multi-agent pipeline — specialized agents building on each other's work — is defeated. You've paid for three agents but gotten the output of one.
Why traditional monitoring misses it: The Writer produced output. The output is on-topic. There are no errors. The failure is in what's missing — the specific findings, numbers, and insights from upstream agents that should have been incorporated.
How heuristic detection catches it: Context neglect detection extracts key information elements from upstream context — numbers, dates, proper nouns, URLs, items tagged as CRITICAL or IMPORTANT — and measures how many of those elements appear in the downstream output. If the upstream context contains twelve specific data points and the output references zero of them, that's not a stylistic choice. It's context neglect. This is a coverage measurement, not a semantic judgment. Count the elements, check for their presence, flag when utilization drops below threshold.
What happens: Agents end up waiting for each other in a way that prevents any of them from making progress. Agent A waits for B's approval. Agent B waits for A's data. Neither proceeds.
What it looks like in production:
A code review system has a Reviewer agent and an Implementer agent. The Reviewer says: "I need to see the updated tests before I can approve." The Implementer says: "I need the review approval before I can update the tests." Neither agent raises an error — they're both in a valid "waiting" state. The workflow appears to be "in progress" indefinitely.
Another common variant: excessive back-and-forth. Two agents exchange fifteen clarification messages without making any forward progress. Each message is a valid response to the previous one. But the conversation is circular — they're asking each other the same questions in different words.
In larger systems, circular delegation creates the same effect at scale. Task gets assigned from Agent A to B to C, and C delegates back to A. Each delegation is a valid action. The task just never gets done.
Why traditional monitoring misses it: Every agent is responsive. Message delivery is working. There are no timeouts because each agent replies promptly. The system is active — it's just not productive. You'd need to analyze the message content and flow patterns to recognize that no forward progress is being made.
How heuristic detection catches it: Coordination failure detection tracks message patterns between agent pairs. It counts acknowledgments — if Agent A sends three messages and Agent B never references them, that's a coordination failure. It detects back-and-forth patterns by tracking message exchange counts between pairs (threshold: more than three exchanges without measurable progress). It traces delegation chains to catch circular patterns. These are graph and counting operations on message metadata, not semantic analysis.
All five of these failure modes share something important: they leave measurable structural traces. Loops are repeated states. Corruption is changed types and null transitions. Persona drift is diverging behavior vectors. Context neglect is missing element coverage. Deadlocks are circular message patterns.
You don't need a large language model to detect any of them. You need purpose-built pattern matchers that know what failure signatures look like.
This is the core insight behind Pisama's detection approach: a tiered architecture where cheap heuristic detectors handle the first pass. Hash comparisons at tier 1 (free, milliseconds). State delta analysis at tier 2 (free, milliseconds). Embedding-based comparisons at tier 3 when needed ($0.01-0.02 per trace). LLM judges only at tier 4 for genuinely ambiguous cases that require semantic reasoning.
On the TRAIL benchmark from Patronus AI — 148 real agent traces with 841 human-labeled failures — this tiered approach achieves 60.1% joint accuracy with 100% precision at zero LLM cost. The best frontier model (Gemini 2.5 Pro) achieves 11%.
The precision number matters most: when Pisama says something is broken, it's always right. The 40% of failures it misses at the heuristic tier are the genuinely ambiguous cases where LLM escalation adds value. But the majority of silent failures — the loops, corruption, drift, neglect, and deadlocks — are caught by pattern matching that costs nothing and runs in seconds.
If you're running multi-agent systems in production and want to catch these failures before your users do:
pip install pisama
from pisama import analyze
result = analyze("trace.json")
for issue in result.issues:
print(f"[{issue.type}] {issue.summary}")
print(f" Severity: {issue.severity}/100")
print(f" Fix: {issue.recommendation}")
The documentation covers setup for LangGraph, CrewAI, AutoGen, n8n, and Dify integrations. The CLI (pisama analyze, pisama watch) and MCP server work with Cursor and Claude Desktop for detection during development.
The uncomfortable truth about multi-agent systems: if you aren't actively looking for silent failures, you have silent failures. The only question is how long they've been running.