Enterprises are pouring money into AI agents. The results are brutal.
MIT's NANDA initiative just published "The GenAI Divide: State of AI in Business 2025" — a study based on 150 leader interviews, 350 employee surveys, and analysis of 300 public AI deployments. The headline finding: about 5% of AI pilot programs achieve rapid revenue acceleration. The vast majority stall, delivering little to no measurable impact on P&L.
That's despite $30–40 billion in enterprise spending on generative AI.
Meanwhile, IBM's 2025 CEO Study — surveying 2,000 CEOs — found that only 25% of AI initiatives have delivered expected ROI, and just 16% have been scaled across the enterprise.
So what separates the 5% from the 95%?
According to LangChain's State of AI Agents report, 51% of the 1,300+ professionals surveyed already have AI agents running in production, and 78% have active plans to deploy them soon. Mid-sized companies (100–2,000 employees) are the most aggressive: 63% already have agents live.
But here's the gap: most teams shipping agents to production cannot see what those agents are actually doing.
A typical multi-step AI agent might parse the user's request, fill a prompt template, retrieve relevant documents, call one or more external tools, ask an LLM to reason over the combined context, and format a final answer.
When this chain breaks — and it will — where did it go wrong? Was it the retrieval step returning irrelevant documents? The LLM hallucinating despite good context? A tool call timing out silently? A prompt template that worked in testing but fails on edge cases?
Without distributed tracing, you're debugging blind. You get an input and an output, with no visibility into the six steps in between.
The teams that extract real value from AI agents treat them like any other production system: they instrument them.
The MIT NANDA study found that the core differentiator wasn't talent, infrastructure, or regulation. It was learning, integration, and contextual adaptation, all of which require understanding how your agents behave in the real world, not just in a Jupyter notebook.
Concretely, the teams that succeed do three things:
First, they trace every step of the agent, not just the LLM call: every tool invocation, every decision point, every data retrieval. A proper trace shows you the full execution graph: what the agent decided to do, what data it accessed, what the LLM returned at each step, and how long each operation took.
This is the difference between "the agent gave a wrong answer" and "the agent retrieved the right documents but the LLM ignored the most relevant paragraph because the context window was filled with a previous tool call's output."
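To make that concrete, here is a minimal sketch of step-level instrumentation using the OpenTelemetry Python SDK. The span names, attributes, and helper functions are illustrative stand-ins, not a finalized convention or any specific vendor's API.

```python
# Minimal sketch: one span per agent step, exported to the console.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent")

# Stand-ins for the real retrieval, tool, and LLM layers.
def retrieve_documents(question):
    return ["doc-1", "doc-2"]

def query_pricing_db(question):
    return {"plan": "pro", "price": 49}

def call_llm(question, docs, tool_result):
    return "stubbed answer"

def run_agent(question: str) -> str:
    with tracer.start_as_current_span("agent.run") as root:
        root.set_attribute("agent.input", question)

        with tracer.start_as_current_span("agent.retrieve") as span:
            docs = retrieve_documents(question)
            span.set_attribute("retrieval.count", len(docs))

        with tracer.start_as_current_span("agent.tool_call") as span:
            span.set_attribute("tool.name", "pricing_db")
            tool_result = query_pricing_db(question)

        with tracer.start_as_current_span("agent.llm_call") as span:
            answer = call_llm(question, docs, tool_result)
            span.set_attribute("llm.output_chars", len(answer))

        root.set_attribute("agent.output", answer)
        return answer

print(run_agent("Which plan includes SSO?"))
```

Each nested span carries its own timing and attributes, so a broken run points to the step that broke instead of just the final answer.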
Second, they track cost per trace. A single agent workflow might hit OpenAI for reasoning, Anthropic for evaluation, and Google for embeddings. Most teams have no idea what a single agent run actually costs, let alone how that breaks down by user, feature, or team.
When you're running 10,000 agent executions a day across three LLM providers, the bill is not theoretical. And without per-trace cost attribution, you can't optimize what you can't measure.
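For illustration, here is one way per-trace cost attribution can look once token counts are captured on each LLM span. The price table, model names, and span shape are placeholders, not real rates or any vendor's schema.

```python
# Sketch: roll up LLM span costs by trace, user, or any other dimension.
from collections import defaultdict
from dataclasses import dataclass

# USD per 1M tokens; illustrative values only, not current provider pricing.
PRICES = {
    ("openai", "reasoning-model"): {"input": 2.00, "output": 8.00},
    ("anthropic", "judge-model"):  {"input": 3.00, "output": 15.00},
    ("google", "embedding-model"): {"input": 0.10, "output": 0.00},
}

@dataclass
class LLMSpan:
    trace_id: str
    user_id: str
    provider: str
    model: str
    input_tokens: int
    output_tokens: int

def cost_of(span: LLMSpan) -> float:
    price = PRICES[(span.provider, span.model)]
    return (span.input_tokens * price["input"]
            + span.output_tokens * price["output"]) / 1_000_000

def cost_by(spans, key):
    """Aggregate cost per trace, per user, per feature, etc."""
    totals = defaultdict(float)
    for span in spans:
        totals[getattr(span, key)] += cost_of(span)
    return dict(totals)

spans = [
    LLMSpan("t1", "u42", "openai", "reasoning-model", 12_000, 800),
    LLMSpan("t1", "u42", "anthropic", "judge-model", 3_000, 200),
    LLMSpan("t2", "u7", "google", "embedding-model", 5_000, 0),
]
print(cost_by(spans, "trace_id"))
print(cost_by(spans, "user_id"))
```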
Third, they run automated evaluations on production traffic. The difference between a demo and production isn't speed; it's quality at scale. A single hand-tested prompt doesn't tell you how the agent performs across 10,000 different user inputs.
Automated evaluation — using LLM-as-judge scoring for relevance, coherence, and hallucination detection on every trace — turns observability from a debugging tool into a quality system.
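Here is a sketch of what that LLM-as-judge loop can look like, assuming the OpenAI Python SDK; the judge model, rubric, and scoring thresholds are assumptions you would tune for your own traces.

```python
# Sketch: score a completed trace with an LLM judge and flag hallucinations.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI agent's answer.
Question: {question}
Retrieved context: {context}
Agent answer: {answer}

Return JSON with integer scores 1-5 for "relevance" and "coherence",
and a boolean "hallucination" that is true if the answer states facts
not supported by the context."""

def judge_trace(question: str, context: str, answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
    )
    return json.loads(response.choices[0].message.content)

scores = judge_trace(
    question="Which plans include SSO?",
    context="Doc: SSO is available on the Enterprise plan only.",
    answer="SSO is included on Pro and Enterprise plans.",
)
if scores.get("hallucination"):
    print("flag trace for review:", scores)
```

Run on every trace (or a sampled subset), the scores become a time series you can alert on, which is what turns tracing into a quality system.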
Beyond basic tracing and cost tracking, there are deeper failure modes that existing tools don't address at all — and they're the ones causing the most expensive production incidents.
Here's a failure mode most teams don't even know to look for: agents that skip tool execution entirely and fabricate the results.
Instead of actually calling your database, search API, or calculation tool, the agent generates a plausible-looking response as if it had. The output looks normal. No error is thrown. But the data is completely made up.
This isn't theoretical. It's documented across every major framework, including CrewAI, LangGraph, and AutoGen, and even at the model level with OpenAI. Academic research has found tool hallucination rates as high as 91.1% on challenging subsets. LangGraph proposed a "grounding" parameter (RFC #6617) to address this, but hasn't shipped it.
No existing observability tool detects this. They trace the span, record the output, and move on — never verifying that the tool was actually executed and the result matches reality.
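Detection doesn't have to be exotic. If the trace records both the tool calls the model claimed to make and the tool executions that actually ran, the two sets can be diffed. The span shape and attribute names below are hypothetical, not any existing tool's schema.

```python
# Sketch: flag tools the LLM referenced that never actually executed.
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str              # e.g. "llm.call" or "tool.execute"
    attributes: dict = field(default_factory=dict)

def detect_skipped_tools(spans: list[Span]) -> list[str]:
    """Return tool names the model claimed to call but that never ran."""
    requested, executed = set(), set()
    for span in spans:
        if span.name == "llm.call":
            requested.update(span.attributes.get("llm.requested_tools", []))
        elif span.name == "tool.execute":
            executed.add(span.attributes["tool.name"])
    return sorted(requested - executed)

trace = [
    Span("llm.call", {"llm.requested_tools": ["search_orders", "refund_api"]}),
    Span("tool.execute", {"tool.name": "search_orders"}),
    # No span for refund_api: any "refund issued" text in the answer was fabricated.
    Span("llm.call", {"llm.requested_tools": []}),
]
missing = detect_skipped_tools(trace)
if missing:
    print("silent failure: tools never executed ->", missing)
```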
Every observability tool on the market shows you traces the same way: as a flat table of spans, or a sequential waterfall chart. This works for simple chain-of-thought workflows. It completely breaks down for multi-agent systems where agents make branching decisions.
When Agent A decides to delegate to Agent B instead of Agent C, then Agent B decides to call two tools in parallel, and the combined results trigger a third agent — you need to see this as what it is: a decision tree, not a sequential log.
While Arize Phoenix has introduced an Agent Visibility tab with basic graph visualization, no tool offers a fully interactive decision-tree view of agent execution paths. LangSmith, Langfuse, Braintrust, Helicone, Opik, and HoneyHive still rely on tabular or span-level views. The visual decision tree remains a largely unsolved UX problem in this market.
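The underlying data already supports a tree view: spans carry parent IDs, so the flat table can be folded back into the execution tree. A minimal sketch, with illustrative field names:

```python
# Sketch: rebuild and print the execution tree a waterfall view flattens away.
from collections import defaultdict

spans = [
    {"id": "1", "parent": None, "name": "agent_a.plan"},
    {"id": "2", "parent": "1",  "name": "agent_a.delegate -> agent_b"},
    {"id": "3", "parent": "2",  "name": "agent_b.tool: web_search"},
    {"id": "4", "parent": "2",  "name": "agent_b.tool: sql_query"},
    {"id": "5", "parent": "1",  "name": "agent_c.summarize"},
]

children = defaultdict(list)
for span in spans:
    children[span["parent"]].append(span)

def print_tree(parent_id=None, depth=0):
    for span in children[parent_id]:
        print("  " * depth + span["name"])
        print_tree(span["id"], depth + 1)

print_tree()
```

The hard part isn't the data structure; it's making that tree interactive, filterable, and legible at production volume.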
Multi-agent architectures are the fastest-growing pattern in AI development. But multi-agent tracing is broken in every existing tool:
You can see that Agent A called Agent B. You cannot see why Agent A chose Agent B over Agent C, what context was lost in the handoff, or why the negotiation between three agents converged on a suboptimal plan.
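One way to keep a handoff inside a single trace, and to record why the delegation happened, is standard W3C trace-context propagation plus a few custom attributes. This is a sketch using the OpenTelemetry Python SDK; the delegation.* attribute names are assumptions, since no convention exists for them yet.

```python
# Sketch: propagate trace context across an agent handoff and record the
# delegation decision as span attributes.
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("agents")

def agent_a(task: str) -> dict:
    with tracer.start_as_current_span("agent_a.plan") as span:
        span.set_attribute("delegation.chosen", "agent_b")
        span.set_attribute("delegation.rejected", "agent_c")
        span.set_attribute("delegation.reason", "task requires database access")
        carrier: dict = {}
        inject(carrier)  # serialize the current trace context into the handoff
        return {"task": task, "otel": carrier}

def agent_b(handoff: dict) -> str:
    ctx = extract(handoff["otel"])  # continue the same trace elsewhere
    with tracer.start_as_current_span("agent_b.execute", context=ctx):
        return f"handled: {handoff['task']}"

print(agent_b(agent_a("summarize open support tickets")))
```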
Enterprises already run OpenTelemetry for their backend services. Their AI agents should emit traces into the same system — not require a separate vendor with a separate SDK.
But OTel's semantic conventions for AI agents are still in "Development" status as of February 2026. The conventions cover individual LLM calls but lack standard attributes for agent orchestration (planning, tool selection, delegation), tool call semantics, multi-agent communication, memory operations, evaluation results, and cost attribution. Each vendor that claims "OTel support" extends the base conventions differently, creating fragmentation.
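To see the fragmentation concretely: the draft gen_ai.* attributes cover the LLM call itself, so anything agent-specific lands in ad-hoc namespaces. In the sketch below, the gen_ai.* names track the in-development conventions and may still change; the myco.* names are invented stand-ins for the vendor- or team-specific extensions this gap forces.

```python
# Sketch: draft-standard attributes on the LLM call, ad-hoc ones for everything else.
from opentelemetry import trace

tracer = trace.get_tracer("agents")

with tracer.start_as_current_span("llm.call") as span:
    # Covered by the draft GenAI semantic conventions (still subject to change):
    span.set_attribute("gen_ai.request.model", "some-model")
    span.set_attribute("gen_ai.usage.input_tokens", 1200)
    span.set_attribute("gen_ai.usage.output_tokens", 80)
    # Not standardized anywhere yet, so every vendor invents its own spelling:
    span.set_attribute("myco.agent.plan_step", "tool_selection")
    span.set_attribute("myco.agent.delegated_to", "billing_agent")
    span.set_attribute("myco.eval.relevance_score", 4)
    span.set_attribute("myco.cost.usd", 0.0042)
```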
Arize Phoenix is the closest to OTel-native, but it jumps from $50/month to $50K–$100K/year at the enterprise tier — a pricing cliff that locks out mid-market teams. Datadog now supports OTel-native ingestion for LLM observability, but its full feature set still relies on the ddtrace SDK.
The existing options each have trade-offs that leave mid-market teams underserved.
What's missing is agent-native observability that gives you visual decision-tree debugging, multi-agent trace correlation, silent failure detection, and OTel-native instrumentation — without per-seat pricing that scales against you, or infrastructure you have to run yourself.
If you could see every step your AI agent takes as an interactive decision tree instead of a flat table, understand exactly why it failed, catch silent fabrications before they reach users, and plug directly into your existing OTel infrastructure, what would that change for your team?
I'm researching how engineering teams debug AI agents in production. If you're building with LangChain, CrewAI, AutoGen, or custom agents and have 15 minutes, I'd genuinely like to hear how you approach this today.
No pitch. I'm collecting real data on this problem and happy to share what I'm learning from other teams.