Eastern DevI Ran a Health Check on 3 Popular AI Agents. The Results Were Horrifying. You wrote 100...
You wrote 100 lines of agent code. You called the OpenAI API, wired up a tool, maybe added a retry loop. It works in the demo. It works in staging. You ship it.
But have you checked how fragile it actually is?
I ran nb doctor v2 — an open-source diagnostic CLI that scans your Python codebase for agent health risks — against three popular open-source agent projects. What I found explains why 87% of production agents experience 3 or more disruptions per week, and why 72% of runtime failures never self-heal.
Let me show you the numbers.
nb doctor v2 scores your agent across four dimensions:
| Dimension | What It Checks |
|---|---|
| Reliability | Retry storms, dead loops, unchecked tool calls, missing timeouts |
| Context Health | Unbounded message history, missing max_tokens, context drift |
| Cascade Risk | No circuit breakers, no checkpoints, unbounded fan-out |
| Security | Prompt injection, hardcoded keys, eval/subprocess, overprivileged tools |
Each dimension gets a 0–100 score. Below 60 is a failing grade. Below 40 means your agent is an incident waiting to happen.
Here's what happened when I scanned a popular CrewAI-based project with ~800 lines of agent code:
╔══════════════════════════════════════════╗
║ 🏥 NeuralBridge Doctor v2.0 ║
║ Agent Health Diagnosis Report ║
╠══════════════════════════════════════════╣
║ ║
║ Reliability ████████░░ 78% B ║
║ Context Health ██████░░░░ 62% C ║
║ Cascade Risk ████░░░░░░ 41% D ║
║ Security ███████░░░ 71% C+ ║
║ ║
║ Overall Grade: C+ ║
║ Critical Issues: 3 Warnings: 7 ║
╚══════════════════════════════════════════╝
A C+. On a project with 800 lines. Three critical issues. Seven warnings.
Let's break down what nb doctor actually found — and why each one is a production time bomb.
# agent.py line 47
response = openai.chat.completions.create(model="gpt-4", messages=messages)
No try/except. When OpenAI goes down — and it does, for 34 hours straight in 2025 — your agent crashes. No fallback. No retry. Just a stack trace at 3 AM and an alert nobody's looking at.
nb doctor flagged this as CRITICAL because it's the #1 cause of agent outages: naked API calls with zero resilience.
# pipeline.py line 112
while True:
result = client.run(agent_config)
# ... no break condition, no backoff, no max retries
This is a retry storm waiting to happen. The agent loops forever, hammering the API with identical requests. One real incident from our industry report: a support agent retried a CRM lookup 847 times in 22 minutes. Every call returned 200 OK. The monitoring dashboard showed green. The agent was burning tokens and producing nothing.
# config.py line 8
openai_api_key = "sk-proj-xxxx..."
This needs no explanation. But nb doctor finds it anyway — because people still do it.
The seven warnings are quieter but equally deadly over time:
max_tokens on 4 API calls — responses can bloat the context window until the model starts hallucinatingmessages.append() without truncation — context grows unbounded across a long-running sessionIndividually, each warning looks minor. Together, they explain why your agent works in testing but falls apart after 6 hours in production.
I scanned two more agents — a LangGraph research agent and a custom ReAct implementation. The pattern was identical:
| Agent | Lines | Reliability | Context | Cascade | Security | Overall |
|---|---|---|---|---|---|---|
| CrewAI-based | 812 | 78% | 62% | 41% | 71% | C+ |
| LangGraph research | 1,204 | 71% | 58% | 35% | 65% | C |
| Custom ReAct | 543 | 82% | 70% | 48% | 59% | C |
None of them broke B on cascade risk. All of them had at least 2 critical issues. The average overall grade was a C.
These aren't bad developers. They're normal developers building agents with normal tooling — tooling that was never designed for autonomous, long-running, multi-step execution.
These scan results aren't outliers. They match what's happening across the industry:
gpt-4 call dead in the waterThe gap isn't in AI capability. It's in operational resilience.
pip install neuralbridge-sdk
nb doctor /path/to/your/agent
This scans your entire codebase and gives you the radar chart — every naked API call, every unbounded message list, every missing circuit breaker. Zero config. Zero dependencies. You'll know exactly where your agent is fragile.
Based on what nb doctor finds, the most common fixes are:
max_tokens to prevent context bloatmessages = messages[-MAX_HISTORY:]
while loopos.environ
Manual fixes work today. But when OpenAI goes down at 3 AM, you need automated recovery:
from neuralbridge import register, heal
# Register fallback models
register("gpt-4", strategy="fallback",
alternatives=["gpt-4o-mini", "claude-3.5-sonnet"])
# Wrap your LLM calls — auto-retry, auto-fallback, auto-heal
response = heal(lambda: openai.chat.completions.create(
model="gpt-4", messages=messages, max_tokens=2048
))
When the primary model fails, NeuralBridge automatically falls back. When context bloats, it triages. When a cascade starts, it circuit-breaks. 95.19% self-heal rate. 0.0025ms overhead.
Your agent isn't as reliable as you think. The demo doesn't test for retries at 3 AM, context overflow after 6 hours, or model outages that last a day and a half.
Run the diagnostic. See the numbers. Then decide if you want to keep crossing your fingers — or actually fix the problem.
pip install neuralbridge-sdk
nb doctor .
Your agent's report card is waiting. I hope it's better than a C+.
This is Article 9 in our Agent Runtime Operations series. Read Article 7 on how Anthropic's price hikes are bleeding agent budgets and Article 8 on why we're defining a new operational category.