Moazzam Qureshi
Most teams ship an AI agent, watch it work in a demo, and push it to production. Then it breaks on real traffic and nobody can say why. The gap between "worked in the demo" and "works in production" is almost always an evaluation gap — there was never a systematic way to measure what the agent actually does once real users hit it.
This is the complete evaluation process I run on every production agent I audit. It is vendor-neutral: the concepts apply whether you use LangSmith, Braintrust, Langfuse, Arize, or a homegrown harness. Treat it as the reference you wish someone had handed you before you shipped.
Every serious evaluation practice has exactly two modes, and they form a single continuous loop:
Offline evaluation — "test before you ship." You evaluate against a curated dataset during development, so you can compare versions and catch regressions before they reach users.
Online evaluation — "monitor in production." You evaluate real user interactions on live traffic, in real time, so you detect issues on the inputs your users actually send.
The loop closes when failing production traces flow back into your offline dataset. A real failure your monitoring caught becomes a new test case, so the next version is evaluated against the exact thing that hurt you. This feedback loop is the difference between an agent that gets more reliable over time and one that decays.
┌──────────────── offline evaluation ────────────────┐
│   datasets → evaluators → experiments → analysis   │
└─────────────────────────┬──────────────────────────┘
                          │ ship the version that passed
                          ▼
┌──────────────── online evaluation ─────────────────┐
│     production runs → evaluators → monitoring      │
└─────────────────────────┬──────────────────────────┘
                          │ failing traces become test cases
                          └──────────► back to datasets
A dataset is a collection of test cases (examples), each with an input and, for offline evals, a reference output. The single highest-leverage decision in your whole eval practice is where the dataset comes from.
Three sources, not equally valuable:
The mistake I see most: the team built the dataset by imagining how users behave. The eval passes. Production fails. The fix is always to rebuild from real traces. If you have no eval set at all (the common case), this is also how you build your first one.
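Rebuilding from real traces and closing the feedback loop come down to the same operation: take production runs that failed an online check and turn them into offline test cases. The following is a minimal sketch, not any vendor's API. The `Trace` and `Example` shapes, their field names, and `promote_failures` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Trace:
    """A logged production run. Field names are hypothetical."""
    trace_id: str
    user_input: str
    output: str
    passed_online_eval: bool


@dataclass(frozen=True)
class Example:
    """One offline test case built from a trace."""
    input: str
    reference_output: Optional[str] = None  # written later by a human reviewer
    source_trace_id: Optional[str] = None   # keeps the link back to production


def promote_failures(traces: List[Trace], dataset: List[Example]) -> List[Example]:
    """Return a new dataset with one example per failing trace
    whose input is not already covered."""
    seen = {example.input for example in dataset}
    grown = list(dataset)
    for trace in traces:
        if trace.passed_online_eval or trace.user_input in seen:
            continue
        grown.append(Example(input=trace.user_input, source_trace_id=trace.trace_id))
        seen.add(trace.user_input)
    return grown
```

Starting from an empty list gives you a first eval set; running it on a schedule keeps the set growing with whatever production actually broke on. The reference output is deliberately left empty, since a failing trace's output is by definition not the answer you want to test against.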
Four types, and choosing the right one per criterion is what separates real evaluation from theater:
Run your evaluators against your dataset to produce an experiment: a measurement of one agent version on one dataset. Four use cases:
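An experiment as defined above can be sketched in a few lines. This is an illustrative harness under stated assumptions: the dataset is a list of dicts with `input` and `reference` keys, the agent is any callable from string to string, and `exact_match` and `run_experiment` are hypothetical names rather than functions from a specific tool.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional

# (input, output, reference) -> score between 0.0 and 1.0
Evaluator = Callable[[str, str, Optional[str]], float]


def exact_match(_input: str, output: str, reference: Optional[str]) -> float:
    """Toy evaluator: 1.0 when output equals the reference, ignoring outer whitespace."""
    if reference is None:
        return 0.0
    return 1.0 if output.strip() == reference.strip() else 0.0


def run_experiment(
    agent: Callable[[str], str],
    dataset: List[dict],
    evaluators: Dict[str, Evaluator],
) -> dict:
    """Run one agent version over one dataset and score every example."""
    if not dataset:
        raise ValueError("dataset must contain at least one example")
    rows = []
    for example in dataset:
        output = agent(example["input"])
        scores = {
            name: evaluate(example["input"], output, example.get("reference"))
            for name, evaluate in evaluators.items()
        }
        rows.append({"input": example["input"], "output": output, "scores": scores})
    # Per-evaluator mean; the per-row scores stay available for inspection.
    summary = {name: mean(row["scores"][name] for row in rows) for name in evaluators}
    return {"rows": rows, "summary": summary}
```

Running the same dataset and evaluators against two agent versions and comparing the two `summary` dicts is the simplest form of version comparison; keeping `rows` means you can still see which examples changed.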
The key difference: no reference output exists. A real user sends a real input; nobody knows the "right" answer. So you use reference-free evaluators: groundedness, format validity, safety checks, refusal correctness, tool-call validity, trajectory sanity.
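Two of these checks, format validity and tool-call validity, need nothing but the run itself, which makes them cheap to apply to every production trace. The sketch below assumes the agent returns a JSON object as a string and that tool calls are logged as dicts with `name` and `arguments`; both shapes, and the function names, are assumptions for illustration.

```python
import json
from typing import Dict, Iterable, List


def has_required_keys(output: str, required_keys: Iterable[str]) -> bool:
    """Format validity: output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(required_keys).issubset(parsed)


def tool_calls_valid(tool_calls: List[dict], tool_schemas: Dict[str, List[str]]) -> bool:
    """Tool-call validity: every call names a known tool and supplies
    all of that tool's required arguments.

    tool_schemas maps tool name -> list of required argument names.
    """
    for call in tool_calls:
        required = tool_schemas.get(call["name"])
        if required is None:
            return False  # the agent called a tool that does not exist
        if not set(required).issubset(call["arguments"]):
            return False  # a required argument is missing
    return True
```

Neither function needs a reference answer, so both can run on live traffic. Groundedness and refusal correctness usually need a model-based judge and are not covered by this sketch.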
This is where you catch agent decay — the agent shipped working and is silently worse two months later. It shows up in eval metrics (hallucination rate, tool-call accuracy, cost per task) long before it shows up in your product dashboards. Wire anomaly alerts into Slack/PagerDuty, not a dashboard nobody opens.
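One simple way to turn decay into an alert is to compare a recent window of a metric against a baseline window and flag any worsening beyond a tolerance. The sketch below is one possible rule, not a recommendation of specific thresholds: the 20% default tolerance and the name `check_metric` are assumptions, and delivering the message to Slack or PagerDuty is left to whatever client you already use.

```python
from statistics import mean
from typing import Optional, Sequence


def check_metric(
    name: str,
    baseline: Sequence[float],
    recent: Sequence[float],
    higher_is_worse: bool,
    tolerance: float = 0.2,
) -> Optional[str]:
    """Return an alert message if the recent mean is worse than the
    baseline mean by more than `tolerance` (relative), else None."""
    if not baseline or not recent:
        raise ValueError("both windows need at least one value")
    base, current = mean(baseline), mean(recent)
    # Positive `worsening` always means "got worse", whichever way the metric points.
    worsening = (current - base) if higher_is_worse else (base - current)
    # With a zero baseline the limit is zero, so any worsening alerts.
    limit = tolerance * abs(base)
    if worsening > limit:
        return f"{name} degraded: baseline {base:.4f}, recent {current:.4f}"
    return None
```

Hallucination rate and cost per task would be checked with `higher_is_worse=True`, tool-call accuracy with `higher_is_worse=False`. A fixed relative tolerance is crude; it is here to show the shape of the check, and noisy metrics will likely need longer windows or a statistical test.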
The fundamental split most teams miss: output metrics vs trajectory metrics.
Most teams measure only outputs. An agent can produce a correct answer while calling the same tool 14 times and burning $3 on a $0.02 task. Output-only evaluation scores that as a pass. It is not a pass — it is a production incident waiting for scale.
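To catch that case, score the path as well as the answer. The sketch below assumes each run is logged as a list of steps with a `tool` name and a `cost_usd` figure; the step shape, the function names, and the default limit of three calls per tool are all illustrative assumptions.

```python
from collections import Counter
from typing import List


def trajectory_metrics(steps: List[dict]) -> dict:
    """Summarize how the agent got to its answer, independent of the answer itself."""
    calls_per_tool = Counter(step["tool"] for step in steps)
    return {
        "num_steps": len(steps),
        "max_calls_to_one_tool": max(calls_per_tool.values(), default=0),
        "total_cost_usd": sum(step["cost_usd"] for step in steps),
    }


def trajectory_ok(metrics: dict, cost_budget_usd: float, max_calls_per_tool: int = 3) -> bool:
    """Pass only if the run stayed within its tool-call and cost budgets."""
    return (
        metrics["max_calls_to_one_tool"] <= max_calls_per_tool
        and metrics["total_cost_usd"] <= cost_budget_usd
    )
```

A run only counts as a pass when both the output evaluator and `trajectory_ok` agree, so a correct answer reached through 14 calls to the same tool is recorded as a failure.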
And kill the single aggregate "87% pass rate." It hides which 13% failed (the high-stakes cases?), whether failures cluster in one category, and whether you regressed. Decompose by category, track over time, surface the specific failing examples.
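Decomposing is mostly a grouping exercise. The sketch below assumes each result row carries an `id`, a `category`, and a boolean `passed`; those field names and `pass_rate_by_category` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def pass_rate_by_category(results: List[dict]) -> Dict[str, dict]:
    """Break one aggregate pass rate into per-category rates
    and keep the ids of the examples that failed."""
    buckets = defaultdict(list)
    for row in results:
        buckets[row["category"]].append(row)

    report = {}
    for category, rows in buckets.items():
        failing_ids = [row["id"] for row in rows if not row["passed"]]
        report[category] = {
            "total": len(rows),
            "pass_rate": 1 - len(failing_ids) / len(rows),
            "failing_ids": failing_ids,
        }
    return report
```

Storing one such report per experiment gives you the time series: a category whose pass rate drops between two reports is a regression, and `failing_ids` tells you which examples to open first.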
Four patterns, from real audits:
None of these are model problems. Switching GPT-4 → Claude → Gemini fixes none of them. They are engineering problems with known solutions.
I wrote this up in full, chapter by chapter (datasets, evaluators, offline, online, metrics), as an open guide here: The complete AI agent evaluation process. It is the exact process my firm runs on every production agent we audit.
If your agent is in production and breaking in ways you can't measure, that's the gap. Happy to talk through it — fixmyagent.agency.