I built a scorecard that grades each AI agent's ROI — here's how it works

I was running 11 AI agents — sales outreach, customer support triage, document review, lead scoring, content generation. They were all "working." But I couldn't answer the question every manager asks about their team: "who's pulling their weight?"

I had cost dashboards. I could see total LLM spend. But no one could tell me: this agent made $5,000 in pipeline and cost $800. That one cost $400 and produced nothing measurable.

So I built Metrx, an AI workforce scorecard. It treats each agent like an employee with a P&L — tracking both what they cost and what they produce. After dogfooding it for three months, here's what I learned about managing AI agents like a workforce.

**The Real Problem Isn't Cost — It's Accountability**

Everyone talks about LLM costs. But cost is just one side of the equation. The real question is: **are your agents creating value?**

Most teams I've talked to can tell you their monthly OpenAI bill. Almost none can tell you:

  • Which agent drove the most revenue
  • Which agent has the best cost-to-output ratio
  • Which agent should be promoted (scaled up) and which should be fired (shut down)

This is the same visibility gap that existed in human workforce management before performance reviews became standard. We're just earlier in the curve with AI agents.

**Architecture: The Agent Attribution Pipeline**

The system has three layers, designed around attributing performance to individual agents:

```
┌─────────────────────────────────────┐
│ Your AI Agents                      │
│ (Change base URL, that's it)        │
└──────────────┬──────────────────────┘
┌──────────────▼──────────────────────┐
│ Metrx Gateway                       │
│ (Cloudflare Workers, <5ms)          │
│                                     │
│ • Tags every call by agent + task   │
│ • Attributes cost to each agent     │
│ • Forwards to provider unchanged    │
└──────────────┬──────────────────────┘
┌──────────────▼──────────────────────┐
│ Metrx Scorecard Dashboard           │
│ (Next.js 14 + Supabase)             │
│                                     │
│ • Agent-level P&L statements        │
│ • ROI grades per agent              │
│ • Revenue attribution (Stripe)      │
│ • Performance rankings              │
└──────────────┬──────────────────────┘
┌──────────────▼──────────────────────┐
│ MCP Server (Open Source)            │
│ (23 tools, TypeScript, MIT)         │
│                                     │
│ • Agents query their own P&L        │
│ • Self-optimization decisions       │
│ • Board-ready ROI audit reports     │
│ • A/B model experiments             │
└─────────────────────────────────────┘
```

**Revenue Attribution: The Core Feature**

This isn't an add-on. This is the whole point.

Cost tracking alone tells you what you spent. Revenue attribution tells you what you earned. Together, they give you a P&L per agent — and that's what lets you manage AI agents like a workforce.

Metrx connects to Stripe, HubSpot, and Calendly to attribute revenue back to each agent. If your sales outreach agent costs $800/month but generates $12,000 in pipeline, that's a 15x ROI — promote it (scale it up, give it more leads). If your document review agent costs $400/month and you can't attribute any measurable output, it's time for a performance review.

The attribution engine links: agent activity → task completion → revenue event → P&L scorecard.
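That linking step can be sketched as a join between task records and revenue events via a shared task ID. This is a minimal illustration of the idea, not the actual Metrx data model — `tasks`, `revenueEvents`, and `buildAgentPnl` are all hypothetical names:

```javascript
// Hypothetical shapes: each task record ties an agent to a task it ran,
// and each revenue event references the task that produced it.
const tasks = [
  { taskId: "t1", agent: "sales-outreach", cost: 500 },
  { taskId: "t2", agent: "sales-outreach", cost: 347.23 },
  { taskId: "t3", agent: "doc-review", cost: 400 },
];

const revenueEvents = [
  { taskId: "t1", amount: 9000 }, // e.g. a Stripe payment linked to task t1
  { taskId: "t2", amount: 5200 },
];

// Join revenue events back to agents via task IDs, then roll up a P&L.
function buildAgentPnl(tasks, revenueEvents) {
  const pnl = {};
  for (const t of tasks) {
    pnl[t.agent] ??= { cost: 0, revenue: 0 };
    pnl[t.agent].cost += t.cost;
  }
  const taskToAgent = Object.fromEntries(tasks.map((t) => [t.taskId, t.agent]));
  for (const ev of revenueEvents) {
    const agent = taskToAgent[ev.taskId];
    if (agent) pnl[agent].revenue += ev.amount;
  }
  return pnl;
}

const pnl = buildAgentPnl(tasks, revenueEvents);
// sales-outreach ends up with cost 847.23 and revenue 14200;
// doc-review has cost 400 and nothing attributable.
```

The key design point is that attribution only works if every revenue event can be traced back to a tagged task — which is why the gateway tags calls at the source rather than trying to reconstruct attribution afterwards.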

**Here's what querying agent ROI looks like through the MCP server:**

```
You: "What's the ROI breakdown for my sales outreach agent this month?"

Metrx (via metrx_get_task_roi):
Agent: sales-outreach
Period: March 2026
Total Cost: $847.23
Attributed Revenue: $14,200
ROI: 16.8x
Grade: A+
Recommendation: Scale — increase lead volume allocation
```
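The grade itself is just a function of the ROI multiple. Here's a sketch of how such a mapping might work — the thresholds below are my guesses for illustration, not Metrx's actual grading curve:

```javascript
// Map an ROI multiple to a letter grade.
// Thresholds are illustrative — the real grading curve may differ.
function gradeRoi(roiMultiple) {
  if (roiMultiple >= 10) return "A+";
  if (roiMultiple >= 5) return "A";
  if (roiMultiple >= 2) return "B";
  if (roiMultiple >= 1) return "C"; // roughly break-even
  return "F";                       // costs more than it earns
}

const cost = 847.23;
const revenue = 14200;
const roi = revenue / cost; // ≈ 16.8
console.log(gradeRoi(roi)); // "A+"
```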

**The MCP Server: 23 Tools for Agent Workforce Management**

The open-source piece is a Model Context Protocol server that lets Claude, Cursor, or any MCP-compatible client query agent performance data directly.

The key insight: agents themselves can use these tools. An agent can check its own ROI, compare its performance to other agents, and recommend optimization actions. This is the start of self-managing AI workforces.

**The 23 tools (all prefixed `metrx_`) cover 10 domains:**

| Domain | Tools | What It Does |
|--------|-------|-------------|
| Agent Fleet Overview | 3 | Agent scorecards, performance summaries, detailed agent profiles |
| Optimization | 4 | Model routing, provider arbitrage, cost-per-quality recommendations |
| Budgets | 3 | Spend limits, enforcement modes, budget status |
| Alerts | 3 | Threshold monitoring, acknowledgment, failure prediction |
| Experiments | 3 | A/B model testing, results with statistical significance, winner promotion |
| Cost Leak Detection | 1 | Comprehensive 7-check waste audit |
| Revenue Attribution | 3 | Revenue linking, per-agent ROI calculation, multi-source attribution reports |
| Alert Configuration | 1 | Threshold tuning with automated actions |
| ROI Audit | 1 | Board-ready fleet performance reports |
| Upgrade Justification | 1 | Business case generation for tier upgrades |

**Integration: One Line Change**

```javascript
// Before
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// After — just change the base URL
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://gateway.metrxbot.com/v1",
  defaultHeaders: {
    "x-metrx-agent": "sales-outreach",
  },
});
```

That header is what enables agent-level attribution. Every call tagged with an agent identity flows into that agent's scorecard. Sub-5ms overhead.
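Conceptually, the gateway's tagging step reads that header and books the call's cost against the named agent before forwarding the request. This is a simplified sketch of the idea, not the actual Worker code — the function name, ledger shape, and per-token prices are all illustrative:

```javascript
// Illustrative per-1M-token input prices in USD — not authoritative.
const PRICE_PER_1M_TOKENS = { "gpt-4o": 2.5, "gpt-4o-mini": 0.15 };

// Book one call's cost against the agent named in the x-metrx-agent header.
// The real gateway also forwards the request to the provider unchanged.
function attributeCall(headers, usage, ledger) {
  const agent = headers["x-metrx-agent"] ?? "untagged";
  const rate = PRICE_PER_1M_TOKENS[usage.model] ?? 0;
  const cost = (usage.inputTokens / 1_000_000) * rate;
  ledger[agent] = (ledger[agent] ?? 0) + cost;
  return { agent, cost };
}

const ledger = {};
attributeCall(
  { "x-metrx-agent": "sales-outreach" },
  { model: "gpt-4o", inputTokens: 200_000 },
  ledger
);
// ledger["sales-outreach"] now holds ~$0.50 of attributed spend
```

Calls arriving without the header fall into an "untagged" bucket, which keeps total spend accounting honest even when an agent forgets to identify itself.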

**The Self-Optimizing Loop**

Here's what gets me excited about the MCP approach. When agents have access to their own performance data, they can:

  1. Self-assess: "My ROI dropped 20% this week — what changed?"
  2. Self-optimize: "I'm using GPT-4o for classification that GPT-4o-mini handles at 1/10th the cost"
  3. Self-report: "Generate a board-ready audit of my fleet's performance this quarter"
  4. Self-experiment: "Run an A/B test — does switching to Claude Haiku for my routing layer maintain quality at lower cost?"

This is the difference between a cost dashboard (humans stare at charts) and a workforce management system (agents manage their own performance).
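The self-optimize step (point 2 above) reduces to a constrained choice: among models that clear a quality bar, pick the cheapest. A minimal sketch, with made-up quality scores and per-call costs standing in for whatever an agent would actually measure:

```javascript
// Among models that clear the quality bar, recommend the cheapest.
// Quality scores and costs here are illustrative inputs, not live data.
function recommendModel(options, minQuality) {
  const viable = options.filter((o) => o.quality >= minQuality);
  viable.sort((a, b) => a.costPerCall - b.costPerCall);
  return viable[0] ?? null; // null if nothing meets the bar
}

const classificationOptions = [
  { model: "gpt-4o", quality: 0.97, costPerCall: 0.010 },
  { model: "gpt-4o-mini", quality: 0.95, costPerCall: 0.001 },
];

const pick = recommendModel(classificationOptions, 0.9);
// → gpt-4o-mini: still above the quality bar, at ~1/10th the cost
```

Wire a function like this to per-agent quality metrics and the A/B experiment tools, and "I'm overpaying for classification" becomes a decision an agent can make and verify on its own.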

**Try It**

  • Dashboard: metrxbot.com — free tier (3 agents), no credit card
  • MCP Server: github.com/metrxbots/mcp-server — MIT licensed
  • npm: npx @metrxbot/mcp-server — try in 30 seconds with --demo flag
  • Pricing: Free → Lite ($19/mo, 10 agents) → Pro ($49/mo, unlimited)

If you're running AI agents in production, I'd love to hear: how do you know which agents are worth keeping? Drop a comment or find me on X @metrxbot_.