Citations for Your Data Incidents: Introducing EvidenceCapsule

# ai

Blaine Elliott

When an AI agent answers "why is this table stale?" the hardest question is whether you can trust the answer. EvidenceCapsule is a typed investigation output where every claim cites the row it came from, so you can click through and verify.

When an AI agent answers "why is orders stale this morning?" you get one of two things. Either the agent dumps every row it touched into the response (token-wasteful, hard to read, no audit trail), or it writes a confident prose answer that name-drops events without telling you which row in which table it came from. The second mode is faster to read and structurally untrustworthy. You cannot click through to verify a single claim.

EvidenceCapsule is the typed investigation output we shipped to fix that. Every piece of evidence the agent surfaces during an investigation is a citation: a typed row that points back to the actual record (alert, schema change, metric, lineage edge, intelligence note) in your AnomalyArmor instance, with a clickable deep link in the UI and a stable JSON shape for the SDK and the Claude Code skill.

The problem this is solving

Investigations in a data warehouse pull from a half-dozen sources at once. A real "why is this stale" answer has to consider:

The freshness check that fired and when
Schema changes on the upstream table or its dependencies in the last few days
Alerts on related assets that may be the actual root cause
Recent metric anomalies that point at the same window
Lineage edges showing which downstream consumers depend on this asset

An agent that does this well needs all of those rows in context. An agent that does this trustably needs to tell you, claim by claim, which row supported which conclusion. Without that, the response is structurally indistinguishable from a hallucination, even when it is technically correct.

This is the same problem as a research paper without citations. The argument might be right, but you cannot check the argument because you cannot find the underlying source.

What an EvidenceCapsule looks like

Every investigation returns a capsule with five fields plus the question that was asked:

Field	What it contains	Example
`question`	The natural-language question that triggered the investigation	"Why is the orders table 14 hours stale?"
`root_cause`	A concise hypothesis grounded in the cited evidence	"Upstream Fivetran sync for `raw.orders` failed at 19:04 yesterday and has not recovered"
`triggers`	Typed evidence rows that point at the precipitating events	Schema change on `raw.orders.customer_id`, alert fired at 9:04
`consequences`	Typed evidence rows for what is downstream-affected	Three dashboards consuming `fct_orders` show stale data
`confidence`	A numeric confidence (0.0 to 1.0) aggregated from the underlying correlators	0.82
`open_questions`	Things the agent could not resolve and worth a human looking at	"Is the Fivetran credential expired or did the upstream API change?"

Each evidence row in triggers and consequences is a typed object with a summary, a source (alert, schema change, metric, lineage edge, intelligence note), a UUID pointing at the actual record, the time it was observed, and the structured fields the correlator pulled. In the UI, the summary becomes a clickable link that takes you directly to the alert detail, the schema diff, the metric chart, or the asset page.

A real capsule, returned by the SDK or rendered in the agent UI, looks like this:

question: "Why is the orders table 14 hours stale?"
root_cause: >
  Upstream Fivetran sync for raw.orders failed at 19:04 yesterday
  and has not recovered. Downstream fct_orders built at 06:00 today
  against stale source data.
confidence: 0.82

triggers:
  - source: alert
    source_id: 8c40f984-1a7c-4576-bb8b-eb442ce14965
    summary: "Freshness check failed on raw.orders (14h since last update, SLA 6h)"
    observed_at: 2026-05-24T09:04:11Z
    fields:
      asset: raw.orders
      expected_freshness_hours: 6
      actual_freshness_hours: 14
  - source: schema_change
    source_id: c1a908ef-3e32-45b7-af3d-74c9ab2a9605
    summary: "Column customer_id type changed from BIGINT to VARCHAR on raw.orders"
    observed_at: 2026-05-23T14:31:08Z
    fields:
      asset: raw.orders
      column: customer_id
      change_type: type_changed
      old_type: BIGINT
      new_type: VARCHAR

consequences:
  - source: lineage
    source_id: 73dbca36-a05f-4d71-97ff-60b1dd054b5c
    summary: "exec_revenue dashboard depends on fct_orders, last refreshed 07:15 on stale data"
    observed_at: 2026-05-24T07:15:00Z
    fields:
      consumer: exec_revenue
      consumer_type: dashboard
      depth: 2
  - source: lineage
    source_id: 6a1286f7-ea8c-a400-018e-9d00f3f3b20f
    summary: "finance_daily dashboard depends on fct_orders, last refreshed 07:45 on stale data"
    observed_at: 2026-05-24T07:45:00Z
    fields:
      consumer: finance_daily
      consumer_type: dashboard
      depth: 2

open_questions:
  - "Is the Fivetran credential expired or did the upstream API change?"
  - "Should fct_orders be paused until raw.orders recovers?"

Every source_id resolves to a real record in the AnomalyArmor instance. The UI renders each evidence row as a citation chip linking to the corresponding detail page. The SDK exposes the same shape as typed Pydantic models you can iterate over directly.

You read the prose answer. If a claim looks suspicious, you click the link next to it. Either the record supports the claim or it does not. There is no third option, which is the whole point.

Why this changes how trust works with AI investigations

The standard objection to LLM-driven analysis in production data systems is "I cannot tell when it is wrong." That objection is correct as a general statement, and the usual response (better prompts, more guardrails, careful evals) is real but incomplete. The structural fix is not making the model better at not hallucinating; it is removing the path by which a hallucination can hide.

A capsule with citations cannot hide a hallucinated claim, because the claim is either backed by a row that exists or it is not. The link either resolves to a real record or it 404s. A reviewer (a data engineer, an on-call rotation, a manager reading a postmortem) can spot-check three citations in 30 seconds and develop calibrated trust about whether the whole capsule is reliable.

That is a different shape of trust than "I read the answer and it sounded right." It is closer to how engineers already trust SQL: not because the database always returns the right answer, but because they can read the query and the table and verify.

How this works across the surfaces you actually use it from

EvidenceCapsule is the same shape whether you are in the web UI, asking the SDK from a notebook, or running the Claude Code skill from your terminal. The mental model does not change between contexts.

In the web agent. Ask "why is orders stale" in the chat. The response is the agent's prose answer plus a rendered capsule below it. Each evidence row shows up as a citation chip: schema-change icon, "Column customer_id type changed on raw.orders (2026-05-23 14:31)", clickable. The agent's prose narrates the case; the capsule gives you the receipts.

In the SDK. A new client.investigations resource has two methods. get(id) returns the capsule for an existing investigation. explain(asset_id, question) runs an ad-hoc investigation against a single asset and returns the capsule directly. Both return typed EvidenceCapsule and Evidence models you can iterate over, log, or feed into your own downstream processing.

from anomalyarmor import AnomalyArmor

client = AnomalyArmor(api_key=...)
capsule = client.investigations.explain(
    asset_id="abc-123",
    question="Why is this table 14 hours stale?",
)
print(capsule.root_cause)
for evidence in capsule.triggers:
    print(f"  - {evidence.summary} ({evidence.source}: {evidence.source_id})")

In the Claude Code skill. The armor-investigate skill used to be a five-step recipe: list investigations, get one, pull alerts, pull schema changes, stitch a prose answer. Each step was an SDK call the model had to remember to make in the right order. With EvidenceCapsule, the skill is one typed call. The model asks the SDK for an explanation, the SDK returns a capsule, the model renders it. The collapse from five stitched calls to one is the highest-leverage change in this whole feature.

Why not just give the LLM more context and hope for the best?

A reasonable question, since "stuff more rows into the prompt" is the default reflex. Three concrete reasons.

Cost. Investigation context can easily be hundreds of rows once you include alerts, schema events, metrics, and lineage. Putting all of that in the prompt every time pays the token cost on every invocation. Building the capsule on the backend before the LLM call gives the model the minimal projection it needs and keeps token usage predictable.

Determinism. The capsule shape is the same every time. A prose answer's structure depends on the model's mood, the temperature setting, and the exact phrasing of the question. Downstream consumers (logging, eval harnesses, incident-ticket exports, the frontend renderer) cannot reliably parse free-form prose. They can reliably parse a typed schema.

Verifiability. This is the one that matters most. A flat blob of rows in the prompt does not constrain the model's output to those rows. The model can still confabulate around the data it was given. A typed capsule with source_id fields forces the output to be grounded in specific records that either exist or do not. The verifiability is structural, not behavioral.

What this unlocks next

Capsule-as-output is the substrate for several things we could not build before. Capsule diffing across runs (did the same investigation produce a different answer this week than last week?). Persistent capsules for incident postmortems (a stable record of what the agent saw and concluded, not just what it said). Eval harnesses that compare agent capsules against a ground-truth set on the same data. Export-to-Jira or export-to-Linear flows that turn an investigation into an actionable ticket with all the evidence pre-attached.

None of those are shipping today. They become buildable because the shape exists.

How to try it

EvidenceCapsule shipped to production. If you have an existing AnomalyArmor instance:

Web agent: ask any investigation question in the chat. The capsule renders below the prose answer.
SDK: upgrade to the latest Python SDK and call client.investigations.explain(asset_id, question).
Claude Code skill: update armor-investigate to the latest version and use it the same way you always have. The output shape is now structured JSON.

For background on why we are building observability tooling that lives inside your AI assistant rather than in a separate dashboard, see a data observability tool that works from inside your AI assistant. For how AI-native workflows change incident response in practice, see how AI-native data observability changes incident response.

EvidenceCapsule FAQ

Is this just RAG?

No. RAG (retrieval-augmented generation) is a pattern for retrieving relevant context and stuffing it into the prompt. EvidenceCapsule is a typed output schema with citations to the retrieved records. They are complementary: the agent retrieves relevant rows the same way it always did, then produces a capsule that lets you verify each claim against those rows. RAG is the retrieval side. EvidenceCapsule is the auditability side.

Does the LLM still write the prose answer?

Yes. The capsule is a structured projection. The prose narration on top is still generated by the model, grounded in the capsule. The difference is that any claim in the prose is checkable against a citation in the capsule.

What sources are supported in v1?

Alerts, schema changes, metrics, lineage edges, and intelligence notes. These are the structured-row sources investigations already touch. Raw log clustering and unstructured text ingestion are explicitly out of scope; if your evidence is in structured tables, it is supported.

Does this work with my own data?

EvidenceCapsule operates on the metadata AnomalyArmor already collects from your warehouse and integrations. It does not require any new instrumentation on your side beyond the existing AnomalyArmor connection.

Is the capsule format stable?

Yes, the schema is versioned and additive. The existing get_investigation tool returns the legacy JSON shape plus the new evidence_capsule field, so clients that do not opt in keep working unchanged.

Can I export a capsule to an incident ticket?

Not as a built-in flow yet. The capsule's typed structure makes that build straightforward, and it is on the short list of things we expect to ship as a follow-up. If you want to do it manually today, the SDK gives you all the fields.

How is confidence computed?

Confidence is aggregated from the underlying correlators (each evidence row carries its own confidence; the capsule's overall confidence is the aggregated result). It is not a second LLM call grading the first one. Rule-based aggregation is more predictable and avoids the "model grades itself" failure mode.

Why do you call it a "capsule"?

The naming comes from the broader trend of structured investigation outputs in observability tooling. The shape (root cause + triggers + consequences + confidence + open questions, each backed by typed evidence) is the load-bearing part, not the word.

If you want to see EvidenceCapsule in action on your own warehouse, AnomalyArmor is in private beta. Reach out and we will get you access.