Watching Your Agents Work: Logs, Traces, and Metrics for Reliable AI Workflows

Kowshik Jallipalli

When a traditional REST API fails, it throws a 500 error and a clean stack trace. When an autonomous AI agent fails, it might silently retry a broken tool five times, burn $2 in tokens, and then confidently return a hallucinated response to the user.

Building agentic workflows—where an LLM decides which tools to call and how to route logic—is fundamentally non-deterministic. If you are relying on print() statements to see what your LangGraph graphs or custom while loops are doing, you are flying blind in production.

To ship reliable AI features, you need to instrument your agents just like any other distributed microservice. Here is exactly how to implement OpenTelemetry (OTel) tracing and structured logging to track agent health, token burn, and tool failures.

Why This Matters
An agentic workflow is essentially a dynamic dependency chain. If a user asks a Support Agent to refund a ticket, the agent might:

Call a vector database to find the refund policy.

Call the Stripe API to check the charge.

Call an internal API to issue the refund.

If the final output is wrong, you need to know: Was the prompt bad? Did the vector DB return the wrong chunk? Did the Stripe API time out? Tracing gives every request a trace_id and breaks down each step into a span, allowing you to visualize exactly where the LLM went off the rails.

How It Works: The Traced Agent
We are going to use standard OpenTelemetry in Python. Instead of locking ourselves into a specific vendor's LLM observability SDK right away, we will build a vendor-neutral trace structure.

Every time the agent is triggered, we start a parent trace. Every time the agent calls an LLM or executes a tool, we create a child span attached to that trace. We will inject the prompt, the tool inputs, the tool outputs, and the token counts as attributes (metadata) onto those spans.

The Implementation (Code to Copy)
Here is a minimal, realistic example of a traced agent loop. We use OTel decorators and context managers to automatically link spans together.

src/agent.py

```python
import json

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# 1. Set up a minimal OTel provider (in production, export to Datadog/Grafana/Honeycomb)
provider = TracerProvider()
processor = SimpleSpanProcessor(ConsoleSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("support_agent_service")


def tool_check_billing(customer_id: str) -> str:
    """Mock external API tool."""
    return f"Customer {customer_id} is on the Pro plan."


@tracer.start_as_current_span("agent_execution")
def run_support_agent(ticket_id: str, user_query: str) -> str:
    # Grab the parent span to inject top-level context
    parent_span = trace.get_current_span()
    parent_span.set_attribute("app.ticket_id", ticket_id)
    parent_span.set_attribute("app.user_query", user_query)

    # Step 1: The tool call phase
    with tracer.start_as_current_span("tool_call:check_billing") as tool_span:
        customer_id = "cus_123"  # Hardcoded for the example
        tool_span.set_attribute("tool.name", "check_billing")
        tool_span.set_attribute("tool.input", json.dumps({"customer_id": customer_id}))
        try:
            result = tool_check_billing(customer_id)
            tool_span.set_attribute("tool.output", result)
            tool_span.set_status(trace.StatusCode.OK)
        except Exception as e:
            tool_span.record_exception(e)
            tool_span.set_status(trace.StatusCode.ERROR)
            raise

    # Step 2: The LLM generation phase
    with tracer.start_as_current_span("llm_call:claude_3_5") as llm_span:
        llm_span.set_attribute("llm.model", "claude-3-5-sonnet-latest")

        # In a real app, this is where you call the Anthropic/OpenAI SDK
        mock_response = "Based on your Pro plan, I can help you with..."

        # Log cost metrics
        llm_span.set_attribute("llm.usage.prompt_tokens", 450)
        llm_span.set_attribute("llm.usage.completion_tokens", 85)
        llm_span.set_attribute("llm.response", mock_response)

    return mock_response


# Trigger the agent
run_support_agent("tkt_8991", "Why was I charged twice?")
```
When this runs, your backend observability tool will stitch together a beautiful waterfall chart. You can filter your entire logging backend for app.ticket_id = 'tkt_8991' and instantly see the tool inputs, the LLM latency, and the exact token consumption for that specific user interaction.

Pitfalls and Gotchas
When instrumenting AI features, you will run into several specific friction points:

Logging massive payloads: If you dump entire RAG contexts (e.g., 50,000 tokens of embedded text) into a span attribute like tool.output, you will blow up your observability ingestion costs and likely hit payload size limits in tools like Datadog. Truncate long contexts or log pointers (like S3 URIs or DB row IDs) instead of raw text.
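One way to enforce this is a small helper that every set_attribute call goes through. This is a minimal sketch (safe_attr, MAX_ATTR_CHARS, and the 1,000-character limit are all illustrative choices, not part of any OTel API):

```python
from typing import Optional

MAX_ATTR_CHARS = 1_000  # tune to your observability backend's attribute size limit


def safe_attr(value: str, pointer: Optional[str] = None, limit: int = MAX_ATTR_CHARS) -> str:
    """Make a payload safe to attach as a span attribute.

    If a pointer (e.g. an S3 URI or DB row ID) is supplied, log that instead
    of the raw text so the full payload stays retrievable out-of-band.
    Otherwise, truncate anything longer than `limit` characters.
    """
    if pointer is not None:
        return pointer
    if len(value) <= limit:
        return value
    return value[:limit] + f"... [truncated {len(value) - limit} chars]"
```

You would then write `tool_span.set_attribute("tool.output", safe_attr(result))` instead of attaching the raw value.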

Leaking PII: LLM prompts often contain sensitive user data. If you blindly log app.user_query and llm.response to your centralized logging platform, you risk violating compliance. Implement a sanitization middleware to scrub emails, API keys, and credit cards before the span is exported.
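A minimal sketch of that sanitization step, applied before any value reaches set_attribute. The regexes below are illustrative only and deliberately loose; a real deployment would use a vetted PII-detection library and patterns tuned to its own data:

```python
import re

# Illustrative patterns only -- extend and harden for your own compliance needs
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),
    (re.compile(r"\b(?:sk|pk)[-_][A-Za-z0-9]{16,}\b"), "[API_KEY]"),
]


def scrub(text: str) -> str:
    """Redact common PII patterns before a value is attached to a span."""
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

In the agent above, `parent_span.set_attribute("app.user_query", scrub(user_query))` would keep raw emails and card numbers out of your logging platform.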

Orphaned Spans: In multi-agent systems where Agent A triggers Agent B asynchronously (e.g., via a message queue like Kafka or SQS), the trace context drops. You must explicitly serialize the OTel trace context into the message header and deserialize it on the worker node to maintain a single continuous trace.

Ignoring the "Thinking" Time: Don't just log token counts. Log the "Time to First Token" (TTFT). If your agent feels sluggish to the user, TTFT metrics will tell you if the bottleneck is your vector database retrieval or the LLM's initial reasoning phase.
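Measuring TTFT is cheap if you wrap the streaming response. A minimal sketch, assuming `stream` is any iterable of text chunks (such as an SDK streaming response) and `span` is the active LLM span; stream_with_ttft and the llm.ttft_ms attribute name are illustrative, not a standard convention:

```python
import time


def stream_with_ttft(stream, span):
    """Pass chunks through unchanged, recording Time-to-First-Token on the span.

    The clock starts when iteration begins; ideally you would capture the
    start time when the LLM request is issued and pass it in instead.
    """
    start = time.perf_counter()
    first = True
    for chunk in stream:
        if first:
            span.set_attribute("llm.ttft_ms", (time.perf_counter() - start) * 1000)
            first = False
        yield chunk
```

Charting llm.ttft_ms alongside total span duration makes it obvious whether latency lives in retrieval, queuing, or the model's initial reasoning.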

What to Try Next
Ready to stop guessing why your agents are failing? Try implementing these next steps:

The Cost Dashboard: Export your OTel spans to Grafana or Datadog and create a time-series chart querying sum(llm.usage.prompt_tokens). Group by app.agent_name to see exactly which workflow is driving up your Anthropic/OpenAI bill.

Dedicated AI Observability: If you don't want to build dashboards from scratch, swap the ConsoleSpanExporter for an exporter pointing to an AI-native observability platform like LangSmith, Braintrust, or Helicone. They offer purpose-built UI for evaluating trace payloads.

Alerting on Tool Failure Rates: Set up an alert monitor that triggers a Slack ping if the ratio of spans with tool.name: check_billing and status: ERROR exceeds 5% in a 10-minute window. This tells you an external API is down before your agent goes rogue trying to handle the errors.