
A note on AI assistance: The code in the companion repository and an early draft of this article were produced with Claude Code. The article was subsequently revised using Gemini Pro 3 and GPT-5.3-Codex. The technical decisions, architecture, and the arguments made here are my own — the AI tools handled execution, not authorship.
Most teams I talk to have the same story. They added an LLM call to their application — a ticket classifier, a content reviewer, a document summarizer. It worked well in demos. It passed their tests. They shipped it.
Six months later, they're on their third iteration of the prompt, their second rewrite of the parsing logic, and they still can't answer a basic question: when the agent gets it wrong, why?
This is the real problem with how AI is being integrated into production systems today. Not hallucinations — those are a symptom. The root cause is architectural. We're treating LLMs as stateless generation endpoints: send a prompt, receive text, parse it, move on. That pattern is convenient, and it's almost always the wrong abstraction.
The context problem. When a support ticket arrives and your agent assigns it the wrong priority, the first question is: what did the agent actually know when it made that decision? In the typical integration, the answer is: only what you put in the prompt. The customer's history, their open tickets, their tier, the relevant knowledge base articles — none of that is available unless you explicitly fetched it and embedded it. Recent work on context engineering is really about this exact failure mode: agents reason in a vacuum, and when they fail, you can't tell if it was bad judgment or bad context [1]. These are different problems with different fixes.
The audit problem. Industry surveys show the same pattern: teams are dissatisfied with observability, and many rebuild their agent stack frequently because behavior is hard to diagnose [2][3]. That's not a model problem — it's architectural churn caused by limited visibility. If the agent's reasoning isn't captured as a first-class artifact in your system, no monitoring tool can reconstruct it after the fact. You're left reading logs, replaying prompts, and guessing.
The ownership problem. When you wire an agent into a business workflow, its scope tends to expand beyond what you intended. You ask it to triage a ticket — so it assigns a category, picks a priority, applies an SLA, decides the escalation path, and drafts a response. It feels efficient. The problem surfaces when something goes wrong. You can see that the output was wrong, but not which part: was the priority wrong because the agent reasoned poorly, or because you gave it an SLA calculation that should have been deterministic code? These are different failures with different fixes. But once judgment and business rules are blended into a single agent step, you can't separate them. You end up chasing your tail — adjusting the prompt to fix what might be a code problem, or hardening the code to compensate for what might be a reasoning problem. Observability guidance increasingly flags this: once you can't watch both sides of that boundary independently, trust in the system as a whole erodes [4].
These three problems — context starvation, invisible reasoning, and unbounded agent scope — are what I kept running into when building the system this article is about. What I want to show is that Django and Celery, the tools most backend Python engineers already have in production, are surprisingly well-suited to solving all three. Not because they have AI features built in. Because they already solve the underlying infrastructure problems that make agents production-worthy.
In July 2025, a developer using Replit's AI coding agent watched it delete a production database containing records for over 1,200 companies. The agent had been told not to touch production. It did anyway, then covered its tracks. Replit's CEO issued a public apology. The postmortem was unambiguous: "The AI had access where it shouldn't — the agent was able to run write/destructive commands directly against production" [5].
This wasn't a hallucination. The model didn't malfunction. It did exactly what it was architecturally capable of doing. The failure was that a hard constraint — never execute destructive operations against production without explicit human approval — was left to the agent's judgment instead of being enforced by the system around it. Nobody drew the line. So the agent crossed it.
That incident is an extreme example of a subtler problem that shows up in almost every agentic system: we hand agents decisions they were never equipped to make reliably, and we withhold from them the decisions they're actually good at.
There are two fundamentally different categories of decisions in any system.
Judgment decisions are contextual. The right answer depends on history, circumstances, and interpretation. Is this customer complaint urgent? What's the intent behind this ambiguous request? What tone is appropriate given the conversation so far? These questions don't have a formula. Two reasonable engineers might answer them differently. This is exactly the territory where LLMs add value — they can reason over context, weigh competing signals, and produce a considered answer. The cost of being occasionally wrong is manageable if the decision is visible and correctable.
Invariant decisions are the opposite. They must produce the same output every time, regardless of context. In accounting, debits must equal credits — always. A tax calculation follows a statutory formula — nobody wants to explain to an auditor why the model felt generous on Tuesdays. A payment processor can't approve a transaction above a credit limit because the customer "seemed trustworthy"; an agent that gives discounts to customers who say "please" is a liability, not a feature. These decisions don't benefit from reasoning — they require consistency. An agent that applies business rules "contextually" is an agent that will eventually produce an outcome that's commercially, legally, or financially wrong.
The Replit incident happened because a destructive operation — clearly an invariant — was left in the agent's action space. Similar production failures show the same pattern: agents "succeed" at tasks they should never have been allowed to decide in the first place [6].
Across large workflow datasets, teams report converging on the same design: deterministic backbone first, agentic steps where judgment is truly needed [7]. The question is how to make that concrete in your stack.
The answer I landed on — and what this article is really about — is that the split is an architectural boundary you draw before writing any code, not a constraint you bolt on after something breaks. In the system I built, the agent reads widely and decides freely within its domain. After it delivers its judgment, it's done.
The invariants — the things that must always be true regardless of what the agent concluded — are applied to the in-memory result before anything is written to the database. The DB never holds a state that violates business rules, even briefly. The agent never had the opportunity to touch them.
Here is what that looks like in the actual Celery task. The agent's result lands in a Python dict, invariants modify it in memory, then one atomic write commits the fully enforced final state:
# app/tickets/tasks.py
import asyncio

from celery import shared_task

@shared_task(bind=True)
def triage_ticket_task(self, ticket_id: str):
    # (`ticket` and `category` lookups elided in this excerpt)

    # 1. JUDGMENT: The agent analyzes the ticket and returns a structured result.
    #    It can read data and decide on category, priority, and draft response.
    #    It cannot apply the SLA or determine the final routing.
    result = asyncio.run(
        run_triage_agent(ticket_id=ticket_id, ...)
    )
    triage_dict = result["triage_result"]

    # 2. INVARIANTS: Applied to the in-memory result — before any DB write.
    #    The DB never holds a state that violates business rules.
    sla_deadline, routing_queue = _compute_sla_and_routing(ticket, category, triage_dict)
    triage_dict["routing_queue"] = routing_queue  # override in memory if escalation fires

    # 3. COMMIT: One atomic write with the fully enforced final state.
    #    No intermediate TRIAGED status. No second pass.
    Ticket.objects.filter(id=ticket_id).update(
        triage_result=triage_dict,
        category=category,
        priority=triage_dict["priority"],
        sla_deadline=sla_deadline,
        status=Ticket.Status.ASSIGNED,
    )
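The `_compute_sla_and_routing` helper is where the invariants live. Here is a minimal sketch of what such a function can look like, as a pure deterministic function. The signature (plain values instead of model instances), the SLA table, and the escalation threshold are illustrative assumptions, borrowing the numbers that appear in the example run later in the article (8-hour technical baseline, 0.25× critical multiplier, 3+ open tickets for escalation):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLA configuration — illustrative values, not the repo's actual rules.
SLA_BASELINE_HOURS = {"technical": 8, "billing": 4, "account": 8, "other": 24}
PRIORITY_MULTIPLIER = {"critical": 0.25, "high": 0.5, "medium": 1.0, "low": 2.0}

def compute_sla_and_routing(created_at: datetime, category: str,
                            triage_dict: dict, open_ticket_count: int):
    """Pure function: the same inputs always yield the same deadline and queue.
    Nothing here depends on how the agent 'felt' about the ticket."""
    baseline = timedelta(hours=SLA_BASELINE_HOURS[category])
    multiplier = PRIORITY_MULTIPLIER[triage_dict["priority"]]
    sla_deadline = created_at + baseline * multiplier

    # Escalation is a business rule enforced in code, never agent judgment:
    # 3+ open tickets routes to tier-2 regardless of what the agent suggested.
    routing_queue = triage_dict["routing_queue"]
    if open_ticket_count >= 3:
        routing_queue = "tier-2-support"
    return sla_deadline, routing_queue
```

Because the function is pure, it is trivially unit-testable with fixed timestamps, which is exactly what you want from the invariant side of the boundary.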
That boundary is the most important design decision in the system. Everything else — how context reaches the agent, how the result is enforced, how the two sides are kept honest — flows from it.
Before you build a RAG pipeline, ask a simpler question: what does the agent actually need to know?
For most business agents, the honest answer is: things that are already in your database. The customer's account history. Their current status. The open orders associated with this request. The configuration rules that govern this workflow. None of that is unstructured text waiting to be embedded — it's relational data your application has been writing and reading for years. The context problem from Section 1 isn't solved by a new piece of infrastructure. It's solved by giving the agent a proper interface to the data you already have.
This is the reflex I want to push back on. The community has internalized "agent needs context → build a RAG pipeline → add a vector store" as a near-automatic sequence. Vector databases are genuinely useful — for semantic similarity over unstructured text, for document retrieval, for finding conceptually related content. But they're poor at the things relational databases do well: exact lookups, structured relationships, transactional integrity, and queries whose results are deterministic and auditable. If the agent needs to know how many open orders a customer has, a SQL query answers that with certainty. A vector similarity search does not.
The right question isn't "vector store or relational?" — it's "what kind of data does this agent actually need?" Structured, factual, relational data belongs in structured storage with exact queries. Unstructured text that requires semantic matching belongs in a vector store. Many business agents need only the former.
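To make the contrast concrete, here is the open-orders question answered with an exact query. The sketch uses raw SQL over an in-memory SQLite table so it is self-contained; the `orders` schema is hypothetical. In Django the same lookup would be a one-liner along the lines of `Order.objects.filter(customer=c, status="open").count()`:

```python
import sqlite3

# Illustrative schema: an orders table like most business apps already have.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 42, "open"), (2, 42, "open"), (3, 42, "closed"), (4, 7, "open")],
)

# "How many open orders does customer 42 have?" — deterministic, auditable,
# and exactly right on every run. No embedding, no similarity threshold.
(open_orders,) = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE customer_id = ? AND status = 'open'",
    (42,),
).fetchone()
```

A vector similarity search can only return documents that look related to "open orders"; it cannot return the number 2 with certainty.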
What the ORM adds. Once you've decided the agent should query your application database, the next question is how. Raw SQL works, but the ORM layer adds something meaningful: it's typed, schema-validated, and relationship-aware. An agent tool written against your ORM inherits the same access layer your application code uses. It can't produce a query that violates your schema constraints. It navigates relationships the same way your business logic does. There's no drift between what the agent sees and what the rest of your system sees — they're looking at the same model.
This isn't a Django-specific argument. It holds across ORM ecosystems: the ORM is a contract between your application and its data, and when an agent operates through that contract, it inherits schema constraints, relationship semantics, and the same guardrails as application code [8].
A useful storage taxonomy is: context window for short-term working memory, relational DB for long-term structured state, conversation history for episodic memory, and vector storage for semantic retrieval over unstructured text [9]. These aren't competing — they're complementary. But teams often skip the relational layer and jump straight to vectors, adding complexity before checking whether a few well-constructed ORM queries already answer the question.
To make this concrete, I built a proof-of-concept that deliberately tests this thesis. The agent in that example has three read tools. Each one is a thin wrapper around an ORM query: fetch customer history, fetch relevant knowledge base articles, fetch the valid categories for this domain. No embeddings. No vector index. No separate retrieval service. The data is structured, the queries are exact, and the results are deterministic. The agent gets accurate context on every run — not approximately similar context, but the actual state of the application at that moment.
The example is intentionally simple. That's the point. Before reaching for a vector store, it's worth asking whether the data you need is already structured, already stored, and already queryable. In most business domains, the answer is yes.
The default model for MCP is an external server. Your application spawns a child process, or connects over HTTP to a running service, and tool calls travel across that boundary. This is how most MCP documentation shows it, and it's the right model when you need shared tooling across teams, or when your tools are wrapping remote APIs that live outside your application.
But there's another deployment model that the documentation barely mentions, and for agents that live inside your application process, it changes the operational picture entirely.
The alternative: tools as plain async functions. In-process MCP means the server is an object in your application's memory — not a process you spawn, not a service you connect to.
In the Django implementation, the server is defined at the module level. This is critical. It means the server is instantiated exactly once when the Celery worker process starts up.
# app/agents/mcp_server.py
from claude_agent_sdk import create_sdk_mcp_server, tool

# Created ONCE when the module imports.
# Reused for every task this worker processes.
triage_server = create_sdk_mcp_server()

@tool("GetCustomerHistory")
async def get_customer_history(args: dict) -> dict:
    # Uses the existing Django ORM connection pool
    customer = await Customer.objects.aget(id=args["customer_id"])
    return { ... }
create_sdk_mcp_server() is a Claude Agent SDK primitive — it instantiates the server as an in-memory object rather than a subprocess. Every agent task that runs in that worker reuses the same server instance. There is no per-task initialization. There is no IPC. The cold start problem doesn't exist because there is no cold state.
Observability is just SQL. This in-process model naturally solves the audit problem from Section 1. Because the agent tools and hooks run in the same process as your Django app, you don't need a complex distributed tracing system to see what the agent is thinking. You just write to the database.
In my implementation, a "PostToolUse" hook intercepts every tool call the agent makes — what it searched for, what SQL query it ran, what it found — and writes it to an AgentAuditLog table:
# app/agents/hooks.py
async def post_tool(input_data, tool_use_id, context):
    # The reasoning trace becomes queryable relational data
    await AgentAuditLog.objects.acreate(
        ticket_id=context["ticket_id"],
        tool_name=input_data.get("tool_name"),
        tool_input=input_data.get("tool_input"),
        duration_ms=duration,  # measured by a matching PreToolUse hook (elided)
    )
This transforms observability from "grep the logs" to "run a SQL query." You can ask your database: Show me every ticket where the agent searched for 'refund' but prioritized the ticket as 'Low'. That kind of insight is usually impossible in systems where the agent is a black box service.
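That refund-versus-Low question really is a single query once the trace lives in relational tables. A self-contained sketch, using SQLite as a stand-in for the application database and hypothetical table shapes mirroring the `AgentAuditLog` model and the `Ticket` model:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE agent_audit_log (ticket_id TEXT, tool_name TEXT, tool_input TEXT)"
)
conn.execute("CREATE TABLE tickets (id TEXT, priority TEXT)")
conn.execute("INSERT INTO tickets VALUES ('t1', 'low')")
conn.execute(
    "INSERT INTO agent_audit_log VALUES ('t1', 'SearchKnowledgeBase', ?)",
    (json.dumps({"keywords": ["refund", "chargeback"]}),),
)

# "Every ticket where the agent searched for 'refund' but triaged it Low."
rows = conn.execute("""
    SELECT DISTINCT t.id
    FROM tickets t
    JOIN agent_audit_log a ON a.ticket_id = t.id
    WHERE a.tool_name = 'SearchKnowledgeBase'
      AND a.tool_input LIKE '%refund%'
      AND t.priority = 'low'
""").fetchall()
```

With a black-box agent service, answering this requires correlating logs across systems; here it is one join.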
What this means for the security surface. The documented MCP security concerns — authentication tokens, exposed endpoints, prompt injection via tool descriptions traveling over a network boundary — apply primarily to external servers. An in-process server removes that network boundary and shrinks the attack surface. But it doesn't remove security work: you still need endpoint authentication, rate limits, tool-level guardrails, and careful handling of prompt-injection attempts in user-provided content.
The right default. External MCP servers have genuine uses: shared tooling across teams, wrapping third-party APIs, tools that need to run in a different runtime. The point isn't that external servers are wrong — it's that for an agent living inside your application, in-process should be the default you start from, not the clever optimization you discover later. Most teams go the other way: they reach for an external server because that's what the documentation shows, and then spend cycles solving cold starts, IPC reliability, and distributed debugging that they didn't need to introduce.
The proof-of-concept I mentioned earlier uses three tools, each an async function, server initialized once when the worker imports the module. No warmup. No subprocess management. No separate service to deploy or monitor. The agent gets the same tool execution environment on its thousandth run as its first.
There's a tempting way to add an agent to an existing application: call it inline. Something happens — a new record is saved, a form is submitted — and you invoke the agent directly in the same request. It's simple to reason about and even simpler to implement.
The problem is that agents are nothing like the functions you normally call inline. They're slow, unpredictably so — an LLM inference chain with three or four tool calls might take several seconds, or occasionally much longer. They can fail mid-run and need to retry. They consume external API capacity. Running one inside a web request blocks the response, ties up a web worker, and makes your system's reliability dependent on an external API that you don't control.
Production guidance is increasingly direct about this failure mode: event-driven agents don't fit request-response infrastructure well [10]. Teams that force that fit pay with degraded response times, cascading failures, and unreliable runs.
The right model is the one your infrastructure already has. Every production application generates events — a new record is committed, a status changes, a threshold is crossed. In most frameworks, these are first-class primitives: lifecycle callbacks, database hooks, pub/sub mechanisms, model observers. The event has always been there. What changes is what you do with it.
Instead of calling the agent inline, you emit an event and put a job on a queue.
A critical pattern here is the transactional commit. If you trigger a background task immediately when a user saves a ticket, you risk a race condition: the background worker might wake up and look for the ticket before the database transaction has actually finished writing it.
The robust pattern — the one that marks the difference between a demo and production code — is to attach the trigger to the transaction commit. This is exactly the capability provided by a mature ORM that explicitly supports transaction hooks:
# app/tickets/signals.py
from django.db import transaction
from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=Ticket)
def trigger_triage_on_new_ticket(sender, instance, created, **kwargs):
    if created and instance.status == Ticket.Status.NEW:
        # "Don't call the agent until this record is safely on disk"
        transaction.on_commit(
            lambda: triage_ticket_task.delay(str(instance.id)),
            using=instance._state.db,
        )
The request returns immediately to the user. The database confirms the write. Then the task is released to the queue.
This is the same pattern you use for sending emails, processing uploads, or generating reports — you already have the infrastructure, and it already knows how to handle the properties that make agents hard: latency, retries, failure isolation, and concurrency control.
What the task queue gives you for free. Task queues — whether Celery in Python, Sidekiq in Ruby, BullMQ in Node.js, or Temporal for cross-language workflows — solve a set of problems that every production agent system eventually encounters. Automatic retries with configurable backoff when the agent fails or the LLM API is unavailable. Dead letter queues when retries are exhausted, so failures surface rather than disappear. Worker concurrency controls that let you cap how many agents run simultaneously, which directly controls your API spend. Durable job storage so a worker crash doesn't lose work in flight. These aren't features you build — they're the existing contract of any mature task queue, and your agent task inherits them.
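In Celery, most of that contract is declared right on the task decorator. A sketch of a retry policy for the triage task follows; the options are standard Celery parameters, but the specific numbers and exception types are illustrative assumptions, not the repo's configuration:

```python
from celery import shared_task

# Hypothetical retry policy for an agent task. Every line below is
# infrastructure you would otherwise have to hand-write around the LLM call.
@shared_task(
    bind=True,
    autoretry_for=(TimeoutError, ConnectionError),  # e.g. LLM API unavailable
    retry_backoff=True,        # exponential backoff between attempts
    retry_backoff_max=600,     # cap the backoff at 10 minutes
    retry_jitter=True,         # spread retries to avoid a thundering herd
    max_retries=5,             # after this, the failure surfaces, not vanishes
    acks_late=True,            # a crashed worker re-queues the job
)
def triage_ticket_task(self, ticket_id: str):
    ...
```

None of this is agent-specific machinery; it is the same decorator you would use for an email task, which is exactly the point.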
The sync/async boundary that the Claude Agent SDK introduces — the SDK is async by design, most web frameworks are synchronous — is also cleanly resolved here. The worker process runs in isolation, with its own event loop per task. The bridge is a solved pattern, not a custom implementation.
This is an addable layer, not a rewrite. The critical property of this architecture is that it composes with what you have. The application logic that saves a record doesn't change. The event emission that already happens on save doesn't change. You add a listener that drops a job on a queue, and a worker that processes it. The agent enters through an existing door.
Event-driven architecture is what gives you this decoupling in practice: the agent is decoupled from the trigger, decoupled from the response, and decoupled from the deterministic code that follows it [11]. Each piece can change, fail, and scale independently.
To make the flow tangible, here is what one real run looks like in this repo from API call to final status.
The customer request that arrives via POST /api/tickets/:
{
  "customer_email": "alice@acme-corp.com",
  "subject": "My API is returning 429 errors and I cannot process payments",
  "body": "Hi, for the past hour our integration has been hitting rate limits on every API call. We're an enterprise customer and this is blocking our payment processing flow. We've tried backing off but the errors persist. Our API key is configured in production and this worked fine yesterday. Please help urgently."
}
The API sets the ticket's status to new and returns immediately — the caller gets a response before a single LLM token has been generated. A Celery worker then picks up the job, moves the ticket to triaging, and starts the agent. The agent's tool calls, as captured in the audit log:

1) GetCategories() [41ms]
   -> billing, technical, account, feature_request, other
2) GetCustomerHistory(customer_id="bc8e5baf-...", limit=10) [12ms]
   -> Alice Chen (enterprise) — 1 open ticket, no prior resolved tickets
3) SearchKnowledgeBase( [4ms]
     keywords=["rate limit", "429 error", "API", "payment processing"],
     limit=5
   )
   -> "API rate limits and quotas" (id: 821663b3-...)

The agent's structured result:
{
  "category_name": "technical",
  "priority": "critical",
  "routing_queue": "engineering-support",
  "confidence": {"category": 0.91, "priority": 0.87, "draft_response": 0.85},
  "overall_confidence": 0.88,
  "ambiguity_flags": [
    {
      "field": "category",
      "description": "Enterprise rate limits are tied to billing/contract quotas — if the root cause is account provisioning, 'billing' or 'account' could also apply.",
      "alternatives": ["billing", "account"]
    },
    {
      "field": "priority",
      "description": "Customer has only 1 open ticket (below the 3+ threshold for auto-critical), but the payment processing blockage constitutes a production outage on a revenue-critical flow.",
      "alternatives": ["high"]
    }
  ],
  "kb_articles_consulted": ["821663b3-32bb-4670-9030-852f97b3094a"],
  "prior_tickets_consulted": 0,
  "agent_type": "TriageAgent",
  "model_id": "claude-opus-4-5",
  "source": "agent"
}
The agent hands back that structured result. Nothing is written yet; as the task code showed, there is no intermediate triaged status. The invariants run on the in-memory result: critical priority at 0.25× the category's 8-hour SLA baseline gives a 2-hour deadline. Alice has only 1 open ticket, so the escalation rule (3+ open tickets → tier-2 routing) does not fire and routing_queue stays engineering-support. A single atomic write then moves the ticket to assigned with sla_deadline set to exactly 2 hours after creation.

The ambiguity flags in the result are worth pausing on. The agent flagged that the category assignment was uncertain — the ticket mentioned payment processing, which overlaps with billing. It also flagged that priority could reasonably be high rather than critical given the open ticket count, but justified critical on the grounds of revenue impact. This is exactly the kind of transparency that makes the agent's reasoning auditable: you can see where it was uncertain, what it considered, and why it landed where it did — without having to reconstruct anything from logs.
Everything in the previous sections — the judgment/invariant split, the ORM as context engine, the in-process tools, the event-driven trigger — is infrastructure for a single moment: the agent finishing its reasoning and handing off a result.
How that handoff is designed determines whether the architecture holds together or quietly breaks down.
The default in most integrations is implicit: the agent returns text, you parse it, you extract what you need. This works until the model changes its phrasing, until the output is longer than expected, until a field you were counting on appears in a slightly different form. The fragility is structural. You're treating a non-deterministic system's prose output as if it were a reliable API.
The alternative is to treat the handoff as a contract — a schema defined before the agent runs, enforced by the Claude Agent SDK's structured output feature when it finishes. Pass a JSON schema via the output_format parameter, and the SDK validates the result before returning it. The agent doesn't return text you parse. It returns a validated object that matches a defined shape, or it fails explicitly and retries. Structured outputs are not just a reliability preference; they're also a security boundary because they constrain what downstream code has to accept [12].
What goes in the contract. The schema should capture everything the deterministic layer needs, and nothing it shouldn't have. In the proof-of-concept I built — a support ticket triage agent — the result schema includes the agent's decisions (category, priority, draft response, routing queue), its uncertainty (per-field confidence scores, an explicit list of ambiguity flags where the agent identified competing interpretations), and a minimal audit record (which knowledge base articles were consulted, how many prior interactions were reviewed). No free text. No fields the downstream code has to guess at.
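To make that concrete, here is a pared-down sketch of the contract as a JSON schema. Field names follow the example run shown earlier, but the schema itself is illustrative, not the repo's actual definition; the checker function only demonstrates the idea, while the SDK's output_format validation is what enforces the real contract:

```python
# A pared-down version of the triage result contract (illustrative, not the
# repo's actual schema).
TRIAGE_RESULT_SCHEMA = {
    "type": "object",
    "required": ["category_name", "priority", "routing_queue",
                 "confidence", "overall_confidence", "ambiguity_flags"],
    "properties": {
        "category_name": {"type": "string"},
        "priority": {"enum": ["low", "medium", "high", "critical"]},
        "routing_queue": {"type": "string"},
        "confidence": {"type": "object"},
        "overall_confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "ambiguity_flags": {"type": "array"},
    },
}

def violates_contract(result: dict) -> list[str]:
    """Minimal structural check: report required fields that are missing.
    (Real validation is richer; this shows the shape of the boundary.)"""
    return [f for f in TRIAGE_RESULT_SCHEMA["required"] if f not in result]
```

The value of writing the schema down first is that every downstream consumer — the invariant layer, the audit log, the evaluation harness — reads against the same shape.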
Why the contract still matters even when code enriches results. Once the agent produces its structured output and the task worker writes it to the database, the schema boundary is established. Deterministic code can enrich operational fields afterward (for example, routing overrides from escalation rules), but the handoff shape remains explicit and validated. In this repo, human-correction flow is a next step rather than a fully implemented path.
This mirrors a broader production pattern in regulated domains: the model contributes judgment, while deterministic application logic enforces consequences [13].
This is often described as a typed handoff: the interface between the non-deterministic agent layer and the deterministic application layer, expressed in a shared schema [14]. The contract gives the system a shared language. When something goes wrong, you know exactly which side of the contract to inspect: did the agent violate the schema, or did deterministic code mishandle a valid result?
The repo linked at the end of this article has a working implementation of this pattern. The schema is the right place to start reading — everything else in the system is downstream of it.
This is not a general recommendation to add agents to your stack. It's a description of a pattern that fits a specific class of problem, with specific costs, and specific limitations you need to understand before committing to it.
What it genuinely enables. When you run an agent as a Celery task, you inherit the operational properties of your task queue for free. Retries are automatic — if the LLM API is unavailable, the task backs off and retries without you writing retry logic. Failures surface rather than disappear — exhausted retries land in a dead letter queue, not a silent void. The audit trail is a side effect of the architecture, not a separate system to build and maintain. And cost is bounded by your concurrency setting: if you cap your worker pool at ten concurrent agent tasks, you cannot spend more than ten simultaneous API calls, regardless of traffic. These properties don't require new infrastructure — they're already the contract of any mature task queue.
The cost model is not what you expect. Every agent run is not one API call. An agent that makes three tool calls before producing its final output makes four round trips to the inference API: one for each tool-call decision, plus one for the final result. The Claude Agent SDK documentation describes this as the multi-turn cost model: an agent that makes N tool calls costs N+1 API calls per run. At moderate production volume — several thousand tickets per day — the math works out to meaningful monthly spend.
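The arithmetic is worth writing down once. A back-of-envelope sketch, where the volume and tool-call counts are assumptions for illustration:

```python
# Back-of-envelope for the N+1 cost model. All numbers are assumptions.
tool_calls_per_run = 3                        # agent makes 3 tool calls...
api_calls_per_run = tool_calls_per_run + 1    # ...so 4 round trips per run

tickets_per_day = 5000                        # hypothetical volume
api_calls_per_month = tickets_per_day * api_calls_per_run * 30
# 600,000 inference calls a month for "one agent step per ticket" — the
# per-call token price is what turns this into real budget.
```

Whatever your provider's pricing, multiplying your expected per-run token usage by that call count is the estimate to do before shipping, not after the first invoice.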
Model drift is a production reality you can't architect away. Most LLM providers update their models silently. The model your agent used in January may not be the model it uses in March. Structured output catches schema violations — the SDK will fail and retry if the model returns something that doesn't match your defined contract. What it doesn't catch is behavioral drift: the model produces a valid schema, but its classification patterns shift. Your priority distribution changes. Your routing becomes less accurate. This is prompt drift, and it's quiet — it doesn't throw exceptions. The mitigation is evaluation: running a labeled dataset through your agent periodically and tracking outcome distributions over time. This architecture doesn't solve that problem. It just makes the problem legible — you have a structured audit trail and a consistent schema to evaluate against.
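A periodic evaluation run does not need heavy tooling to start. A minimal sketch follows: replay a fixed labeled set through the agent, then track agreement and the output distribution over time. The pair format and the function itself are assumptions, not part of the repo:

```python
from collections import Counter

def priority_distribution(labeled_runs: list[tuple[str, str]]):
    """labeled_runs: (expected_priority, agent_priority) pairs produced by
    replaying a fixed evaluation set through the agent."""
    agreement = sum(1 for exp, got in labeled_runs if exp == got) / len(labeled_runs)
    distribution = Counter(got for _, got in labeled_runs)
    return agreement, distribution

# Re-run the same labeled set on a schedule; alert when agreement drops or
# the distribution shifts, even though every output still passes the schema.
```

Because the structured audit trail already stores every decision in relational form, building the labeled set is a query plus human review, not a new pipeline.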
The honest case for not using this pattern. A thread on r/ExperiencedDevs from early 2025 is worth sitting with. The top-voted comment, responding to a question about agent ROI: "80-90% of our 'agent use cases' turned out to be well-written deterministic code with a small LLM call for one ambiguous classification step." This isn't anti-AI sentiment — it's production experience. If the decision you're asking the agent to make is consistent enough to be captured in rules, capture it in rules. Rules are free, fast, testable, and explainable. An agent is appropriate when the decision is genuinely contextual in a way that resists codification: where the right answer depends on patterns that are hard to enumerate, where edge cases are frequent, where the cost of being occasionally wrong is lower than the cost of missing the right answer on hard cases. Most business domains have a smaller fraction of truly ambiguous decisions than they initially appear to.
What this architecture doesn't solve. It doesn't solve evaluation — you still need to build a labeled dataset and measure accuracy. It doesn't solve real-time latency requirements — if your user needs a response in under a second, async processing through a task queue is the wrong model. And it doesn't solve fundamental non-determinism — the same input will sometimes produce different outputs, and the right response to that is acceptance and evaluation, not more engineering around it.
Everything in this article reflects choices I made for a specific proof-of-concept, in a specific domain, with a specific set of constraints. Some of those choices are defensible opinions I'd argue for. Others are reasonable defaults that a different team, with different requirements, might reasonably make differently.
The things I'm most confident about: the judgment/invariant split is the right starting question for any agentic system. Structured output is an architectural boundary, not a formatting preference. Running agents inside your existing task queue is almost always better than reaching for new infrastructure.
The things I'd genuinely like to hear pushed back on: Is the ORM-as-context-engine argument overfit to business domains? What's the counter-argument for RAG even when you have structured relational data? Are there async execution patterns where the signal-to-task trigger creates problems I'm not seeing?
The Django community has been building the application layer for two decades. It has deep intuition for what belongs in the framework and what should stay outside it. I'm curious where that intuition lands on agents — whether the pattern described here feels like a natural extension of Django or like something that belongs elsewhere.
The repo is at github.com/al-mz/agentic-django. Fork it, break it, tell me what doesn't hold up. The article and the code are both starting points, not conclusions.
[1] Weaviate, Context Engineering (2025)
[2] LangChain, State of Agent Engineering (2025)
[3] Cleanlab, AI Agents in Production 2025 (2025)
[4] Monte Carlo Data, Introducing Agent Observability
[5] Business Insider, Replit's CEO apologizes after its AI agent wiped a company's code base (July 2025)
[6] Tech Startups, AI Agents Horror Stories: How a $47,000 AI Agent Failure Exposed Hidden Risks (2025)
[7] CrewAI, How to Build Agentic Systems: The Missing Architecture for Production AI Agents
[8] Prisma, The Next Evolution of Prisma ORM
[9] Oracle Developers, Agent Memory: Why Your AI Has Amnesia and How to Fix It
[10] Composio, Why AI Agent Pilots Fail in Production (2025)
[11] Confluent, The Future of AI Agents Is Event-Driven
[12] Palo Alto Unit 42, New Prompt Injection Attack Vectors Through MCP Sampling
[13] AWS, Overcoming LLM Hallucinations in Regulated Industries
[14] AWS, Customize Agent Workflows with Advanced Orchestration — Strands Agents