How to Build an AI Support Agent That Actually Resolves Tickets

#ai #startup #machinelearning #productivity

Sunil Kumar

Technical post for engineers building or evaluating AI customer support systems. The difference between an AI agent that resolves tickets and one that just triages them is almost entirely in the training data and escalation architecture.

Disclosure: I work at Ailoitte, which builds custom AI support agents deployed in Zendesk, Intercom, and Freshdesk. Sharing what the implementation actually looks like.

What's the actual technical difference between an AI support agent and a chatbot?

A scripted chatbot follows decision trees. It can answer questions it was explicitly programmed for and routes everything else to a human. Maintenance is manual — every new question requires a developer update.

An AI support agent uses RAG (retrieval-augmented generation) against your actual knowledge base. It retrieves the documentation relevant to the question and synthesises a natural-language answer the user can act on. New documentation automatically expands what it can answer, with no manual programming per question.

The practical difference: the chatbot routes. The agent resolves.
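
A minimal sketch of that retrieve-then-generate loop (embed(), vector_index, and llm_complete() are placeholders for whatever embedding model, vector store, and LLM client you actually run):

def answer_ticket(question: str, top_k: int = 5) -> tuple[str, float]:
    # Retrieve the knowledge-base chunks most relevant to the question.
    chunks = vector_index.search(embed(question), top_k=top_k)
    context = "\n\n".join(c.text for c in chunks)
    prompt = (
        "Answer the customer's question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # Return the drafted answer plus the top retrieval score, which later
    # feeds the confidence-threshold escalation check.
    return llm_complete(prompt), chunks[0].score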

What should the training data include?

Three sources, in order of importance:

1. Past resolved tickets — the most valuable training signal. Tickets that were resolved with high satisfaction ratings show the agent what good answers look like in your specific product context. Include the ticket text, the agent's response, and the resolution status.

2. Product documentation — official docs, help centre articles, FAQ pages. Chunk these semantically, not by fixed token count, to preserve logical relationships within articles. A how-to that spans multiple sections should not be split at 1024 tokens. (A chunking sketch follows this list.)

3. Internal knowledge base — anything your human support team uses to answer questions: internal wikis, Slack threads with documented resolutions, runbooks for common errors.
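
Semantic chunking can be as simple as packing whole paragraphs or headed sections up to a budget rather than cutting at a fixed offset. A sketch, with a deliberately crude token counter you would swap for your model's real tokenizer:

import re

def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def semantic_chunks(doc: str, max_tokens: int = 1024) -> list[str]:
    chunks, current = [], ""
    # Split on blank lines so headed sections and paragraphs stay whole,
    # rather than cutting mid-article at a fixed token offset.
    for section in re.split(r"\n\s*\n", doc):
        candidate = f"{current}\n\n{section}".strip()
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)
            current = section
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks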

The mistake most teams make: training on documentation alone. The docs tell the agent what the product does. Past tickets show what a good answer looks like in actual user context. Both are necessary. Documentation-only agents tend to answer correctly but not helpfully.

How does escalation logic work?

The escalation decision should be based on three signals combined — not any one in isolation.

Signal 1: Confidence threshold

If the retrieval confidence is below a set threshold, the agent escalates rather than guessing. Starting point: 0.72. This needs per-product tuning — the right number for a developer tool differs from a consumer SaaS.

# Tune per product; 0.72 is only a starting point.
CONFIDENCE_THRESHOLD = 0.72

def should_escalate_confidence(retrieval_score: float) -> bool:
    # Escalate instead of guessing when retrieval confidence is low.
    return retrieval_score < CONFIDENCE_THRESHOLD

Signal 2: Topic classification

Certain topic categories should always escalate regardless of confidence: billing disputes, account security, legal questions, anything where a wrong answer has a meaningful cost. Maintain a classification list and check it before evaluating confidence.

# Topics where a wrong answer has a meaningful cost; always escalate these.
ALWAYS_ESCALATE_TOPICS = {
    "billing_dispute",
    "account_security",
    "legal",
    "data_deletion",
    "fraud",
}

def should_escalate_topic(classified_topic: str) -> bool:
    # Checked before the confidence threshold, regardless of score.
    return classified_topic in ALWAYS_ESCALATE_TOPICS

Signal 3: User sentiment

If the user has expressed frustration or repeated the same question more than once in the conversation, escalate even if the agent has a confident answer. The interaction has already broken down — a technically correct response will not recover it.

def should_escalate_sentiment(message_history: list) -> bool:
    frustration_signals = [
        "this is ridiculous", "not helpful", "speak to a human",
        "still not working", "useless", "worst",
    ]
    # Assumes messages look like {"sender": "user", "text": "..."};
    # only the user's own messages should trigger this signal.
    user_texts = [m["text"].lower().strip() for m in message_history
                  if m.get("sender") == "user"]
    # Escalate on a frustration phrase or a repeated identical message.
    return any(s in t for t in user_texts for s in frustration_signals) \
        or len(user_texts) != len(set(user_texts))

The combined decision

def escalate(retrieval_score: float, topic: str, message_history: list) -> bool:
    # Any single signal is sufficient to hand off to a human.
    return (
        should_escalate_confidence(retrieval_score) or
        should_escalate_topic(topic) or
        should_escalate_sentiment(message_history)
    )
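
For example, a billing dispute escalates even with high retrieval confidence, using the message shape assumed above:

history = [{"sender": "user", "text": "I was charged twice this month"}]
assert escalate(0.91, "billing_dispute", history)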

The escalation handoff must include:

  • Full conversation context
  • The agent's attempted answers
  • The reason for escalation (confidence / topic / sentiment)

Human reps should not need to ask the user to start over. The context must transfer completely.
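
One way to make that contract explicit, assuming a Python service. The field names are illustrative, not a Zendesk or Intercom schema:

from dataclasses import dataclass

@dataclass
class EscalationHandoff:
    ticket_id: str
    conversation: list[dict]        # full message history, verbatim
    attempted_answers: list[str]    # what the agent already tried
    escalation_reason: str          # "confidence" | "topic" | "sentiment"
    retrieval_score: float | None = None  # set for confidence escalations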

What does the deployment architecture look like?

Zendesk

The agent integrates via the Zendesk Apps Framework. Incoming tickets trigger the agent via webhook. The agent retrieves context, generates a response, and either posts it as a public reply or escalates to the human queue with its analysis attached.

Ticket created
    │
Webhook → Agent API
    ├─ Confidence high + topic safe + sentiment OK → Post public reply
    └─ Any escalation signal → Move to human queue + attach analysis
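
A minimal sketch of the webhook side, assuming FastAPI; run_rag_pipeline, move_to_human_queue, and post_public_reply are hypothetical helpers standing in for your retrieval pipeline and the platform API calls:

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/webhooks/zendesk/ticket-created")
async def handle_ticket_created(request: Request):
    payload = await request.json()
    ticket_id = payload["ticket_id"]   # field names depend on your trigger config
    question = payload["description"]

    # run_rag_pipeline is a hypothetical helper wrapping retrieval + generation.
    answer, score, topic = run_rag_pipeline(question)
    history = [{"sender": "user", "text": question}]

    if escalate(score, topic, history):
        # Hypothetical helper: moves the ticket and attaches the agent's analysis.
        move_to_human_queue(ticket_id, attempted_answer=answer, score=score)
    else:
        post_public_reply(ticket_id, answer)  # hypothetical helper
    return {"status": "processed"}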

Intercom

Similar pattern via the Intercom API and Fin framework, or a custom integration. Fin is faster to deploy; a custom integration gives finer control over the response logic and the escalation handoff format.

The key architectural decision: synchronous vs async

Live chat → synchronous is required. Response must arrive within the user's attention window (~3–5 seconds). This constrains retrieval depth — you cannot run a slow cross-encoder re-ranker in a synchronous live chat flow.

Email-based tickets → async with a short queue is acceptable. This allows batching, deeper retrieval, and more careful confidence evaluation. The latency budget is minutes, not seconds.

Most support systems need both paths. Design for the sync constraint first; async is easier to add.
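
One way to encode the two budgets is a per-channel retrieval profile. The 3–5 second window comes from the paragraph above; everything else is an assumed configuration shape:

RETRIEVAL_PROFILES = {
    "live_chat": {  # synchronous: answer inside the attention window
        "max_latency_s": 3.0,
        "top_k": 5,
        "rerank": False,  # no slow cross-encoder re-ranker in the sync path
    },
    "email": {      # async: a latency budget of minutes allows deeper retrieval
        "max_latency_s": 120.0,
        "top_k": 25,
        "rerank": True,   # cross-encoder re-ranking is affordable here
    },
}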

What actually determines resolution quality?

In order of impact:

  1. Training data quality — past tickets with high satisfaction ratings, not just documentation dumps
  2. Semantic chunking — preserving logical relationships within docs rather than splitting at token boundaries
  3. Escalation logic — catching the cases where a technically correct answer is not the right answer
  4. Handoff quality — how much context the human rep receives on escalation

The model is rarely the bottleneck. A well-structured RAG pipeline with good training data and correct escalation logic will outperform a better model with poor data and binary escalation.

What escalation architecture are you using — threshold-based, classification-based, or hybrid? Has anyone moved away from the confidence threshold as the primary signal? Curious what's actually working in production.