Boundary map
May 14, 2026

What The Bot Should Not Answer

Where the line sits between what an AI support agent answers on its own, what it routes to a human queue, and what it must always escalate. The model is rarely the problem.


Why the boundary is the whole game

The most important deployment question in AI customer support is not whether the agent can answer a ticket. It is whether the agent should be allowed to answer that ticket without a human.

Most failed deployments fail at that boundary, not at the model.

The model is good enough for many informational requests, but the boundary is drawn so widely that the agent ends up making policy exceptions, giving regulated advice, trapping angry customers in loops, or resolving tickets the business later has to reopen. The opposite failure is just as common, and quieter: the team buys an AI agent, restricts it to a narrow band of FAQ answers, and leaves humans handling order status, password resets, return-window checks, and setup questions the agent could have handled at lower cost and with faster response times.

A settled framework does not exist.

Vendors converge on the operational triggers for handing off (sentiment, repeated questions, explicit human request, regulated topics, low confidence). Analysts split between automation optimism (Gartner forecasting 80 percent resolution by 2029) and operational caution (McKinsey emphasizing that scaling is limited by data quality, system integration, change management, and customer resistance). Academic research keeps surfacing the same finding: AI works best as a collaborator rather than a replacement, especially when intent is ambiguous or judgment matters. A Harvard Business School study of more than 250,000 support chats found AI helped human agents respond about 20 percent faster, but only when AI complemented rather than replaced human judgment.

Four schools of thought are visible in practice. The automation-first school optimizes for deflection and cost reduction. The control-first school limits AI to low-risk FAQ work. The collaboration-first school uses AI to classify, draft, retrieve, and gather context, with humans kept in the loop for judgment. The risk-first school hard-escalates anything that touches legal, safety, privacy, fraud, medical, financial, or high-value relationship territory. A usable operator framework combines all four: automate where the evidence is strong, route where classification is reliable but judgment is needed, and escalate where the downside of being wrong is not recoverable.

The three categories of work

The three categories are not ticket types; they are decisions the operator makes about each ticket. Two checks resolve every inbound ticket: is the consequence severe, and is the request bounded? The figure below shows the flow, and the three subsections that follow elaborate each branch.

Figure 1 — Boundary map: decision flow between bot answers, routed cases, and mandatory escalations.

The boundary becomes three decisions, not three archetypes. Escalate, answer, or route.
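To make the two checks concrete, here is a minimal sketch of the decision flow in Python. The ticket fields, topic labels, and category names are illustrative assumptions, not part of any vendor's API; the point is that the severe-consequence check runs before any model confidence score is consulted.

```python
from dataclasses import dataclass
from enum import Enum

class Decision(Enum):
    ANSWER = "answer"
    ROUTE = "route"
    ESCALATE = "escalate"

# Illustrative high-consequence topics; take the real list from step 2 of the methodology below.
SEVERE_TOPICS = {
    "legal", "fraud", "safety", "privacy", "medical",
    "regulated_finance", "security_vulnerability",
}

@dataclass
class Ticket:
    topic: str              # classifier output, e.g. "order_status"
    source_grounded: bool   # the answer exists in approved knowledge or a system lookup
    reversible: bool        # a wrong answer can be repaired cheaply
    multi_issue: bool       # more than one unresolved issue in the message

def decide(ticket: Ticket) -> Decision:
    # Check 1: is the consequence severe? Authority, not model confidence, decides.
    if ticket.topic in SEVERE_TOPICS:
        return Decision.ESCALATE
    # Check 2: is the request bounded? Grounded, reversible, single-issue work is answerable.
    if ticket.source_grounded and ticket.reversible and not ticket.multi_issue:
        return Decision.ANSWER
    # Everything else: triage, gather context, and hand off to the right queue.
    return Decision.ROUTE
```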

Answer: AI handles autonomously

Principle: AI can answer on its own when the request is bounded, source-grounded, low-consequence, reversible, and measurable. The answer should come from approved knowledge or a deterministic system lookup. The agent should not need to invent a policy, infer intent across several unresolved issues, negotiate, or bind the company to an exception.

This category includes order status lookups, tracking links, password resets, return eligibility inside a published policy, basic subscription changes, cancellation steps, simple billing FAQs, product setup from current help docs, shipping cutoffs, warranty status lookups, and known troubleshooting flows with a fixed decision tree. The shared pattern is that a human would otherwise copy a policy line, pull a value from a system, or send a standard instruction. The cost of getting it occasionally wrong is low enough that the business can repair without legal exposure or relationship damage.

Vendor-reported figures support this scope, with the caveat that they are vendor figures, not independent benchmarks. Intercom reports an average Fin resolution rate of 67 percent across more than seven thousand teams. Gamma reports 75 percent end-to-end resolution after deployment, with manual handling dropping from 94 percent to 24 percent and Customer Satisfaction Score (CSAT) staying at 84 percent. Klarna reported its OpenAI-powered assistant handled two-thirds of customer service chats in its first month, with customer satisfaction on par with humans and repeat inquiries down 25 percent. Decagon reports that Substack's agent resolves over 90 percent of inquiries, including refunds and cancellations through API actions. Sierra reports a UK retailer reaching over 70 percent resolution after rollout.

None of these numbers means every business should target 70 percent automation. They mean that routine, source-grounded work can move into Answer once the knowledge base is clean, authentication is handled, action permissions exist, and rollback paths are sound.

Route: AI triages but does not answer

Principle: AI should route when it can identify intent, collect context, and choose the next owner, but the actual resolution requires diagnosis, discretion, cross-functional judgment, or a system action that should not be delegated yet.

Route is the most underdesigned category. Many teams reduce the choice to bot-answers or human-answers and miss the work AI can do before handoff: detect intent, ask one or two clarifying questions, pull account context, summarize the conversation, tag the issue, assign the right queue, and tell the customer what will happen next. That is still automation. It just does not pretend that classification equals resolution.
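One way to make "classification is not resolution" concrete is to treat the handoff itself as a structured artifact the agent must fill before routing. A minimal sketch, with illustrative field names:

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    """Context package the agent assembles before routing. Field names are illustrative."""
    intent: str                                              # e.g. "billing_dispute"
    queue: str                                               # e.g. "billing_tier2"
    summary: str                                             # short recap of the conversation so far
    clarifying_answers: dict = field(default_factory=dict)   # the one or two questions asked and answered
    account_context: dict = field(default_factory=dict)      # plan, tenure, open orders, account value
    tags: list = field(default_factory=list)
    next_step_message: str = ""                              # what the customer was told will happen next
```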

Common Route tickets include technical support requiring diagnostic exchange, bug reports with screenshots or logs, billing disputes that cross a policy threshold, multi-issue tickets, unclear cancellation requests, complaints from high-value accounts, ambiguous refund claims, repeat contacts, authentication failures, angry but non-legal complaints, and tickets where the answer depends on customer history. The agent classifies and gathers; a human decides the outcome.

Vendor docs increasingly describe this middle state. Salesforce distinguishes default escalation (fires the moment a classifier sees a trigger and dumps the customer into a queue) from dynamic escalation (collects information, creates a case, routes intelligently, then hands off). Zendesk's escalation flows tag conversations before handoff so the human agent receives the right context. Decagon describes warm handoffs triggered by sentiment, complexity, direct human request, loops, high-value customer thresholds, and cases where empathy or creative problem-solving is needed.

A Route ticket is not a failure if the handoff is fast, contextual, and correctly assigned. It becomes a failure when the business counts route-worthy work as AI failure, or when the agent keeps trying to answer after it has enough evidence that judgment is required.

Escalate: must always go to a human

Principle: AI must escalate when the cost of a wrong answer exceeds any plausible model-confidence threshold, when the matter is regulated or safety-sensitive, when the customer asks for a human in a high-risk context, or when the agent lacks authority to bind the company.

This category is not about model intelligence. It is about authority and consequence. A support agent can be eloquent and still be the wrong actor. Escalate covers legal threats, chargebacks, fraud allegations, suspected account takeover, data privacy requests, security vulnerability reports, harassment or abuse, product safety claims, injury claims, medical or mental-health questions, regulated financial advice, credit decisions, employment claims, law-enforcement requests, media inquiries, executive escalations, and refund or policy exceptions outside a published rule. It also includes churn-risk customers above a defined value threshold when the relationship cost is material.

The escalation rule should be boring and strict. When a ticket touches law, health, safety, fraud, privacy, regulated finance, public reputation, or material relationship value, the agent may acknowledge receipt and gather neutral facts, but the decision moves to a human.

Failure modes when the boundary is drawn wrong

Over-automation

Over-automation happens when the agent answers tickets it cannot safely or reliably handle. The cost is loss of trust, rework, legal exposure, refund leakage, and disputes about whether the AI "resolved" a ticket that should never have been treated as resolved.

Air Canada is the clean public example. A customer relied on incorrect bereavement-fare information from Air Canada's chatbot. The British Columbia Civil Resolution Tribunal held the company liable for the chatbot's answer and awarded compensation, rejecting the argument that the chatbot was a separate legal actor. The lesson is not that chatbots can hallucinate. The sharper lesson is that policy advice with financial consequence needs an approved source, an audit trail, and a human path when the policy is nonstandard. The National Eating Disorders Association suspended its Tessa chatbot after users reported weight-loss and calorie-counting advice in a medical-adjacent support context. McDonald's ended its IBM AI drive-thru test after viral order mishaps showed that noisy real-world voice interactions are harder than demo conditions, especially when accents, dialects, background noise, and order ambiguity matter.

Under-automation

Under-automation is quieter, but it kills the business case. Humans keep handling tickets the agent could resolve, and the deployment looks like a failed investment even though the model could have handled more.

Evidence is mostly indirect because companies publicize automation wins, not missed automation. The Klarna and Gamma figures cited above show what correct scoping looks like. The pattern in the failure direction is a support team that leaves all routine, source-grounded work with humans, keeping the full cost of manual handling without gaining anything in quality.

Mis-routing

Mis-routing occurs when the agent correctly detects that it should not answer but sends the ticket to the wrong queue, omits context, or escalates before gathering the facts the next agent needs. The customer pays for it with re-explaining, handoff friction, and slower resolution.

Vendor documentation reveals how easy this is to get wrong. Salesforce's guidance specifically calls out the failure mode: a default escalation that fires the moment a classifier sees a trigger prevents the agent from collecting information before handoff. Zendesk tells teams to add tags and labels before handoff and to test the escalation experience so human agents receive the right context. Decagon's Notion case study reports that intelligent routing contributed to resolution-time improvements of up to 34 percent and that only 3.4 percent of customers asked for a human, suggesting routing quality is not cosmetic; it changes both customer effort and queue load.

Brittle escalation

Brittle escalation is the worst customer experience: the agent knows too little to solve the problem but keeps trying. The customer asks again, rephrases, gets another generic answer, and either abandons the conversation or becomes angry enough to damage the relationship.

The Consumer Financial Protection Bureau's banking-chatbot report describes this pattern directly: customers can get trapped in loops, receive inaccurate information, or be unable to reach a human when the issue requires tailored support. The DPD chatbot incident, where a frustrated customer induced the bot to swear and criticize the company after a parcel issue, is the brand-failure version of the same problem. DPD disabled part of the chatbot after the exchange went public.

Escalation cannot be a passive fallback. It has to be a designed state with triggers, limits, queue ownership, and measurement.

A methodology for drawing the line in your own org

1. Audit current ticket volume by category

Pull the last 30 to 90 days of tickets. If volume is low, use all of them. If volume is high, sample at least 500 per major channel. Classify each ticket by topic, customer intent, complexity, outcome, handle time, reopen rate, CSAT, queue, product area, customer value, systems touched, and whether the human response required judgment or only retrieval. Tag where the final answer came from: help center, policy, internal Standard Operating Procedure (SOP), CRM, order system, billing system, engineering investigation, legal or compliance, or manager discretion. The first output is a ticket map: which categories are routine, which are ambiguous, which require authority, which create risk. The second output is a baseline: current resolution time, repeat-contact rate, escalation rate, human effort, and customer satisfaction by category.
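A minimal sketch of the audit step, assuming a flat CSV export with columns like channel, topic, handle_minutes, reopened, csat, and a manually applied required_judgment tag; rename everything to match whatever your helpdesk actually exports:

```python
import pandas as pd

# Assumed export with one row per ticket; column names are placeholders for your schema.
tickets = pd.read_csv("tickets_last_90d.csv")

# Sample at least 500 tickets per major channel when volume is high.
sample = (
    tickets.groupby("channel", group_keys=False)
    .apply(lambda g: g if len(g) <= 500 else g.sample(500, random_state=0))
)

# First output, the ticket map: volume and baseline metrics per topic category.
ticket_map = sample.groupby("topic").agg(
    volume=("ticket_id", "count"),
    median_handle_minutes=("handle_minutes", "median"),
    reopen_rate=("reopened", "mean"),
    csat=("csat", "mean"),
    judgment_share=("required_judgment", "mean"),  # manual tag: judgment vs. pure retrieval
)
print(ticket_map.sort_values("volume", ascending=False))
```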

2. Map regulatory, policy, and brand constraints

Before testing AI performance, remove categories the agent should not own. Legal threats, chargebacks, fraud allegations, privacy-law requests, security reports, safety claims, regulated finance, medical-adjacent questions, injury claims, and policy exceptions above a threshold go directly into Escalate. Add business-specific constraints: enterprise accounts, VIP customers, renewal-risk accounts, accounts above a spend threshold, public complaints, press inquiries, or topics where the brand cost of a poor answer is high. This step prevents the most common operator mistake: letting model confidence decide a category that should be decided by authority and consequence.
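This step can be written down as configuration before any model evaluation happens. A sketch with assumed topic labels, tiers, and thresholds; the real list should come from legal, compliance, and the account team, not from engineering:

```python
# Categories removed from the bot's scope before any model evaluation.
# Topic labels, tiers, and thresholds below are assumptions for illustration.
HARD_ESCALATE_TOPICS = {
    "legal_threat", "chargeback", "fraud", "privacy_request", "security_report",
    "safety_claim", "regulated_finance", "medical", "injury", "policy_exception",
}

BUSINESS_OVERRIDES = {
    "protected_tiers": {"enterprise", "vip"},
    "min_annual_spend_for_human": 10_000,
    "escalate_on_renewal_risk": True,
}

def out_of_scope(topic: str, tier: str, annual_spend: float, renewal_risk: bool) -> bool:
    """True when the ticket is removed from the bot's scope regardless of model confidence."""
    return (
        topic in HARD_ESCALATE_TOPICS
        or tier in BUSINESS_OVERRIDES["protected_tiers"]
        or annual_spend >= BUSINESS_OVERRIDES["min_annual_spend_for_human"]
        or (renewal_risk and BUSINESS_OVERRIDES["escalate_on_renewal_risk"])
    )
```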

3. Test the agent on representative samples

For each candidate Answer category, run representative historical tickets through the agent in a sandbox using the same knowledge base, policies, tool access, and escalation rules planned for production. Evaluate at least 50 examples per high-volume category. Include easy examples, edge cases, stale documentation, missing data, angry phrasing, multi-issue messages, and out-of-scope requests. Score on correctness, source use, policy compliance, action completion, escalation appropriateness, tone, customer effort, latency, and likelihood of repeat contact. Compare each AI response against the final human resolution, not just against a checklist. A ticket category qualifies for Answer only when the agent performs well on common cases and refuses or escalates cleanly on edge cases.
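A sketch of the evaluation loop, assuming each historical ticket has been paired with its final human resolution and a ground-truth label for whether it should have escalated; the grading function is a stand-in for whatever rubric review (human or model-assisted) the team uses:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    ticket_id: str
    category: str
    ai_response: str
    human_resolution: str    # the answer a human actually gave, used as the reference
    agent_escalated: bool    # did the sandboxed agent refuse or hand off
    should_escalate: bool    # ground-truth label from the audit

def case_passes(case: EvalCase, grade) -> bool:
    """grade(case) stands in for rubric review against the human resolution, returning 0-1."""
    if case.should_escalate:
        # On edge cases, refusing or escalating cleanly is the correct behavior.
        return case.agent_escalated
    return (not case.agent_escalated) and grade(case) >= 0.8

def pass_rate_by_category(cases, grade) -> dict:
    totals, passes = {}, {}
    for c in cases:
        totals[c.category] = totals.get(c.category, 0) + 1
        passes[c.category] = passes.get(c.category, 0) + int(case_passes(c, grade))
    return {cat: passes[cat] / totals[cat] for cat in totals}
```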

4. Write escalation triggers as product requirements

Soft guidance is not enough. Define three trigger classes. Immediate escalation: explicit human request in a high-risk context, legal threat, fraud allegation, safety claim, regulated topic, privacy request, security vulnerability, API failure on a required action, missing required source, or failed authentication. Offered escalation: frustration, repeated question, low-confidence answer, customer says the answer is wrong, repeated contact, or two failed resolution turns. Collect-then-route: bug report, technical issue, billing dispute, refund exception, high-value account, churn signal, or multi-issue ticket. Set thresholds for order value, refund value, customer tier, number of failed turns, sentiment, and repeat-contact windows. Assign each trigger to a queue owner.
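Written as configuration rather than guidance, the three trigger classes might look like the following sketch. Trigger names, queue names, owners, and threshold values are illustrative placeholders:

```python
# Escalation triggers expressed as product requirements, not soft guidance.
# Every trigger class maps to a queue and a named owner.
IMMEDIATE = {
    "triggers": [
        "human_request_high_risk", "legal_threat", "fraud_allegation", "safety_claim",
        "regulated_topic", "privacy_request", "security_vulnerability",
        "required_action_api_failure", "missing_required_source", "failed_authentication",
    ],
    "queue": "priority_human",
    "owner": "support_lead",
}
OFFERED = {
    "triggers": ["frustration", "repeated_question", "low_confidence",
                 "customer_disputes_answer", "repeat_contact", "two_failed_turns"],
    "queue": "general_human",
    "owner": "support_ops",
}
COLLECT_THEN_ROUTE = {
    "triggers": ["bug_report", "technical_issue", "billing_dispute", "refund_exception",
                 "high_value_account", "churn_signal", "multi_issue"],
    "queue": "specialist_by_tag",
    "owner": "queue_manager",
}
THRESHOLDS = {
    "max_order_value": 500,           # above this, offer or force escalation
    "max_refund_value": 200,
    "protected_tiers": {"enterprise", "vip"},
    "max_failed_turns": 2,
    "repeat_contact_window_days": 7,
}
```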

5. Iterate with production data

Review the escalation queue weekly for the first month and at least monthly after that. Track false resolutions, reopen rate, repeat-contact rate, handoff reason, escalation trigger, human edit distance from the AI draft, CSAT by category, and whether the agent had the right source and system access. Categories where humans rarely change the agent's answer are candidates to move from Route to Answer. Categories where the agent frequently escalates, gets corrected, causes repeat contact, or produces policy disputes are candidates to move from Answer to Route or Escalate. Knowledge gaps become approved articles, SOP updates, or action procedures. The boundary should move, but only from evidence, not from vendor optimism.
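The promotion and demotion decisions can also be written as explicit rules rather than judgment calls made in a review meeting. A sketch with illustrative thresholds, to be tuned against your own baseline:

```python
def review_category(stats: dict) -> str:
    """
    Periodic review heuristic per ticket category. Thresholds are illustrative
    starting points, not benchmarks. Expected keys: human_edit_rate, reopen_rate,
    escalation_rate, policy_dispute_rate.
    """
    # Humans rarely change the agent's draft and reopens stay low: promote Route -> Answer.
    if stats["human_edit_rate"] < 0.10 and stats["reopen_rate"] < 0.05:
        return "promote_to_answer"
    # Frequent escalation, heavy correction, or policy disputes: demote toward Route or Escalate.
    if (stats["escalation_rate"] > 0.30
            or stats["human_edit_rate"] > 0.40
            or stats["policy_dispute_rate"] > 0.0):
        return "demote"
    return "hold"
```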

Classification rubric

Use these archetypes to classify your own ticket types by analogy.

Ticket archetype | Default category | Principle
"Where is my order?" with authenticated lookup | Answer | Deterministic lookup, low consequence, reversible.
Password reset or magic-link request | Answer | Standard workflow, no judgment required.
Return inside a published policy window | Answer | Policy-bound and source-grounded.
Product setup question covered by current docs | Answer | Approved knowledge, low downside.
Billing date or invoice-copy request | Answer | System lookup or standard account action.
Shipping-address change before fulfillment | Answer or Route | Answer only if the agent has safe action authority and rollback.
Refund outside policy or above value threshold | Route | Intent is clear; outcome needs discretion.
Bug report with screenshots, logs, or repro steps | Route | Agent collects context; diagnosis belongs to support or engineering.
Customer combines billing issue, bug, and cancellation threat | Route | Multi-issue conversation needs decomposition and a human owner.
Customer says this is the third time they have contacted support | Route | Repeat contact signals unresolved context and higher trust risk.
Angry customer with no legal threat | Route | Emotion increases relationship risk; collect context and hand off.
Explicit request for a human after failed AI answer | Escalate | Customer has rejected the automated path.
Chargeback, legal threat, or demand letter | Escalate | Legal and financial consequence exceeds confidence.
Fraud allegation or suspected account takeover | Escalate | Security and liability risk.
Privacy-law request, deletion request, or data-export dispute | Escalate | Compliance obligation and identity verification.
Medical, mental-health, safety, or injury question | Escalate | Harm risk and regulated-adjacent judgment.
Security vulnerability report | Escalate | Requires controlled intake and specialist handling.
VIP or renewal-risk account asking for an exception | Escalate | Relationship value and authority exceed bot mandate.
Classification rubric: ticket archetype, default category, and the principle that places it there.

What to validate before drawing your own line

Confirm your vertical's regulatory posture, contract obligations, privacy duties, refund authority, and brand sensitivity to AI tone failures. Measure your current ticket distribution and identify which categories are high-volume, low-risk, and source-grounded. Check whether your knowledge base is current enough for autonomous answers and whether the agent can read and write the required systems with audit logs. Test the vendor explicitly at the boundary: human requests, angry customers, ambiguous policies, missing data, API failures, repeat contacts, high-value accounts, and regulated keywords. Assign a human owner for weekly review of the escalation queue. If no one owns the iteration, the boundary will decay as products, policies, and customer behavior change.

The boundary is the deployment. The model is just what enforces it. Get the line right and any reasonable agent works. Get the line wrong and no agent saves you.


Methodology

The framework was built by triangulating four kinds of evidence. Vendor deployment documentation from Intercom, Zendesk, Salesforce, Decagon, and Sierra established the operational triggers that practitioners actually configure. Public analyst frameworks from Gartner and McKinsey established the strategic axis between automation-first and collaboration-first deployment philosophies. Academic research, including the Harvard Business School study of more than 250,000 support chats and Utrecht University's work on chatbot collaboration, established what succeeds as a complement to human judgment versus what fails when AI replaces it. Public incidents (Air Canada, the National Eating Disorders Association's Tessa, the DPD chatbot, McDonald's drive-thru) established the failure modes the framework exists to prevent. The Answer category is grounded in convergent evidence: vendors agree on the automatable categories, case-study data points back the resolution rates, and operator deployments at Klarna, Gamma, Substack, and the unnamed Sierra retailer all settle in similar bands. The Route category is grounded in vendor handoff documentation that converges on the same trigger classes across Intercom, Zendesk, Salesforce, and Decagon. The Escalate category is grounded in regulatory materials, legal precedent (Moffatt v. Air Canada), Consumer Financial Protection Bureau guidance, and Federal Trade Commission enforcement on unsupported AI claims. Specific thresholds (outcome-value cutoffs, churn-risk markers, sentiment scoring) should be tuned to your vertical, customer base, and regulatory posture. The category framework should not need tuning.

Sources

  1. Intercom, Manage Fin AI Agent's escalation guidance and rules
  2. Intercom, Fin AI Agent automation rate
  3. Intercom, From resolutions to outcomes
  4. Intercom Research, To escalate or not to escalate
  5. Zendesk, About automated resolutions for AI agents
  6. Zendesk, Escalating advanced AI-agent conversations to Zendesk Support
  7. Zendesk, Understanding AI-agent tickets
  8. Salesforce, The art of the handoff
  9. Decagon, What is an AI escalation policy?
  10. Decagon, AI customer service agent capabilities
  11. Decagon, Substack case study
  12. Decagon, Notion case study
  13. Sierra, Expert Answers
  14. Sierra, Fast to build, faster to impact
  15. Gartner, Agentic AI will autonomously resolve 80 percent of common customer-service issues by 2029
  16. McKinsey, The contact center crossroads
  17. Harvard Business School Working Knowledge, When AI chatbots help people be more human
  18. Utrecht University, Customer service chatbots: from automation to collaboration
  19. Consumer Financial Protection Bureau, Chatbots in consumer finance
  20. American Bar Association, Moffatt v. Air Canada summary
  21. Wired, NEDA suspends Tessa chatbot
  22. AP News, McDonald's ends IBM AI drive-thru test
  23. The Guardian, DPD chatbot incident
  24. Gamma case study via Fin
  25. Klarna, AI assistant handles two-thirds of customer service chats
  26. Federal Trade Commission, Operation AI Comply
