AI Engineer · 37m

The Production AI Playbook: Deploying Agents at Enterprise Scale — Sandipan Bhaumik, Databricks

TL;DR

  • Model choice should come late, not first: Bhaumik says most teams start with GPT vs. Claude debates, but in one 8-week banking POC his team picked the model in week seven after building evals, tracing, and data plumbing first.

  • Three gaps kill production AI: He frames the recurring failures as an observability gap, an evaluation gap, and a governance gap, meaning teams cannot see what the system did, cannot measure what matters, and do not know who owns failures at 3 a.m.

  • Evaluation needs three layers: Deterministic checks catch formats and PII, semantic checks use LLM-as-a-judge for groundedness and relevance, and behavioral evals catch expensive issues like an agent making three duplicate database calls for one account-balance answer.

  • Data quality becomes unforgiving with agents: Bhaumik says humans tolerate bad data in reports, but agents do not, because they return wrong answers confidently, which is why he spends about 60% of project time on data foundations.

  • Tracing turns AI from a black box into an operable system: In a banking dispute flow, traces exposed every step from intent classification to policy retrieval to guardrails, and later helped diagnose a CSAT drop caused by an outdated policy document in the vector database.

  • Governance includes prompts, models, and incident response: He argues prompt versioning should follow enterprise change management, model updates must be tested against your own eval set, and AI incidents need a playbook of detect, diagnose, contain, and fix tied into ITSM systems.
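
The three evaluation layers from the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the setup used in the talk: `AgentRun`, the account-number regex, the 0.7 threshold, and `overlap_judge` (a toy word-overlap score standing in for a real LLM-as-a-judge call) are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentRun:
    """One recorded agent response (hypothetical shape, not from the talk)."""
    answer: str
    context: str  # retrieved text the answer should be grounded in
    tool_calls: list[tuple[str, str]] = field(default_factory=list)  # (tool, args)


# Assumed PII pattern: any run of 8 or more digits is treated as an account number.
ACCOUNT_NUMBER = re.compile(r"\b\d{8,}\b")


def deterministic_check(run: AgentRun) -> bool:
    """Layer 1: cheap rules. The answer is non-empty and leaks no account number."""
    return bool(run.answer.strip()) and not ACCOUNT_NUMBER.search(run.answer)


def overlap_judge(answer: str, context: str) -> float:
    """Toy stand-in for an LLM judge: share of answer words found in the context."""
    words = re.findall(r"\w+", answer.lower())
    context_words = set(re.findall(r"\w+", context.lower()))
    return sum(w in context_words for w in words) / len(words) if words else 0.0


def semantic_check(run: AgentRun, judge: Callable[[str, str], float],
                   threshold: float = 0.7) -> bool:
    """Layer 2: groundedness score from a judge, compared to an assumed threshold."""
    return judge(run.answer, run.context) >= threshold


def behavioral_check(run: AgentRun) -> bool:
    """Layer 3: fail if the agent repeated an identical tool call."""
    return len(run.tool_calls) == len(set(run.tool_calls))


def evaluate(run: AgentRun,
             judge: Callable[[str, str], float] = overlap_judge) -> dict[str, bool]:
    """Run all three layers and report each result separately."""
    return {
        "deterministic": deterministic_check(run),
        "semantic": semantic_check(run, judge),
        "behavioral": behavioral_check(run),
    }


if __name__ == "__main__":
    run = AgentRun(
        answer="Your balance is 120 dollars",
        context="account balance is 120 dollars",
        tool_calls=[("get_balance", "acct=1")] * 3,  # three duplicate calls
    )
    print(evaluate(run))
```

Reporting the layers separately matters here: the example run passes the format and groundedness checks and fails only the behavioral one, which is the kind of costly issue the duplicate-database-call example describes.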

The Breakdown

A retail bank burned $85,000 on a chatbot POC that looked great in demos and fell apart in production, until Sandipan Bhaumik rebuilt the project around five pillars and delayed model selection until week seven. His core point is blunt: enterprise agents fail less because of model choice than because teams skip evaluation, tracing, data foundations, orchestration, and governance.
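
One way to picture why model selection can wait: once an eval set exists, choosing a model (or vetting a model update) reduces to scoring each candidate against it. The sketch below is a hypothetical illustration, not the team's harness. The stub models, the eval cases, and the substring-match scoring rule are all assumptions.

```python
from typing import Callable

# (question, substring the answer must contain): an assumed, simplistic eval format.
EvalCase = tuple[str, str]
Model = Callable[[str], str]


def pass_rate(model: Model, cases: list[EvalCase]) -> float:
    """Fraction of eval cases whose expected substring appears in the model's answer."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(expected.lower() in model(question).lower()
                 for question, expected in cases)
    return passed / len(cases)


def pick_model(candidates: dict[str, Model],
               cases: list[EvalCase]) -> tuple[str, dict[str, float]]:
    """Score every candidate on the same eval set and return the best name plus all scores."""
    scores = {name: pass_rate(model, cases) for name, model in candidates.items()}
    best = max(scores, key=lambda name: scores[name])
    return best, scores


if __name__ == "__main__":
    # Stub callables standing in for real model clients.
    candidates = {
        "model_a": lambda q: "The dispute window is 60 days.",
        "model_b": lambda q: "I am not sure.",
    }
    cases = [
        ("How long is the dispute window?", "60 days"),
        ("What is the card replacement fee?", "no fee"),
    ]
    print(pick_model(candidates, cases))
```

The same function can serve as a regression gate: rerun it when a provider ships a model update and compare the new score with the recorded one before promoting the change.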
