AI Engineer · 26m

From Chaos to Choreography: Multi-Agent Orchestration Patterns That Actually Work — Sandipan Bhaumik

TL;DR

  • Multi-agent AI breaks like distributed systems, not prompts — Sandipan Bhaumik says teams fail when they treat “adding five more agents” like adding features, when the real problem is coordination, race conditions, and cascading failures across 10+ agent relationships.

  • A real credit decisioning deployment went sideways because of stale cache, not bad models — in a financial services system with five agents, 20% of decisions got incorrect risk ratings because PostgreSQL writes succeeded but cache invalidation failed, so another agent read a stale score 500 milliseconds later.

  • Orchestration beats autonomy in regulated production environments — Bhaumik recommends centralized orchestration for complex, stable workflows like credit decisions because it gives one execution graph, retries, rollback, and clear debugging, while choreography only works if you have truly strong observability.

  • Immutable state snapshots with versioning are the antidote to race conditions — instead of letting multiple agents update the same record, each agent should append a new sealed state version, making rollback, replay, lineage, and debugging far easier.

  • Data contracts, circuit breakers, and saga compensation are non-negotiable production patterns — agents need schema-validated handoffs, fail-fast protection after repeated downstream errors, and reversible execute/compensate steps so partial failures don’t corrupt the whole workflow.

  • His Databricks reference stack is basically a production recipe — LangGraph plus Mosaic AI Agent Framework handles orchestration, Unity Catalog versions and governs agents and schemas, Delta Lake stores immutable state rows, and MLflow traces latency, inputs, outputs, and token usage per agent call.

The Breakdown

The Big Reframe: Multi-Agent Means Distributed Systems

Sandipan Bhaumik opens with the core warning: once you move beyond one agent, you are no longer "just building AI," you're building a distributed system. He says he has watched strong engineers repeat the same mistake in production for two years — assuming that adding more agents is like adding more features, when it actually multiplies coordination and failure modes.

The Credit Decisioning Failure That Made the Point

His war story is a credit decisioning system for a financial services company: credit score, income verification, risk assessment, fraud detection, final approval. The single-agent version worked for two weeks with zero issues, but after adding four more agents, 20% of decisions had incorrect risk ratings because one agent wrote a score of 750 while another read a stale cached value of 680 just 500 milliseconds later.

Why Five Agents Feel 25 Times Harder

Bhaumik explains the complexity jump in simple but painful terms: one agent has zero coordination problems, two agents have one connection, and five agents create ten potential coordination links — one for every pair. Every one of those links can become a race condition, stale-state issue, or synchronization bug, which is why the complexity curve "doesn't lie."
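The curve he describes is just the number of agent pairs, n(n−1)/2, which matches his figures (zero links for one agent, one for two, ten for five). A quick sketch:

```python
def coordination_links(num_agents: int) -> int:
    """Number of pairwise coordination links between agents: n * (n - 1) / 2."""
    return num_agents * (num_agents - 1) // 2

for n in [1, 2, 5, 10]:
    print(f"{n} agents -> {coordination_links(n)} links")
```

Ten agents already means 45 links, each a place where state can go stale.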

Choreography vs. Orchestration: The Real Design Choice

He walks through the two coordination patterns most teams need to choose between. Choreography is event-driven and loosely coupled — agents publish and subscribe to events on a bus — which scales autonomy nicely, but debugging becomes detective work unless observability is excellent; orchestration, by contrast, puts a central workflow engine in charge of sequencing, parallelism, retries, and logs.
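To make the choreography side concrete, here is a minimal in-process sketch of agents publishing and subscribing on an event bus. The topic names and payload fields are illustrative, not from the talk — they just show why no single component sees the full flow, which is exactly what makes debugging "detective work."

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """Tiny pub/sub bus: agents register handlers per topic and react to events."""
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
log: list[dict] = []

# Loosely coupled agents: each reacts to an event and emits a follow-up event.
# Neither knows the overall workflow — that knowledge lives nowhere central.
bus.subscribe("application.received",
              lambda e: bus.publish("score.computed", {**e, "score": 750}))
bus.subscribe("score.computed", lambda e: log.append(e))

bus.publish("application.received", {"applicant": "A-123"})
```

After the publish, `log` holds the scored application, but tracing how it got there requires following the event chain — hence his point about needing excellent observability.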

Why Regulated Teams Usually Pick the Conductor

For industries like financial services, he says orchestration wins almost every time because when something goes wrong, you need to know exactly which agent acted, in what order, and with what data. He gives a practical implementation angle too: LangGraph wired into Databricks’ AI agent framework is one example, but really any DAG-based workflow engine with solid retries fits the pattern.
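The talk names LangGraph on Databricks, but his broader point is that any DAG workflow engine with retries fits. A library-free sketch of that shape — sequenced steps, bounded retries per step, and an audit log recording which agent ran, on which attempt, with what outcome (the agent functions here are hypothetical stand-ins for the credit-decisioning steps):

```python
def run_workflow(steps, state, max_retries=3):
    """steps: list of (name, fn) pairs; each fn takes a state dict and returns a new one.
    Returns the final state plus an audit log of (step, attempt, outcome)."""
    audit_log = []
    for name, fn in steps:
        for attempt in range(1, max_retries + 1):
            try:
                state = fn(dict(state))  # copy in, new state out — no mutation
                audit_log.append((name, attempt, "ok"))
                break
            except Exception as exc:
                audit_log.append((name, attempt, f"error: {exc}"))
                if attempt == max_retries:
                    raise  # exhausted retries: surface the failure
    return state, audit_log

steps = [
    ("credit_score", lambda s: {**s, "score": 750}),
    ("risk", lambda s: {**s, "risk": "low" if s["score"] > 700 else "high"}),
]
final, log = run_workflow(steps, {"applicant": "A-123"})
```

The audit log is the regulated-industry payoff: when something goes wrong, you can answer exactly which agent acted, in what order, and with what data.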

Shared Mutable State Is the Trap

The next failure point is state management, and he’s blunt about the common mistake: multiple agents reading and writing the same database row and assuming the database will save them. His fix is immutable, append-only state snapshots with versioning, where each agent receives version N, validates the schema, creates version N+1, and never mutates the old state.
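A minimal sketch of that append-only pattern: each agent reads version N and appends a sealed version N+1, and every historical version stays available for replay and lineage. (Field names are illustrative; his production storage for this is Delta Lake, not an in-memory list.)

```python
from dataclasses import dataclass
from types import MappingProxyType

@dataclass(frozen=True)
class StateVersion:
    version: int
    data: MappingProxyType  # read-only view: agents cannot mutate old state

class StateLog:
    """Append-only log of immutable state snapshots."""
    def __init__(self, initial: dict) -> None:
        self._versions = [StateVersion(0, MappingProxyType(dict(initial)))]

    def latest(self) -> StateVersion:
        return self._versions[-1]

    def append(self, updates: dict) -> StateVersion:
        prev = self.latest()
        nxt = StateVersion(prev.version + 1, MappingProxyType({**prev.data, **updates}))
        self._versions.append(nxt)
        return nxt

    def at(self, version: int) -> StateVersion:
        """Replay/debug: fetch any historical version by number."""
        return self._versions[version]

log = StateLog({"applicant": "A-123"})
log.append({"score": 750})   # credit-score agent seals version 1
log.append({"risk": "low"})  # risk agent seals version 2
```

Rollback is now just "point at version N−1," and the stale-cache race from his war story becomes detectable: a reader holding version 1 knows it is behind version 2.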

Contracts at the Boundary, Not Garbage Three Steps Later

He pairs immutable state with data contracts so agents can’t just fling arbitrary payloads at each other. His example is a research agent handing off findings, confidence score, sources, and timestamps, while a downstream analysis agent rejects the handoff if the confidence is below 0.7 — catching bad data immediately instead of letting “garbage” poison the report several steps later.
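The handoff contract he describes can be sketched as a boundary check: the payload must match a schema, and the downstream agent rejects anything with confidence below 0.7 before doing any work. The exact field names here are assumptions based on his example.

```python
# Schema the research agent's handoff must satisfy (illustrative field names).
REQUIRED_FIELDS = {"findings": str, "confidence": float,
                   "sources": list, "timestamp": str}

def validate_handoff(payload: dict, min_confidence: float = 0.7) -> dict:
    """Reject malformed or low-confidence payloads at the agent boundary."""
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in payload:
            raise ValueError(f"missing field: {name}")
        if not isinstance(payload[name], expected_type):
            raise TypeError(f"{name} must be {expected_type.__name__}")
    if payload["confidence"] < min_confidence:
        raise ValueError(
            f"confidence {payload['confidence']} below threshold {min_confidence}")
    return payload
```

A payload with confidence 0.5 fails here, at the handoff, instead of poisoning a report three steps downstream.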

Designing for Failure: Circuit Breakers and Sagas

Bhaumik then gets into the production-grade recovery patterns: circuit breakers and compensation. Circuit breakers open after repeated failures — say five in a row — so the system fails fast instead of hammering a broken agent, while saga-style compensation means every agent has both execute and compensate methods, allowing the orchestrator to walk backward and undo partial work if a later step fails.
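Both recovery patterns fit in a few lines. Below is a sketch under the assumptions he states: the breaker opens after five consecutive failures, and each saga step pairs an execute with a compensate so partial work can be walked back in reverse order.

```python
class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    """Fail fast after `threshold` consecutive failures instead of hammering
    a broken downstream agent; a success resets the count."""
    def __init__(self, threshold: int = 5) -> None:
        self.threshold = threshold
        self.failures = 0

    def call(self, fn, *args):
        if self.failures >= self.threshold:
            raise CircuitOpenError("circuit open: downstream agent is unhealthy")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0
        return result

def run_saga(steps, state):
    """steps: list of (execute, compensate) pairs. If a later execute fails,
    run the compensations of completed steps in reverse, then re-raise."""
    completed = []
    try:
        for execute, compensate in steps:
            state = execute(state)
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            state = compensate(state)
        raise
    return state
```

Wiring the breaker around each agent call and the saga around the whole workflow is what keeps a single failing step from corrupting everything upstream of it.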

The Databricks Blueprint for Running This in Production

He closes by stitching it all together into a reference architecture: LangGraph plus Mosaic AI Agent Framework for orchestration, Unity Catalog functions or models as agents, Delta Lake for immutable versioned state, and MLflow Traces for inputs, outputs, latency, and token usage. His final point lands hard: the glamorous demo is easy, but the “unsexy infrastructure work” — immutable state, rollback, tracing, circuit breakers — is what keeps systems from failing at 2 a.m. and is what actually creates business value.