AI Engineer · 18m

Lessons from Trillion Token Deployments at Fortune 500s — Alessandro Cappelli, Adaptive ML

TL;DR

  • Cappelli’s core claim is that reinforcement learning is the missing layer between a flashy MVP and a real production system — he says 95% of GenAI pilots never make it to production because prompting and instruction fine-tuning don’t let teams systematically absorb defects and improve behavior over time.

  • RL changes the token economics by letting enterprises get frontier-like performance from smaller models — Cappelli argues RL can match SFT performance with much smaller models, which matters when companies like AT&T spend millions just summarizing customer-agent transcripts.

  • Latency is not a nice-to-have; it’s a production gate — for speech-to-speech customer support, he says anything above roughly 0.5 seconds already feels awkward and the real target is closer to one-third of a second, pushing teams toward smaller models like Gemma, Mistral, and Qwen-class 10B systems.

  • Agents make the production problem much harder, and that’s exactly where RL fits best — because agents consume more tokens, touch databases, and operate in environments, Cappelli frames RL as the natural training method since it was originally built for agent-like behavior in environments.

  • Enterprise agent data often doesn’t exist upfront, but RL can generate it as a byproduct — with an environment and reward function in place, teams can create synthetic trajectories via rejection sampling, bootstrap training data, and use real transcripts to train realistic mock users, including messy edge cases like panicked medical supply customers.

  • Adaptive ML is pitching an 'RL Ops' stack that hides the ugly complexity of training — Cappelli notes PPO can require orchestrating four LLMs at once, so Adaptive Engine packages evaluation, tuning, and serving into one platform with prebuilt recipes instead of asking customers to implement algorithms like PPO or GSPO themselves.

The Breakdown

The real problem isn’t building the demo — it’s surviving the marathon to production

Alessandro Cappelli opens by introducing Adaptive ML as an “RL ops” platform used by enterprises like AT&T, Manulife, and CCS, then makes his main point fast: reinforcement learning isn’t just another post-training trick; it’s the mechanism that actually gets models into production. Drawing on his experience training Falcon three years ago, he says RL was what separated open-source models from proprietary frontier systems.

The “last mile” story is a myth, and that myth kills most GenAI projects

Cappelli says 95% of GenAI pilots fail because teams treat production like a simple final step after the MVP. In his version, the MVP is just the first mile; the hard part is the long loop of retraining and refinement after defects show up in the real world. His critique is blunt: with proprietary models you mostly just tweak the system prompt, and with instruction fine-tuning you keep rebuilding datasets — neither gives you a scientific, monitorable improvement loop.

Why RL wins on cost, speed, and control

He argues RL is disproportionately more effective than prompting or SFT because it can reach similar performance with a much smaller model. That matters in enterprise settings where token costs explode: AT&T, he says, spends millions simply summarizing customer support transcripts. Smaller models also unlock latency targets that are non-negotiable in production — for voice systems, even half a second feels weird, and one-third of a second is the real bar.

Agents raise the stakes, and RL’s original design suddenly looks very relevant

Once the conversation turns to agents, Cappelli says everything gets harder: more tokens, more complexity, less room for error, and direct access to systems and databases. His framing is memorable and simple — RL was built to train robots and agents in environments, so it naturally fits the new world where LLMs have to act, not just answer.

If the agent workflow exists, plug into it; if not, mock the world

He gives Manulife as the example of a company that already had agent workflows, making it possible to drop in a model and train directly inside the existing environment. If the environment doesn’t exist yet, he says you can still mock the tools and even mock the user with another LLM, then define success through business KPIs, LLM-as-judge evaluations, or concrete questions like whether the tone matched business guidelines.
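As a rough illustration of the "mock the world" idea, the sketch below wires a mocked tool, a scripted stand-in for an LLM-driven mock user, and a reward that blends a business KPI with a rubric-style tone check. Everything here is hypothetical and not taken from the talk or from Adaptive ML's product: `MockEnvironment`, `mock_user`, `agent_policy`, the order ID, the banned phrase, and the 0.5/0.5 weighting are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass
class MockEnvironment:
    """Hypothetical stand-in for a customer-support backend."""
    orders: dict = field(default_factory=lambda: {"A100": "shipped"})
    resolved: bool = False

    def lookup_order(self, order_id: str) -> str:
        # Mocked tool: a real deployment would call a database or API here.
        return self.orders.get(order_id, "unknown")


def mock_user(turn: int) -> str:
    # Scripted stand-in for a mock user; in practice this could be another LLM.
    script = ["Where is my order A100?", "Thanks, that's all."]
    return script[min(turn, len(script) - 1)]


def agent_policy(message: str, env: MockEnvironment) -> str:
    # Placeholder for the model being trained.
    if "A100" in message:
        env.resolved = True
        return f"Your order A100 is {env.lookup_order('A100')}."
    return "Happy to help!"


def reward(transcript: list, env: MockEnvironment) -> float:
    # Assumed reward: half business KPI (issue resolved), half tone rubric.
    kpi = 1.0 if env.resolved else 0.0
    agent_turns = [text for speaker, text in transcript if speaker == "agent"]
    tone = 1.0 if all("calm down" not in t.lower() for t in agent_turns) else 0.0
    return 0.5 * kpi + 0.5 * tone


def run_episode(max_turns: int = 2):
    # Roll out one conversation and score it.
    env = MockEnvironment()
    transcript = []
    for turn in range(max_turns):
        user_msg = mock_user(turn)
        transcript.append(("user", user_msg))
        transcript.append(("agent", agent_policy(user_msg, env)))
    return transcript, reward(transcript, env)


if __name__ == "__main__":
    transcript, score = run_episode()
    for speaker, text in transcript:
        print(f"{speaker}: {text}")
    print("reward:", score)
```

In a real setup the tone check would more plausibly be an LLM-as-judge call and the KPI would come from the business system; the point of the sketch is only that the environment, the user, and the success criterion can each be swapped for a mock.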

RL quietly solves the “we don’t have the data” problem by generating it

Cappelli says one of the biggest customer objections is lack of training data, especially for agents, because there’s no giant web-scraped dataset of tool-using enterprise agents. His workaround is elegant: once you have an environment and reward function, you’ve effectively built a synthetic data pipeline, producing trajectories you can filter with rejection sampling and use to bootstrap training. He adds a very human touch here, noting that real enterprise transcripts can train mock users to behave like actual customers — including people who repeat themselves, get annoyed, or call a medical supply company in a panic.
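The data-generation loop described above can be sketched as rejection sampling: roll out several candidate trajectories per prompt, score each with the reward function, and keep only those that clear a threshold. This is a minimal sketch under stated assumptions, not Adaptive ML's pipeline: `sample_trajectory` is a random stub standing in for an agent rollout, and the reward shape, step penalty, and threshold are invented for illustration.

```python
import random


def sample_trajectory(prompt: str, rng: random.Random) -> dict:
    # Hypothetical stand-in for rolling out the agent in an environment.
    return {
        "prompt": prompt,
        "steps": rng.randint(1, 5),
        "solved": rng.random() < 0.5,
    }


def reward(traj: dict) -> float:
    # Assumed reward: task success minus a small penalty per step taken.
    return (1.0 if traj["solved"] else 0.0) - 0.05 * traj["steps"]


def rejection_sample(prompts, n_samples: int = 8, threshold: float = 0.5, seed: int = 0):
    """Keep only sampled trajectories whose reward meets the threshold."""
    rng = random.Random(seed)
    kept = []
    for prompt in prompts:
        candidates = [sample_trajectory(prompt, rng) for _ in range(n_samples)]
        kept.extend(t for t in candidates if reward(t) >= threshold)
    return kept


if __name__ == "__main__":
    prompts = ["reset my password", "track my order", "cancel my plan"]
    dataset = rejection_sample(prompts)
    print(f"kept {len(dataset)} of {len(prompts) * 8} sampled trajectories")
```

The kept trajectories would then serve as bootstrap training data, for example for a supervised warm-up before further RL.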

Human feedback matters, but he doesn’t want annotation campaigns to be the bottleneck

He pushes back on the romanticized version of RLHF, saying “human in the loop” often really means expensive annotation work that nobody wants to run. In his setup, humans mainly define the rubrics, the prompts for the LLM judges, and the scenarios; the heavy lifting comes from systematic rewards, business metrics like CCS’s containment rate, and LLM judges. In the Q&A, he says small amounts of human feedback are first used to improve the judges; then, once production yields thousands of signals, Adaptive ML trains reward models to scale that feedback.

Adaptive ML’s pitch: make RL usable without forcing teams to become RL researchers

The talk ends on the practical catch: RL is hard. Cappelli points out that PPO can require orchestrating four LLMs at the same time, which is exactly why Adaptive built the Adaptive Engine — a platform that combines observation, training, evaluation, and serving, while exposing prebuilt recipes instead of making customers implement algorithms from scratch. His closing message is consistent with the whole talk: RL is the algorithm that industrializes the path from model experiment to production system.
