AI Engineer · 25m

How Google DeepMind Runs Agents at Scale — KP Sawhney & Ian Ballantyne, Google DeepMind

TL;DR

  • DeepMind’s internal agent harness is already more than a coding chat UI — Ian Ballantyne showed Antigravity as an IDE-integrated agent manager that can spawn multiple agents, inspect the DOM, control a browser, generate plans and to-dos, and return artifacts like screenshots, videos, and end-of-run reports.

  • The real scaling problem is token burn, not just model quality — KP Sawhney said the top issue at Google scale is how “token hungry” agent systems are, which forces strict per-user and per-team quota management and pushes teams toward mixing cheaper models like Gemma 4 with premium models only where they matter.

  • DeepMind wants agents to collaborate through shared workspaces instead of passing giant context blobs — Sawhney described rethinking deep research so pipeline components work like human collaborators in a shared filesystem, which could cut cost, reduce context pressure, and unlock richer outputs like infographics and support docs.

  • Observability is custom and deeply hierarchical because agent failures are weird — internally, Google uses a custom web app that lets teams drill from a user query down to raw model predict requests, plus an “agent trajectory store” to pinpoint where looping or derailment began in long coding runs.

  • Google’s internal agent ecosystem is being managed almost like natural selection — Sawhney said DeepMind is building a huge library of reusable “skills,” but in an org as large as Google the challenge is preventing sprawl so only the best skills survive in what he called an almost “Darwinian” process.

  • Even inside Google, power users still hit the wall — both speakers were candid that today’s answer to runaway usage is often brute-force quota enforcement, with one-user-takes-down-the-system risk, SRE intervention, and a likely future where harnesses automatically downgrade across models instead of silently stalling when limits are hit.

The Breakdown

Antigravity opens with a live demo — and some very real Wi-Fi chaos

Ian Ballantyne kicks things off by showing that Antigravity isn’t just a VS Code-style interface: behind it sits a full agent manager that can run multiple agents across projects. The demo briefly faceplants on connectivity, prompting a joking plug to “use the Gemini models instead,” but once it recovers, the system starts analyzing a spec, building a plan, and taking over a browser to validate the result.

The harness acts like a junior engineer you can interrupt mid-task

What stands out in the demo is how much scaffolding sits around the model: built-in to-dos, planning, scratchpad notes, DOM inspection, and a final report with what was or wasn’t accomplished. Ian emphasizes the human-in-the-loop part — you can edit the generated plan, correct the agent’s interpretation, then hit proceed instead of just hoping the black box got it right.

KP’s job: turn a successful research agent into infrastructure that works everywhere

Sawhney says he worked on Deep Research, now available through the Interactions API, but his focus has shifted to making the Antigravity harness robust enough for broad internal use. In practice that means supporting Google’s massive monorepo for coding today while generalizing the same harness to other workflows, including research tasks where shared files may work better than shuttling huge text payloads between steps.
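
To make the shared-files idea concrete, here is a minimal sketch in which two pipeline steps exchange a file path rather than the text itself. Everything in it is hypothetical: the function names, the file names, and the placeholder findings are illustrative assumptions, not details of the Antigravity harness or Deep Research.

```python
import tempfile
from pathlib import Path


def research_step(workspace: Path, query: str) -> Path:
    """Hypothetical first step: write notes into the shared workspace."""
    notes = workspace / "notes.md"
    # Placeholder findings; a real step would write model output here.
    notes.write_text(
        f"# Notes for: {query}\n- finding A\n- finding B\n", encoding="utf-8"
    )
    return notes  # only the path is handed to the next step


def writing_step(workspace: Path, notes_path: Path) -> Path:
    """Hypothetical second step: read the notes file and write a report."""
    lines = notes_path.read_text(encoding="utf-8").splitlines()
    findings = [line[2:] for line in lines if line.startswith("- ")]
    report = workspace / "report.md"
    report.write_text("Summary: " + "; ".join(findings) + "\n", encoding="utf-8")
    return report


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ws = Path(tmp)
        report_path = writing_step(ws, research_step(ws, "example query"))
        print(report_path.read_text(encoding="utf-8"))
```

The message passed between steps stays a short path no matter how large the notes grow, which is the context-pressure argument in miniature.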

Inside Google, the “skills” layer is exploding — and needs discipline

Asked how people at DeepMind use agents now, Sawhney says there’s a big push to build a library of reusable skills that help people do their jobs faster. But he’s equally blunt about the downside: in an organization this large, skills can sprawl out of control, so the team is trying to improve them and let only the strongest survive in an “almost Darwinian” way.

At Google scale, the bottleneck is quota, compute, and runaway power users

The conversation shifts from flashy demos to operational pain: agent systems are incredibly token hungry, and quota management per user and per team is critical. Sawhney says evaluation is also expensive, so they’re looking at things like mock TPUs to test harness behavior without burning real TPU hours, while Ian frames the practical fear clearly: what happens when one power user spins up 100 agents and starts melting the system?
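
As a rough illustration of per-user and per-team quota checks, the sketch below charges each request against both budgets and rejects it if either would be exceeded. The `QuotaLedger` class, its field names, and the limits are assumptions made for this example; the talk does not describe how Google's quota system is implemented.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class QuotaLedger:
    """Hypothetical ledger tracking token spend per user and per team."""

    user_limits: Dict[str, int]
    team_limits: Dict[str, int]
    user_team: Dict[str, str]
    user_spent: Dict[str, int] = field(default_factory=dict)
    team_spent: Dict[str, int] = field(default_factory=dict)

    def try_charge(self, user: str, tokens: int) -> bool:
        """Record the spend and return True only if both budgets allow it."""
        team = self.user_team[user]
        new_user = self.user_spent.get(user, 0) + tokens
        new_team = self.team_spent.get(team, 0) + tokens
        if new_user > self.user_limits[user] or new_team > self.team_limits[team]:
            return False  # rejected: nothing is recorded
        self.user_spent[user] = new_user
        self.team_spent[team] = new_team
        return True


if __name__ == "__main__":
    ledger = QuotaLedger(
        user_limits={"alice": 1000, "bob": 1000},
        team_limits={"research": 1500},
        user_team={"alice": "research", "bob": "research"},
    )
    print(ledger.try_charge("alice", 900))  # within both budgets
    print(ledger.try_charge("bob", 700))    # would exceed the team budget
```

Checking the team budget alongside the user budget is what keeps one heavy user from consuming capacity that teammates are counting on.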

The current answer is blunt: quota first, pricing later

Sawhney admits the immediate answer is often brute force — some internal users simply get told to stop. He connects that to a broader industry issue, pointing to Anthropic blocking OpenClaw-style usage and arguing that flat subscriptions don’t map well to token-intensive agent behavior.

Debugging agents requires custom observability all the way down to raw predicts

On observability, Sawhney says Google built a custom UI around a shared internal agent backend. Teams can drill from a user query through every layer of the system down to raw predict requests, and for coding they also maintain an “agent trajectory store” to inspect long action sequences and catch the exact moment a model starts looping or goes off the rails.
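
One simple way to find where looping began in a stored action sequence is to search for the earliest point at which a short pattern of actions repeats back to back. The sketch below does that; the function name, the action labels, and the thresholds are hypothetical, and nothing here reflects how the internal trajectory store actually works.

```python
from typing import Optional, Sequence


def find_loop_start(
    actions: Sequence[str], max_period: int = 4, min_repeats: int = 3
) -> Optional[int]:
    """Return the earliest index where a pattern of up to `max_period`
    actions repeats consecutively at least `min_repeats` times, else None."""
    n = len(actions)
    for start in range(n):
        for period in range(1, max_period + 1):
            if start + period * min_repeats > n:
                break  # not enough actions left for this or longer periods
            pattern = list(actions[start:start + period])
            repeated = all(
                list(actions[start + k * period:start + (k + 1) * period]) == pattern
                for k in range(1, min_repeats)
            )
            if repeated:
                return start
    return None


if __name__ == "__main__":
    trajectory = ["read", "edit", "test", "edit", "test", "edit", "test", "done"]
    print(find_loop_start(trajectory))
```

A real trajectory would carry arguments and outputs for each action, so exact string equality is a simplification; a fuzzier comparison would likely be needed in practice.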

The future they hint at: collaborative agents, seamless failover, and agent review of agent code

In the final stretch, both speakers sketch what’s next: deep research reworked as collaborating agents in a workspace, agent-to-agent communication that humans supervise like a “digital assembly line,” and harnesses that fail over between premium, flash, and local models without interrupting work when quota runs out. They end on code review, noting Google already has per-language auto-review models tuned to internal style guides — and that agent-generated PR comments are already landing with useful suggestions before humans even ask.
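
The failover idea can be sketched as an ordered list of model tiers that the harness walks until one accepts the request. The tier names follow the talk (premium, flash, local), but the `QuotaExhausted` exception, the `run_with_failover` function, and the fake model callables are assumptions for illustration only, not a real API.

```python
from typing import Callable, List, Tuple


class QuotaExhausted(Exception):
    """Hypothetical error a model client raises when its quota is used up."""


def run_with_failover(
    prompt: str, tiers: List[Tuple[str, Callable[[str], str]]]
) -> Tuple[str, str]:
    """Try each tier in order and return (tier_name, response) from the
    first one that does not raise QuotaExhausted."""
    for name, call in tiers:
        try:
            return name, call(prompt)
        except QuotaExhausted:
            continue  # fall through to the next, cheaper tier
    raise QuotaExhausted("all tiers exhausted")


def exhausted_model(prompt: str) -> str:
    """Stand-in for a model whose quota has run out."""
    raise QuotaExhausted("premium quota used up")


def echo_model(prompt: str) -> str:
    """Stand-in for a model that still has quota."""
    return f"answer to: {prompt}"


if __name__ == "__main__":
    tiers = [("premium", exhausted_model), ("flash", echo_model), ("local", echo_model)]
    print(run_with_failover("summarize the spec", tiers))
```

Returning the tier name with the response lets the caller record which model actually served the request, which matters when output quality differs across tiers.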
