How I AI | June 15, 2026 | 40m

How this startup uses AI agents to eliminate bugs and optimize infrastructure

TL;DR

Agents can out-benchmark staff engineers on hard infrastructure work: Ankur says no human is manually running as many rigorous benchmarks as an agent, and Braintrust used continuous experiments across index types, column stores, and execution engines to improve query performance on billions of traces over 90 days.
The key is defining success, not dictating implementation: Instead of hand-picking every algorithm, Braintrust reproduces slow real-world queries, sets tests and success criteria, and lets coding agents explore ideas from database literature like Bloom filters, Tantivy alternatives, and column store formats.
"The agent line" is becoming a management skill: Ankur’s test is simple. If the information in a meeting could be given to an agent to solve the same problem, that problem belongs below the agent line. That is why he avoids meetings after noon and keeps 4 to 6 foreground agents running in parallel.
Evals are the modern PRD: Ankur frames evals as a shift from telling systems how to do something to specifying what success looks like, then quantifying it with examples, scoring functions, and repeated runs instead of relying only on one-off vibe checks.
Taste can be systematized without replacing the expert: Braintrust uses a designer named David as the final vibe-checker, then translates his feedback into better scorers so his taste applies to more outputs, raising the quality bar instead of making him less valuable.
If agents are creating chaos, improve CI and reset bad sessions: Braintrust spends more time on CI so that faster shipping does not mean more broken code. When an agent goes off track, Ankur’s default move is blunt: close the session, improve the eval, and start fresh.
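
To make the "evals are the modern PRD" point concrete, here is a minimal sketch of the three ingredients named above: examples, a scoring function, and repeated runs. It is plain Python, not Braintrust's SDK. Every name in it (`EXAMPLES`, `exact_match`, `toy_task`, `run_eval`) is hypothetical, and the toy task stands in for whatever system is actually under test.

```python
import statistics
from typing import Callable, Dict, List

# Hypothetical examples: each pairs an input with the output we count as success.
EXAMPLES: List[Dict[str, str]] = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]


def exact_match(output: str, expected: str) -> float:
    """Scoring function: 1.0 for a case-insensitive match, otherwise 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def toy_task(prompt: str) -> str:
    """Stand-in for the system under test (assumption: a real task would call a model)."""
    answers = {"2 + 2": "4", "capital of France": "paris"}
    return answers.get(prompt, "")


def run_eval(
    task: Callable[[str], str],
    examples: List[Dict[str, str]],
    scorer: Callable[[str, str], float],
    runs: int = 3,
) -> Dict[str, float]:
    """Score every example on each run, then summarize across runs."""
    per_run = []
    for _ in range(runs):
        scores = [scorer(task(ex["input"]), ex["expected"]) for ex in examples]
        per_run.append(statistics.mean(scores))
    # Reporting the worst run alongside the mean surfaces flaky behavior
    # that a single one-off check would miss.
    return {"mean": statistics.mean(per_run), "worst_run": min(per_run)}


result = run_eval(toy_task, EXAMPLES, exact_match)
print(result)
```

The success criterion lives in the examples and the scorer, so the task can be swapped for any implementation without touching the definition of "good".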

The Breakdown

A week of nonstop agent-run experiments led Braintrust to a Bloom filter index that sped up hard database queries, and CEO Ankur Goyal argues there is now "no excuse to not have rigor" in engineering. The bigger idea is that coding agents plus strong evals let teams tackle infrastructure, product quality, and even taste-driven design work at a scale humans rarely sustain manually.
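
For readers unfamiliar with the data structure, the sketch below shows the general Bloom filter technique and why it can speed up lookups: a small bit array per storage segment lets a query skip segments that definitely do not contain a value. This is a generic illustration, not Braintrust's implementation. The class, the segment names, and the trace IDs are all hypothetical.

```python
import hashlib
from typing import Dict, Iterator, List


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


# Hypothetical storage segments, each holding some trace IDs.
segments: Dict[str, List[str]] = {
    "segment-a": ["trace-1", "trace-2"],
    "segment-b": ["trace-3"],
}

# Build one filter per segment at write time.
filters: Dict[str, BloomFilter] = {}
for name, trace_ids in segments.items():
    bloom = BloomFilter()
    for trace_id in trace_ids:
        bloom.add(trace_id)
    filters[name] = bloom


def candidate_segments(trace_id: str) -> List[str]:
    """Return only the segments that still need to be scanned for trace_id."""
    return [name for name, bloom in filters.items() if bloom.might_contain(trace_id)]


print(candidate_segments("trace-3"))
```

Because a false positive is possible, a returned segment still has to be scanned to confirm the match. The saving comes from the segments the filter rules out.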