Braintrust CEO: Evals are the new PRD for AI products
TL;DR
Evals are the new PRD: Braintrust CEO Ankur Goyal says AI shifts programming from defining the how to defining the what, so the core product job becomes writing quantifiable success criteria and examples instead of prose-only specs.
Agents can handle serious systems work, not just CRUD apps: At Braintrust, agents run continuous experiments on database internals like Bloom filters, Tantivy column stores, EC2-to-S3 latency, and query patterns across billions of traces over 90 days.
The real quality gain is rigor, not magic code generation: Goyal argues no human staff engineer will manually run as many benchmarks, compare as many algorithms, or keep testing edge cases as relentlessly as an agent guided by strong evals.
Taste scales when you encode it: Braintrust uses evals to capture designer David's judgment, then applies his quality bar across more outputs so his taste matters more, not less.
Maker time matters more in the agent era: Goyal keeps mornings for meetings, afternoons for coding, and runs roughly four to six foreground agents in parallel through tmux sessions, with heavier experiments offloaded to remote compute.
If the agent is flailing, fix the eval and restart: His default recovery move is not to argue with the model but to close the session, improve the scoring setup, and try again from scratch, especially after watching a vibe-coded 3,000-line eval script turn into junk.
The Breakdown
Braintrust CEO Ankur Goyal argues that evals have become the new PRD for AI products, and that coding agents are already good enough to tackle gnarly database and infrastructure work if you define success rigorously. His case is blunt: there is now "no excuse to not have rigor" when an agent can spend days benchmarking Bloom filters, column stores, and latency tradeoffs that no staff engineer would test by hand.
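To make "define success rigorously" concrete, here is a minimal sketch of an eval written as a spec: a set of examples, a scorer, and a pass threshold. It is not Braintrust's API or Goyal's own setup. Every name in it (`Example`, `exact_match`, `run_eval`, `toy_task`) and the 0.6 threshold are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    """One input paired with the output we consider correct."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Hypothetical scorer: 1.0 if the output equals the expected text, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_eval(
    task: Callable[[str], str],
    examples: Sequence[Example],
    scorer: Callable[[str, str], float],
    threshold: float,
) -> dict:
    """Score the task on every example and compare the mean to the threshold."""
    if not examples:
        raise ValueError("an eval needs at least one example")
    scores = [scorer(task(ex.input), ex.expected) for ex in examples]
    mean_score = sum(scores) / len(scores)
    return {
        "scores": scores,
        "mean_score": mean_score,
        "passed": mean_score >= threshold,
    }


def toy_task(text: str) -> str:
    """Stand-in for a model or agent call (assumption: the real task goes here)."""
    return text.upper()


# The examples and the threshold together play the role of the written spec.
EXAMPLES = [
    Example(input="ok", expected="OK"),
    Example(input="done", expected="DONE"),
    Example(input="n/a", expected="NONE"),  # an edge case the toy task gets wrong
]

report = run_eval(toy_task, EXAMPLES, exact_match, threshold=0.6)
print(report)
```

In this framing, tightening the product requirement means adding examples or raising the threshold, and a flailing agent is a cue to revisit the examples and scorer before retrying.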