AI Engineer · 18m

The maturity phases of running evals — Phil Hetzel, Braintrust

TL;DR

  • Start with vibes, but document the reasons — Hetzel says early evals can absolutely begin as manual “vibe checks,” as long as a human annotator or SME records not just thumbs up/down but the justification behind each judgment.

  • Evals are about failure modes, not exhaustive coverage — Unlike unit tests, agent evals should target the most important ways an agent can fail: the space of possible failures is effectively infinite, and trying to enumerate all of it kills shipping velocity.

  • Production traces are the gold-standard eval dataset — Hetzel argues teams should stop thinking of evals as synthetic tests and instead “rerun production,” pulling in real traces or at least UAT-level interactions to measure quality against actual usage.

  • LLM-as-judge is useful, but it also needs evals — Braintrust uses LLM judges to scale human expertise, but Hetzel warns that “putting a robe and cloak on an LLM” does not make it trustworthy; you still need ground-truth datasets and validation against human decisions.

  • Tool-using agents force you to evaluate whole traces, not just outputs — Once agents call APIs, databases, MCPs, or CRUD systems, the evaluation problem expands to system state, tool-call behavior, token and cost constraints, and whether offline replay can safely simulate the original environment.

  • The next frontier is automatic failure discovery — Hetzel points to topic modeling over production traces and CLI-driven automated eval workflows as emerging patterns for finding new failure modes and operationalizing evals continuously.
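To make the first and fourth points concrete, here is a minimal Python sketch of recording a human judgment together with its justification, then measuring how often an LLM judge agrees with those human decisions. The names `Annotation`, `judge_agreement`, and `disagreements` are hypothetical and are not Braintrust's API; the judge's verdicts are assumed to arrive as a plain mapping from trace ID to pass/fail.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """One human judgment on a trace (hypothetical schema)."""
    trace_id: str
    thumbs_up: bool
    reason: str  # the justification, not just the verdict


def judge_agreement(human, judge_verdicts):
    """Fraction of human-labeled traces where the judge's verdict matches."""
    shared = [a for a in human if a.trace_id in judge_verdicts]
    if not shared:
        raise ValueError("no traces labeled by both human and judge")
    matches = sum(1 for a in shared if judge_verdicts[a.trace_id] == a.thumbs_up)
    return matches / len(shared)


def disagreements(human, judge_verdicts):
    """Annotations the judge contradicts; their reasons are worth reading."""
    return [
        a for a in human
        if a.trace_id in judge_verdicts and judge_verdicts[a.trace_id] != a.thumbs_up
    ]


# Illustrative data only, invented for this sketch.
human_labels = [
    Annotation("t1", True, "answered the question with the right figures"),
    Annotation("t2", False, "cited a policy that does not exist"),
    Annotation("t3", False, "ignored the user's stated constraint"),
    Annotation("t4", True, "concise and correct"),
]
judge = {"t1": True, "t2": False, "t3": True, "t4": True}

agreement = judge_agreement(human_labels, judge)
missed = disagreements(human_labels, judge)
```

A low agreement rate suggests the judge prompt needs work before its scores are trusted, and the recorded reasons on the disagreeing traces indicate what the judge is missing.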

The Breakdown

“Think about evals like rerunning production” is Phil Hetzel’s core advice: the path from vibe checks to mature agent evals starts with human thumbs-up/thumbs-down judgments, then scales into LLM judges, trace-level analysis, and production-derived datasets. His bigger point is that evals and observability are really the same system viewed at different times—before launch to gain confidence, and after launch to keep it.
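As an illustration of trace-level analysis, the sketch below scores a whole trace rather than only its final output. It assumes a simplified, hypothetical trace shape (`Trace`, `ToolCall`) and a hypothetical `check_trace` function; real production traces carry far more detail, and the tool allowlist and token budget here are made-up parameters.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    ok: bool  # whether the call succeeded


@dataclass
class Trace:
    """Simplified stand-in for a production trace (hypothetical shape)."""
    output: str
    tool_calls: list = field(default_factory=list)
    total_tokens: int = 0


def check_trace(trace, allowed_tools, max_tokens):
    """Return a list of failure descriptions; an empty list means the trace passes."""
    failures = []
    for call in trace.tool_calls:
        if call.name not in allowed_tools:
            failures.append(f"unexpected tool: {call.name}")
        elif not call.ok:
            failures.append(f"tool call failed: {call.name}")
    if trace.total_tokens > max_tokens:
        failures.append(f"token budget exceeded: {trace.total_tokens} > {max_tokens}")
    if not trace.output.strip():
        failures.append("empty final output")
    return failures


# Illustrative trace, invented for this sketch.
trace = Trace(
    output="Your order has been refunded.",
    tool_calls=[ToolCall("lookup_order", True), ToolCall("delete_account", True)],
    total_tokens=5200,
)
failures = check_trace(trace, allowed_tools={"lookup_order", "issue_refund"}, max_tokens=4000)
```

Here the final output looks fine on its own, yet the trace fails on two counts, which is the case an output-only eval would miss.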
