
Playbook
Tasteful Skills
“Tasteful Skills” argues that the best agent skills are not documentation or best-practice lists.
Start with vibes, but document the reasons — Hetzel says early evals can absolutely begin as manual “vibe checks,” as long as a human annotator or SME records not just thumbs up/down but the justification behind each judgment.
Evals are about failure modes, not exhaustive coverage — Unlike unit tests, agent evals should target the most important ways an agent can fail. The space of possible failures is effectively infinite, and trying to enumerate all of it kills shipping velocity.
Production traces are the gold-standard eval dataset — Hetzel argues teams should stop thinking of evals as synthetic tests and instead “rerun production,” pulling in real traces or at least UAT-level interactions to measure quality against actual usage.
LLM-as-judge is useful, but it also needs evals — Braintrust uses LLM judges to scale human expertise, but Hetzel warns that “putting a robe and cloak on an LLM” does not make it trustworthy; you still need ground-truth datasets and validation against human decisions.
Tool-using agents force you to evaluate whole traces, not just outputs — Once agents call APIs, databases, MCP servers, or CRUD systems, the evaluation problem expands to system state, tool-call behavior, token and cost constraints, and whether offline replay can safely simulate the original environment.
The next frontier is automatic failure discovery — Hetzel points to topic modeling over production traces and CLI-driven automated eval workflows as emerging patterns for finding new failure modes and operationalizing evals continuously.
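The takeaways on failure modes and whole-trace evaluation can be illustrated with a small sketch. The Python below is hypothetical and is not Braintrust's API: the `Trace` and `ToolCall` shapes, the check names, the token budget, and the forbidden-tool list are all assumptions made for the example. It scores recorded traces against a short list of named failure modes instead of attempting exhaustive coverage.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str  # hypothetical tool identifier, e.g. "db.read"
    ok: bool   # whether the tool reported success


@dataclass
class Trace:
    output: str        # final answer shown to the user
    total_tokens: int  # tokens consumed across the whole run
    tool_calls: list[ToolCall] = field(default_factory=list)


# Assumed limits, chosen only for illustration.
MAX_TOKENS = 4000
FORBIDDEN_TOOLS = frozenset({"db.delete"})


def check_trace(trace: Trace) -> dict[str, bool]:
    """Return one flag per named failure mode; True means the check passed."""
    return {
        "within_token_budget": trace.total_tokens <= MAX_TOKENS,
        "no_forbidden_tools": all(
            call.name not in FORBIDDEN_TOOLS for call in trace.tool_calls
        ),
        "no_failed_tool_calls": all(call.ok for call in trace.tool_calls),
        "non_empty_output": bool(trace.output.strip()),
    }


def failure_rates(traces: list[Trace]) -> dict[str, float]:
    """Fraction of traces that fail each check, across a set of traces."""
    if not traces:
        return {}
    results = [check_trace(t) for t in traces]
    return {
        name: sum(not r[name] for r in results) / len(results)
        for name in results[0]
    }
```

Feeding `failure_rates` a list of traces exported from production (or from UAT sessions) gives a per-failure-mode rate that can be tracked before and after a change. Checks that inspect live system state would need a safe replay environment, which this sketch does not attempt.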
“Think about evals like rerunning production” is Phil Hetzel’s core advice. The path from vibe checks to mature agent evals starts with human thumbs-up/thumbs-down judgments, then scales into LLM judges, trace-level analysis, and production-derived datasets. His bigger point is that evals and observability are really the same system viewed at different times: before launch to gain confidence, and after launch to keep it.
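One way to picture the step from human judgments to LLM judges is to score the judge against the human labels before trusting it. The sketch below is an assumption-laden illustration, not a method described in the interview: the `LabeledExample` fields are hypothetical, and Cohen's kappa is simply one common agreement statistic swapped in for the example. It keeps the annotator's written reason next to each verdict so disagreements can be reviewed.

```python
from dataclasses import dataclass


@dataclass
class LabeledExample:
    example_id: str
    human_pass: bool   # the annotator's thumbs up / thumbs down
    human_reason: str  # the justification recorded with the verdict
    judge_pass: bool   # the LLM judge's verdict on the same example


def agreement_report(examples: list[LabeledExample]) -> dict[str, float]:
    """Observed agreement and Cohen's kappa between human and judge verdicts."""
    if not examples:
        raise ValueError("need at least one labeled example")
    n = len(examples)
    observed = sum(e.human_pass == e.judge_pass for e in examples) / n
    p_human = sum(e.human_pass for e in examples) / n
    p_judge = sum(e.judge_pass for e in examples) / n
    # Agreement expected by chance if both raters labeled independently.
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        # Kappa is undefined when both raters only ever use a single label.
        kappa = float("nan")
    else:
        kappa = (observed - expected) / (1 - expected)
    return {"observed_agreement": observed, "cohens_kappa": kappa}


def disagreements(examples: list[LabeledExample]) -> list[LabeledExample]:
    """Examples where the judge and the human differ, kept for manual review."""
    return [e for e in examples if e.human_pass != e.judge_pass]
```

Reading the `human_reason` on each disagreement is what turns a raw agreement number into a revision of the judge prompt or of the labeling guidelines.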