AI Engineer · 1h 20m

Playground in Prod - Optimising Agents in Production Environments — Samuel Colvin, Pydantic

TL;DR

  • Samuel Colvin’s core pitch is that “AI observability” won’t stay a standalone category — he says Logfire is really a general observability platform with logs, metrics, traces, evals, and managed variables, even if customers currently buy it as “AI observability.”

  • GEPA is basically prompt optimization by evolutionary search, not magic: Colvin explains it as breeding the best racehorses. Keep the strongest prompts on the Pareto frontier, mix them, test them, and iterate toward better strings or JSON configs.

  • A better prompt moved his MP-family-dynasty extraction task from roughly 87% to 92%, and GEPA pushed it to 96.7%. The task was extracting politicians’ ancestor relations from MPs’ Wikipedia pages, with the main failure mode being models incorrectly including spouses, children, and non-politician public figures.

  • The hard part isn’t running evals — it’s deciding what “correct” means — Colvin repeatedly comes back to the need for deterministic judges or golden datasets, warning that LLM-as-a-judge can become “the lunatics running the asylum.”

  • Managed variables let you change prompts, models, and other agent config in production without redeploying — he demoed swapping an app’s reply language from English to French to German, and even switching the model from Anthropic Sonnet 4.5 to OpenAI GPT-4.1 live through Logfire.

  • Prompt optimization matters most when the data is private or domain-specific, not when frontier models already know the task — Colvin argues that public demo tasks are actually bad examples, because real value shows up when you need a cheaper or smaller model to work on internal bank specs, invoice corpora, or other data absent from pretraining.

The Breakdown

Why Pydantic Is Now More Than Validation

Samuel Colvin opens by reminding the room he’s the creator of Pydantic, but quickly broadens the frame: today the company maintains Pydantic Validation, Pydantic AI, and Logfire. His sharpest early take is that he doesn’t really believe in “AI observability” as a lasting category — it’s just observability with some AI-specific features like evals and managed variables.

GEPA, Explained Like Horse Breeding

He introduces GEPA as a genetic optimization library that searches for better prompts or config strings. The memorable metaphor is racehorses: you don’t breed the slow horse back in, you keep mixing the best performers from the Pareto frontier and hope to get something faster. He’s candid that even he only “sort of” understands it, which sets the tone for a practical demo that sticks to what he can actually show.
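
The racehorse metaphor can be made concrete with a toy sketch. This is an illustration under stated assumptions, not the library's actual algorithm or API: each prompt has one score per eval case, the frontier keeps every prompt that no other prompt beats across the board, and `crossover` is a hypothetical mixing step that picks each line from one of two parents. The scores are made up.

```python
import random


def dominates(a, b):
    """True if score vector a is at least as good as b on every case and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(scores):
    """Return the names of candidates that no other candidate dominates.

    scores maps a candidate name to its list of per-case scores.
    """
    return sorted(
        name
        for name, vec in scores.items()
        if not any(dominates(other, vec) for o, other in scores.items() if o != name)
    )


def crossover(parent_a, parent_b, rng):
    """Hypothetical mixing step: take each line from one parent or the other."""
    lines_a, lines_b = parent_a.splitlines(), parent_b.splitlines()
    child = []
    for i in range(max(len(lines_a), len(lines_b))):
        source = rng.choice((lines_a, lines_b))
        if i < len(source):  # the chosen parent may be shorter
            child.append(source[i])
    return "\n".join(child)


# Made-up per-case scores for three prompts on three eval cases.
scores = {
    "prompt_a": [1.0, 0.0, 1.0],
    "prompt_b": [0.0, 1.0, 1.0],
    "prompt_c": [0.0, 0.0, 1.0],  # never better than a or b: the slow horse
}
frontier = pareto_frontier(scores)
```

Here `prompt_a` and `prompt_b` each win on a case the other loses, so both stay on the frontier, while `prompt_c` is dropped. Keeping a frontier rather than a single winner preserves prompts that are strong on different cases, which gives the mixing step something useful to combine.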

The MP Dynasty Problem He Actually Sent to a Podcast

The task comes from a real side project: Colvin used Pydantic AI to scan Wikipedia pages for UK MPs and estimate how many came from political families after hearing the question discussed on The Rest Is Politics. He got an answer around 24%, sent it in, and the show read his question — but he never really checked whether the agent had done a good job, so this talk becomes a belated audit.

Structured Outputs Work Shockingly Well — Until Relations Get Tricky

The agent takes cleaned Wikipedia text and returns structured political relations, similar in shape to extracting invoice lines or addresses from documents. Colvin says even “relatively dumb models” are surprisingly good, but they consistently stumble on two things: whether someone is truly a politician versus just a public figure, and whether a relation is an ancestor rather than a spouse, child, or sibling.
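
The output shape and the two failure modes might look something like the sketch below. It uses standard-library dataclasses as a stand-in so it runs without dependencies; in the talk's stack the schema would presumably be a Pydantic model. All names here (`Kinship`, `PoliticalRelation`, `dynasty_relations`) are hypothetical, the sample data is invented, and treating aunts and uncles as in scope is an assumption drawn from the Q&A.

```python
from dataclasses import dataclass
from enum import Enum


class Kinship(str, Enum):
    PARENT = "parent"
    GRANDPARENT = "grandparent"
    AUNT_OR_UNCLE = "aunt_or_uncle"
    SPOUSE = "spouse"
    CHILD = "child"
    SIBLING = "sibling"


# Assumption: the kinship kinds that count toward a "political family".
IN_SCOPE_KINDS = {Kinship.PARENT, Kinship.GRANDPARENT, Kinship.AUNT_OR_UNCLE}


@dataclass(frozen=True)
class PoliticalRelation:
    name: str
    kinship: Kinship
    is_politician: bool  # held political office, not merely a public figure


def dynasty_relations(relations):
    """Keep only relations that pass both checks the models tend to get wrong."""
    return [r for r in relations if r.is_politician and r.kinship in IN_SCOPE_KINDS]


# Invented extraction result for one MP.
relations = [
    PoliticalRelation("Relative A", Kinship.PARENT, True),
    PoliticalRelation("Relative B", Kinship.SPOUSE, True),        # wrong kinship
    PoliticalRelation("Relative C", Kinship.GRANDPARENT, False),  # public figure only
    PoliticalRelation("Relative D", Kinship.AUNT_OR_UNCLE, True),
]
kept = dynasty_relations(relations)
```

Making kinship an enum and politician status an explicit boolean turns both judgment calls into fields that a deterministic check can inspect, rather than leaving them buried in free text.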

Evals in Logfire: Useful, Messy, and Better Than Vibes

He walks through loading a golden dataset of MP relations, running evals in parallel, and inspecting accuracy inside Logfire. The first prompt lands around 85%; in a separate comparison run, a more detailed “expert” prompt scores roughly 92% against 87% for the basic one. His blunt warning: custom deterministic evaluators are far better than LLM-as-a-judge, which he jokes can become “the lunatics running the asylum.”
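
A deterministic evaluator of the kind he recommends can be very small. The sketch below is an assumption about how one might score this task, not the evaluator from the talk: a case counts as correct only when the predicted set of relatives matches the golden set after light name normalisation. The function names and the data are invented.

```python
def normalise(name):
    """Lower-case and collapse whitespace so trivial formatting differences don't count."""
    return " ".join(name.lower().split())


def case_correct(predicted, expected):
    """Exact set match: no missing relatives and no extra ones."""
    return {normalise(n) for n in predicted} == {normalise(n) for n in expected}


def accuracy(predictions, golden):
    """Fraction of golden cases the predictions get exactly right."""
    hits = sum(
        case_correct(predictions.get(mp, []), expected)
        for mp, expected in golden.items()
    )
    return hits / len(golden)


# Invented golden dataset and model output.
golden = {
    "MP 1": ["Relative A"],
    "MP 2": [],
    "MP 3": ["Relative B", "Relative C"],
    "MP 4": ["Relative D"],
}
predictions = {
    "MP 1": ["relative  a"],                    # matches after normalisation
    "MP 2": [],                                 # correctly empty
    "MP 3": ["Relative C", "Relative B"],       # order does not matter
    "MP 4": ["Relative D", "Spouse Of MP 4"],   # extra spouse: wrong
}
score = accuracy(predictions, golden)
```

Because the check is plain set equality, the same inputs always give the same score, which is what makes runs comparable across prompts.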

GEPA Optimization: Expensive, Crude, and Effective

Then he wires GEPA into the loop with a Pydantic AI proposer agent that suggests improved prompts, evaluates them, and iterates. He’s refreshingly honest about the tooling’s rough edges (GEPA is synchronous, not very type-safe, and he’s tempted to fork it), but the process still works, ultimately producing a verbose optimized prompt that scores 96.7%. His point is that the technique is state-of-the-art but conceptually simple: generate a prompt, test it, keep the better bits.
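
The generate, test, keep cycle can be sketched as a plain loop. This is a simplified greedy version under stated assumptions, not the library's implementation: `propose` stands in for the proposer agent and `evaluate` for the eval suite, and both are replaced here by hypothetical toy functions so the example runs on its own.

```python
from typing import Callable, Tuple


def optimise(
    seed: str,
    propose: Callable[[str, int], str],
    evaluate: Callable[[str], float],
    rounds: int,
) -> Tuple[str, float]:
    """Greedy loop: keep a proposed prompt only if it scores strictly higher."""
    best, best_score = seed, evaluate(seed)
    for step in range(rounds):
        candidate = propose(best, step)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins (hypothetical). A real proposer would be an LLM agent and a
# real evaluator would run the golden dataset.
RULES = ["exclude spouses", "exclude children", "add emojis", "require elected office"]
WANTED = ["exclude spouses", "exclude children", "require elected office"]


def toy_propose(prompt: str, step: int) -> str:
    return prompt + "\n- " + RULES[step % len(RULES)]


def toy_evaluate(prompt: str) -> float:
    helpful = sum(rule in prompt for rule in WANTED) / len(WANTED)
    penalty = 0.1 if "add emojis" in prompt else 0.0
    return helpful - penalty


best, best_score = optimise("Extract political relatives.", toy_propose, toy_evaluate, 4)
```

In the toy run the unhelpful "add emojis" proposal lowers the score and is discarded, while the three useful rules accumulate. The cost he mentions comes from `evaluate`: in practice every candidate means another full eval run against the model.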

The Q&A Gets to the Real Limits

The audience pushes on overfitting, variance, systematic errors, and model dependence, and Colvin doesn’t dodge. One attendee spots the optimized prompt wrongly excluding aunts and uncles; Colvin agrees that’s likely overfitting to a limited subset. He also says prompt optimization often isn’t worth it for frontier models on public tasks, but absolutely can be worth it when a private-equity firm can save millions by getting a cheaper model to classify 200 million invoices.
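
One common way to surface the overfitting the attendee spotted is to hold some golden cases out of optimization and compare scores afterwards. The sketch below assumes that approach; it is not something shown in the talk, and the helper names and pass/fail results are invented.

```python
import random


def split_cases(case_ids, holdout_size, seed):
    """Shuffle deterministically, then reserve the last holdout_size cases."""
    ids = sorted(case_ids)
    random.Random(seed).shuffle(ids)
    cut = len(ids) - holdout_size
    return ids[:cut], ids[cut:]


def accuracy_on(results, ids):
    """Fraction of the given cases that passed; results maps case id to pass/fail."""
    return sum(results[i] for i in ids) / len(ids)


case_ids = [f"mp-{n}" for n in range(10)]
optimised_on, holdout = split_cases(case_ids, holdout_size=3, seed=0)

# Invented outcome: the optimized prompt passes every case it was tuned on
# but only one of the three cases it never saw.
results = {i: True for i in optimised_on}
results.update({i: i == holdout[0] for i in holdout})

gap = accuracy_on(results, optimised_on) - accuracy_on(results, holdout)
```

A large gap between the two scores suggests the prompt has memorised quirks of the cases it was tuned on, such as a rule that happens to exclude aunts and uncles, rather than learning the task.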

Managed Variables: Playground-in-Prod Without Redeploys

The final demo shows Logfire managed variables controlling an app in production-like conditions. He changes the live system prompt so the same endpoint starts replying in French, then German, and swaps the backend model — all without redeploying the app. The bigger vision is obvious: connect evals, optimization, and managed variables so the platform can eventually tune agents automatically instead of making humans babysit every prompt change.
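
The mechanism behind changing behaviour without a redeploy can be sketched as a value looked up per request instead of baked in at startup. The code below is a generic illustration with an in-memory store; `ManagedVariables` and `build_request` are hypothetical names, the model names are placeholders, and none of this is Logfire's actual API.

```python
import threading


class ManagedVariables:
    """In-memory stand-in for a remote variable service (hypothetical)."""

    def __init__(self, defaults):
        self._values = dict(defaults)
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._values[key]

    def set(self, key, value):
        with self._lock:
            self._values[key] = value


variables = ManagedVariables(
    {"system_prompt": "Reply in English.", "model": "model-a"}
)


def build_request(user_message):
    """Read variables at call time so updates apply to the next request."""
    return {
        "model": variables.get("model"),
        "system": variables.get("system_prompt"),
        "user": user_message,
    }


before = build_request("Hello")

# An operator changes the values while the process keeps running.
variables.set("system_prompt", "Reply in French.")
variables.set("model", "model-b")

after = build_request("Hello")
```

The design point is that the handler never caches the prompt or model name, so the same running process serves the new configuration on its next request.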
