
Playbook
Incident.io uses AI to debug its own AI SRE product — Laurence Jones says their production investigation system runs hundreds of telemetry queries across logs, metrics, traces, historical incidents, and code, which makes human-only debugging too slow and often intractable.
Their biggest practical unlock was turning debugging UIs into downloadable file systems — instead of forcing agents through bespoke interfaces, they export traces, prompt inputs, and system state into files that Claude Code can grep, inspect, and reason over far more effectively.
Evals work like AI unit tests, but raw production cases became too bulky for agents to handle — incident.io stores evals as YAML next to Go prompts, then built a CLI called evaltool so coding agents can list, edit, replace, and add test cases without blowing context limits.
They created an agent runbook for the full red-green prompt-fixing loop — the agent can reproduce a failure with a new eval, modify the prompt until it passes, check that existing evals still pass, and even consolidate the prompt so it doesn’t turn into an unmaintainable mess.
A single bad AI interaction may come from dozens of prompts and agents, not one obvious bug — Jones shows a chatbot graph with roughly 10 agents and around 50 components, plus investigation traces where each green block can hide hundreds of prompts and tool calls.
For large-scale monitoring, they batch thousands of investigations into AI analysis pipelines: a daily backtest might report “86% accurate RCA” (root cause analysis), but that number alone doesn’t say why performance moved, so they use parallel Claude Code agents, cohort clustering, and markdown playbooks to explain what changed and which code or prompts to fix.
Laurence Jones, founding engineer at incident.io, opens by explaining that the company doesn’t just want to help teams respond to incidents for customers like Netflix, Etsy, and Skyscanner — it wants to automate the investigation itself. Their AI SRE system kicks off at incident start, runs through logs, metrics, traces, historical incident data, and even the codebase, then tries to say: here’s the likely problem, and here’s what to do next.
The catch is obvious once he says it out loud: how do you know whether one of these AI-generated investigations is actually good? To judge it properly, a human often has to spend an hour reconstructing the incident from the timeline and postmortem, and that just doesn’t scale when there are hundreds or thousands of prompts under the hood.
Jones frames evals as unit tests for prompts: input goes in, output comes out, and grading criteria decide pass or fail. At incident.io, these live as YAML files next to Go prompts, but realistic test data quickly becomes unwieldy — especially when one failure only shows up with something as big as an almost-complete incident record.
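The "input, output, grading criteria" shape can be sketched in a few lines. This is a minimal illustration, not incident.io's format: the `EvalCase` class, the `run_eval` function, and the idea of criteria as plain Python callables are all assumptions made for the example (their evals live in YAML next to Go prompts, and real grading may itself be model-based).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One test case: an input plus the criteria used to grade the output."""
    name: str
    prompt_input: str
    # Hypothetical shape: each criterion returns True if the output is acceptable.
    criteria: List[Callable[[str], bool]]


def run_eval(case: EvalCase, prompt_fn: Callable[[str], str]) -> bool:
    """Run the prompt on the case input; pass only if every criterion holds."""
    output = prompt_fn(case.prompt_input)
    return all(check(output) for check in case.criteria)


def fake_prompt(text: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return "Likely cause: database connection pool exhausted"


case = EvalCase(
    name="mentions-database",
    prompt_input="Checkout latency spiked after the 14:02 deploy",
    criteria=[lambda out: "database" in out.lower()],
)
passed = run_eval(case, fake_prompt)
```

The point of the structure is that a failing production behavior becomes a named, repeatable case rather than an anecdote.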
They first added a button to “steal an eval from production,” which helped, but those giant production-derived YAMLs became miserable for both humans and coding agents. Their fix was a CLI called evaltool that lets agents ask simple questions like what test cases exist, add one, replace one, or edit one — which made it possible to write an actual runbook for agents to follow.
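To make the idea concrete, here is a rough sketch of the kind of narrow operations such a tool might expose. It is not evaltool's actual interface, which the talk summary does not specify: the function names are hypothetical, and the sketch stores cases as JSON rather than YAML purely to stay within Python's standard library.

```python
import json
import tempfile
from pathlib import Path


def load(path: Path) -> list:
    return json.loads(path.read_text()) if path.exists() else []


def save(path: Path, cases: list) -> None:
    path.write_text(json.dumps(cases, indent=2))


def list_cases(path: Path) -> list:
    """Return only the names, so an agent never has to read the bulky inputs."""
    return [c["name"] for c in load(path)]


def add_case(path: Path, case: dict) -> None:
    cases = load(path)
    if any(c["name"] == case["name"] for c in cases):
        raise ValueError(f"case {case['name']!r} already exists")
    save(path, cases + [case])


def replace_case(path: Path, case: dict) -> None:
    cases = load(path)
    if not any(c["name"] == case["name"] for c in cases):
        raise KeyError(case["name"])
    save(path, [case if c["name"] == case["name"] else c for c in cases])


# Two production-sized cases; listing them costs a few tokens, not 100 KB.
eval_file = Path(tempfile.mkdtemp()) / "evals.json"
add_case(eval_file, {"name": "pool-exhaustion", "input": "x" * 50_000})
add_case(eval_file, {"name": "bad-deploy", "input": "y" * 50_000})
replace_case(eval_file, {"name": "bad-deploy", "input": "short repro"})
names = list_cases(eval_file)
```

The design choice worth noticing is that every operation returns or accepts a small, targeted payload, which is what keeps the agent's context free for the actual prompt work.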
Once agents could manipulate evals cleanly, they could do the whole workflow: reproduce the failure, patch the prompt, verify the new eval passes, then make sure older evals still pass too. Jones highlights one extra step that feels very lived-in: after repeated edits, prompts get bloated, so they ask the agent to consolidate and simplify the prompt at the end.
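The ordering of that loop can be sketched as follows. This is a toy under stated assumptions: `red_green` is a hypothetical helper, the evals here inspect the prompt text directly instead of running it through a model, and the candidate edits are supplied up front rather than generated by an agent.

```python
from typing import Callable, List

Eval = Callable[[str], bool]  # takes a prompt, returns True when it passes


def red_green(prompt: str, new_eval: Eval, existing: List[Eval],
              candidates: List[str]) -> str:
    """Return the first candidate that fixes new_eval without regressions."""
    # Red: the new eval must fail on the current prompt, or it reproduces nothing.
    if new_eval(prompt):
        raise ValueError("new eval already passes; it does not reproduce the failure")
    for candidate in candidates:
        # Green: the new eval passes, and every older eval still passes too.
        if new_eval(candidate) and all(e(candidate) for e in existing):
            return candidate
    raise RuntimeError("no candidate passed every eval")


existing = [lambda p: "cite evidence" in p.lower()]
new_eval = lambda p: "check recent deploys" in p.lower()
start = "You are an SRE. Always cite evidence."
candidates = [
    "You are an SRE. Check recent deploys first.",  # fixes the bug, regresses
    "You are an SRE. Always cite evidence. Check recent deploys first.",
]
fixed = red_green(start, new_eval, existing, candidates)
```

The consolidation step would fit the same check: treat the simplified prompt as one more candidate and accept it only if the full eval set still passes.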
Even with a good eval loop, you still need to know where the bug lives. Jones shows their chatbot as a sprawling graph with around 10 agents and dozens of prompts, tools, and subcomponents, then says their investigations are even worse — each visible step in the UI can expand into hundreds of prompts and tool calls, so one subtle mistake can poison the whole diagnosis.
Their debugging tools were useful for humans, but not for agents, so they started downloading the whole interaction as a file system and dropping it into a Claude Code sandbox. That changed the workflow: instead of clicking around a UI, they can say, “This behaved badly — what went wrong?” and the agent can inspect the full trace, read the codebase, and point to the exact place to modify.
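As a rough illustration of why this helps, the sketch below writes a nested trace into one directory per step and then searches it the way an agent would with grep. The trace shape, the file names (`prompt.txt`, `output.txt`, `meta.json`), and both helper functions are assumptions for the example, not incident.io's export format.

```python
import json
import tempfile
from pathlib import Path


def export_trace(node: dict, root: Path) -> None:
    """Write one directory per step, holding its prompt, output, and children."""
    step_dir = root / node["name"]
    step_dir.mkdir(parents=True, exist_ok=True)
    (step_dir / "prompt.txt").write_text(node.get("prompt", ""))
    (step_dir / "output.txt").write_text(node.get("output", ""))
    (step_dir / "meta.json").write_text(json.dumps(node.get("meta", {}), indent=2))
    for child in node.get("children", []):
        export_trace(child, step_dir)


def grep(root: Path, needle: str) -> list:
    """Return relative paths of files containing needle, like `grep -rl`."""
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and needle in p.read_text()
    )


trace = {
    "name": "investigation",
    "prompt": "Investigate incident",
    "children": [
        {"name": "search_logs", "prompt": "Find errors in logs",
         "output": "3 timeout errors from payments-db"},
        {"name": "rank_causes", "prompt": "Rank candidate causes",
         "output": "Top cause: bad deploy"},
    ],
}
root = Path(tempfile.mkdtemp())
export_trace(trace, root)
hits = grep(root, "timeout")
```

Once the trace is on disk, the agent needs no bespoke API: the directory layout mirrors the call graph, and ordinary file tools do the navigation.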
At scale, they run thousands of investigations daily across hundreds of customer accounts, so a rolled-up number like “86% accurate RCA” isn’t enough. Their answer is a repo called scrapbook with markdown playbooks that spin up maybe 25 sub-agents in parallel, analyze investigations one by one, cluster failures into cohorts, and produce a report that explains not just what broke, but why this customer’s system is performing well or badly and what to change next.
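The fan-out and clustering pattern can be approximated like this. It is a sketch only: `analyze` stands in for a sub-agent with hard-coded rules, the cohort labels and record fields are invented for illustration, and threads replace what the talk describes as parallel Claude Code agents driven by markdown playbooks.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def analyze(investigation: dict) -> dict:
    """Stand-in for one sub-agent; a real one would read the exported trace."""
    if investigation["correct"]:
        label = "ok"
    elif not investigation["had_logs"]:
        label = "missing-telemetry"
    else:
        label = "wrong-ranking"
    return {"id": investigation["id"], "cohort": label}


def build_report(investigations: list, workers: int = 25) -> dict:
    """Analyze each investigation in parallel, then group results into cohorts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(analyze, investigations))  # keeps input order
    cohorts = defaultdict(list)
    for r in results:
        cohorts[r["cohort"]].append(r["id"])
    accuracy = len(cohorts["ok"]) / len(results)
    return {"accuracy": accuracy, "cohorts": dict(cohorts)}


sample = [
    {"id": 1, "correct": True, "had_logs": True},
    {"id": 2, "correct": True, "had_logs": True},
    {"id": 3, "correct": False, "had_logs": False},
    {"id": 4, "correct": False, "had_logs": True},
]
report = build_report(sample)
```

The headline accuracy is still there, but the cohorts are what make it actionable: each one names a group of failures that share a plausible cause.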
Jones ends with a simple pattern he thinks generalizes: if your AI product is complex, your internal debugging tools need to be just as AI-native. His strongest claim is that file systems are “exceptionally good agent context” — better than fancy interfaces — and that AI runbooks for repeatable analysis can save “literally days or maybe weeks” of work.