AI Engineer · May 17, 2026 · 17m

Fighting AI with AI — Lawrence Jones, Incident

TL;DR

incident.io uses AI to debug its own AI SRE product: Lawrence Jones says their production investigation system runs hundreds of telemetry queries across logs, metrics, traces, historical incidents, and code, which makes human-only debugging too slow and often intractable.
Their biggest practical unlock was turning debugging UIs into downloadable file systems: instead of forcing agents through bespoke interfaces, they export traces, prompt inputs, and system state into files that Claude Code can grep, inspect, and reason over far more effectively.
Evals work like AI unit tests, but raw production cases became too bulky for agents to handle: incident.io stores evals as YAML next to Go prompts, then built a CLI called evaltool so coding agents can list, edit, replace, and add test cases without blowing context limits.
They created an agent runbook for the full red-green prompt-fixing loop: the agent can reproduce a failure with a new eval, modify the prompt until it passes, check that existing evals still pass, and even consolidate the prompt so it doesn’t turn into an unmaintainable mess.
A single bad AI interaction may come from dozens of prompts and agents, not one obvious bug: Jones shows a chatbot graph with roughly 10 agents and around 50 components, plus investigation traces where each green block can hide hundreds of prompts and tool calls.
For large-scale monitoring, they batch thousands of investigations into AI analysis pipelines: a daily backtest might report “86% accurate RCA” (root cause analysis), but that number alone explains nothing, so they use parallel Claude Code agents, cohort clustering, and markdown playbooks to explain why performance changed and what code or prompts to fix.

Summary

The real goal: automate production investigations

Lawrence Jones, founding engineer at incident.io, opens by explaining that the company doesn’t just want to help teams respond to incidents for customers like Netflix, Etsy, and Skyscanner; it wants to automate the investigation itself. Their AI SRE system kicks off when an incident starts, runs through logs, metrics, traces, historical incident data, and even the codebase, then tries to say: here’s the likely problem, and here’s what to do next.

Why evaluating this stuff gets brutally hard

The catch is obvious once he says it out loud: how do you know whether one of these AI-generated investigations is actually good? To judge it properly, a human often has to spend an hour reconstructing the incident from the timeline and postmortem, and that just doesn’t scale when there are hundreds or thousands of prompts under the hood.

Evals as AI unit tests — and where they break

Jones frames evals as unit tests for prompts: input goes in, output comes out, and grading criteria decide pass or fail. At incident.io, these live as YAML files next to Go prompts, but realistic test data quickly becomes unwieldy — especially when one failure only shows up with something as big as an almost-complete incident record.
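
To make the unit-test framing concrete, here is a minimal sketch in Python (incident.io's own evals are YAML files next to Go prompts, and their schema is not shown in the talk). The names `EvalCase`, `Criterion`, `run_eval`, and `fake_summarise` are all hypothetical, and the model call is replaced by a stand-in function so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical shapes: the talk only says an eval has an input, an output,
# and grading criteria. None of these names come from incident.io.
@dataclass
class Criterion:
    description: str
    check: Callable[[str], bool]  # True when the output satisfies the criterion

@dataclass
class EvalCase:
    name: str
    input: str
    criteria: list[Criterion] = field(default_factory=list)

def run_eval(prompt_fn: Callable[[str], str], case: EvalCase) -> tuple[bool, list[str]]:
    """Run one case and return (passed, descriptions of failed criteria)."""
    output = prompt_fn(case.input)
    failed = [c.description for c in case.criteria if not c.check(output)]
    return (not failed, failed)

# Stand-in for a real model call so the sketch runs without network access.
def fake_summarise(incident_text: str) -> str:
    return "Likely cause: " + incident_text.split(".")[0]

case = EvalCase(
    name="mentions-deploy",
    input="Deploy 4821 raised error rates. Rollback fixed it.",
    criteria=[
        Criterion("names the deploy", lambda out: "Deploy 4821" in out),
        Criterion("stays under 200 characters", lambda out: len(out) < 200),
    ],
)
passed, failed = run_eval(fake_summarise, case)
```

In a real setup the criteria would more likely be graded by a model than by string checks; the point is only that each case is an input plus pass/fail conditions, which is what makes a red-green workflow possible.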

The small CLI that made agents useful on evals

They first added a button to “steal an eval from production,” which helped, but those giant production-derived YAMLs became miserable for both humans and coding agents. Their fix was a CLI called evaltool that lets agents ask simple questions like what test cases exist, add one, replace one, or edit one — which made it possible to write an actual runbook for agents to follow.
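
The talk does not show evaltool's actual commands or flags, so the following is only a sketch of the idea behind it: expose small list/add/replace/edit operations so an agent works from a short index rather than loading a huge eval file into context. `EvalSuite` and its methods are hypothetical names, and the store is in memory rather than backed by YAML.

```python
from dataclasses import dataclass, field

@dataclass
class EvalSuite:
    """In-memory stand-in for one prompt's eval file (hypothetical layout)."""
    cases: dict[str, dict] = field(default_factory=dict)

    def list_cases(self) -> list[str]:
        # Return names only, so an agent sees a short index, not full payloads.
        return sorted(self.cases)

    def add(self, name: str, case: dict) -> None:
        if name in self.cases:
            raise KeyError(f"case {name!r} already exists; use replace")
        self.cases[name] = case

    def replace(self, name: str, case: dict) -> None:
        if name not in self.cases:
            raise KeyError(f"no case named {name!r}")
        self.cases[name] = case

    def edit(self, name: str, **changes) -> None:
        # Patch individual fields without re-sending the whole case.
        if name not in self.cases:
            raise KeyError(f"no case named {name!r}")
        self.cases[name] = {**self.cases[name], **changes}

suite = EvalSuite()
# A production-sized input stays inside the store; the agent never has to read it.
suite.add("big-incident", {"input": "x" * 50_000, "expect": "mentions rollback"})
suite.add("empty-timeline", {"input": "", "expect": "asks for more data"})
suite.edit("empty-timeline", expect="says there is not enough data")
index = suite.list_cases()
```

Here `index` holds two short names even though one case carries a 50,000-character input, which is the property that keeps an agent within its context limits.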

Let the coding agent run the red-green loop

Once agents could manipulate evals cleanly, they could do the whole workflow: reproduce the failure, patch the prompt, verify the new eval passes, then make sure older evals still pass too. Jones highlights one extra step that feels very lived-in: after repeated edits, prompts get bloated, so they ask the agent to consolidate and simplify the prompt at the end.
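
The loop described above can be sketched as a small function. This is an illustration under simplifying assumptions, not incident.io's runbook: evals are modelled as plain predicates over the prompt text (real evals would run the model), candidate prompts are supplied up front instead of being written by an agent, and the final consolidation step is left out. `red_green` and the sample prompts are hypothetical.

```python
from typing import Callable, Optional

Prompt = str
Eval = Callable[[Prompt], bool]  # returns True when the prompt passes

def red_green(
    prompt: Prompt,
    new_eval: Eval,
    existing_evals: list[Eval],
    candidates: list[Prompt],
) -> Optional[Prompt]:
    """Return the first candidate that fixes the failure without regressions."""
    # Red: the new eval has to fail on the current prompt, otherwise it does
    # not reproduce the bug and passing it later proves nothing.
    if new_eval(prompt):
        raise ValueError("new eval does not reproduce the failure")
    for candidate in candidates:
        # Green: the new eval passes and every older eval still passes.
        if new_eval(candidate) and all(e(candidate) for e in existing_evals):
            return candidate
    return None

current = "Summarise the incident. Be concise."
existing = [lambda p: "Be concise" in p]
reproduces_bug = lambda p: "cite the log line" in p
candidates = [
    "Summarise the incident and cite the log line.",            # regresses
    "Summarise the incident. Be concise and cite the log line.",
]
fixed = red_green(current, reproduces_bug, existing, candidates)
```

The first candidate is rejected because it drops an instruction an older eval depends on, which is the regression check doing its job.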

The bigger problem: modern AI systems are prompt mazes

Even with a good eval loop, you still need to know where the bug lives. Jones shows their chatbot as a sprawling graph with around 10 agents and dozens of prompts, tools, and subcomponents, then says their investigations are even worse — each visible step in the UI can expand into hundreds of prompts and tool calls, so one subtle mistake can poison the whole diagnosis.

The big unlock: export the UI as a file system

Their debugging tools were useful for humans, but not for agents, so they started downloading the whole interaction as a file system and dropping it into a Claude Code sandbox. That changed the workflow: instead of clicking around a UI, they can say, “This behaved badly — what went wrong?” and the agent can inspect the full trace, read the codebase, and point to the exact place to modify.
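
As a rough illustration of what "download the interaction as a file system" can look like, the sketch below writes each step of a trace into its own directory so that ordinary text search works across it. The trace shape, the directory naming, and the file names (`prompt.txt`, `output.txt`, `state.json`) are assumptions made for this example; the talk does not describe incident.io's export layout.

```python
import json
import tempfile
from pathlib import Path

def export_trace(trace: dict, root: Path) -> list[Path]:
    """Write one directory per step, with prompt, output, and state as files."""
    written = []
    for i, step in enumerate(trace["steps"]):
        # Zero-padded index keeps the steps in execution order when listed.
        step_dir = root / f"{i:03d}-{step['name']}"
        step_dir.mkdir(parents=True, exist_ok=True)
        for filename, body in (
            ("prompt.txt", step["prompt"]),
            ("output.txt", step["output"]),
            ("state.json", json.dumps(step.get("state", {}), indent=2)),
        ):
            path = step_dir / filename
            path.write_text(body, encoding="utf-8")
            written.append(path)
    return written

def grep(root: Path, needle: str) -> list[str]:
    """Return relative paths of files containing needle, like `grep -rl`."""
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and needle in p.read_text(encoding="utf-8")
    )

trace = {
    "steps": [
        {"name": "fetch-logs", "prompt": "Find errors in checkout logs",
         "output": "timeout talking to payments-db", "state": {"queries": 3}},
        {"name": "summarise", "prompt": "Summarise findings",
         "output": "Cause: cache eviction"},
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    files = export_trace(trace, root)
    hits = grep(root, "payments-db")
```

Once the trace is laid out like this, a coding agent needs no custom tooling: the same search and read commands it uses on a codebase locate the step where the diagnosis went wrong.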

From one-off debugging to repeatable AI analysis pipelines

At scale, they run thousands of investigations daily across hundreds of customer accounts, so a rolled-up number like “86% accurate RCA” isn’t enough. Their answer is a repo called scrapbook with markdown playbooks that spin up maybe 25 sub-agents in parallel, analyze investigations one by one, cluster failures into cohorts, and produce a report that explains not just what broke, but why this customer’s system is performing well or badly and what to change next.
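
The fan-out and clustering step might look something like the following sketch. Only the figure of roughly 25 parallel workers comes from the talk; the record fields, the `analyse` function (a stand-in for a sub-agent), and the `failure_hint` used as a cohort label are hypothetical, and a real pipeline would derive cohorts from the agents' written analysis rather than a precomputed field.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def analyse(investigation: dict) -> dict:
    # Stand-in for a sub-agent reviewing one investigation.
    correct = investigation["predicted_cause"] == investigation["actual_cause"]
    cohort = "correct" if correct else investigation.get("failure_hint", "unknown")
    return {"id": investigation["id"], "correct": correct, "cohort": cohort}

def run_batch(investigations: list[dict], max_workers: int = 25) -> dict:
    # pool.map returns results in input order, so the report is deterministic.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(analyse, investigations))
    cohorts = defaultdict(list)
    for r in results:
        if not r["correct"]:
            cohorts[r["cohort"]].append(r["id"])
    accuracy = sum(r["correct"] for r in results) / len(results)
    return {"accuracy": accuracy, "cohorts": dict(cohorts)}

investigations = [
    {"id": "inv-1", "predicted_cause": "deploy", "actual_cause": "deploy"},
    {"id": "inv-2", "predicted_cause": "deploy", "actual_cause": "database",
     "failure_hint": "missing-db-metrics"},
    {"id": "inv-3", "predicted_cause": "database", "actual_cause": "database"},
    {"id": "inv-4", "predicted_cause": "network", "actual_cause": "database",
     "failure_hint": "missing-db-metrics"},
]
report = run_batch(investigations)
```

The accuracy figure alone would say half the investigations were wrong; the cohort map adds that both failures share one cause, which points at a specific thing to fix.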

The closing thesis: build internal AI tools like product features

Jones ends with a simple pattern he thinks generalizes: if your AI product is complex, your internal debugging tools need to be just as AI-native. His strongest claim is that file systems are “exceptionally good agent context” — better than fancy interfaces — and that AI runbooks for repeatable analysis can save “literally days or maybe weeks” of work.
