AI Engineer | May 12, 2026 | 15m

Dark Factory: OpenClaw Ships Faster Than You Can Read the Diff — Vincent Koc, Comet ML

TL;DR

Static evals are breaking because AI systems no longer stay still — Vincent Koc argues that agentic apps, adaptive harnesses like OpenClaw, and fast-shipping AI software make fixed benchmarks feel like testing a moving target with yesterday’s assumptions.
AI needs its version of chaos engineering, not just benchmark worship — drawing on his Comet work with customers like Uber, Netflix, and UK banks, he says the industry over-indexes on handcrafted offline tests and under-invests in probing where agents actually fail in the wild.
The shift is from prompt engineering to intent engineering — after the era of “doom scroll, wordsmith instructions,” then context engineering with RAG and tool calling, Koc says 2025 is about systems that infer user intent and self-optimize toward outcomes.
Evaluations should optimize for end states, not canned answers — instead of checking “1 + 1 = 2” style outputs, he wants rubrics for ambiguity, personality, and business goals, where traces and telemetry continuously regenerate what gets tested.
The dangerous part is the changing 20%, not the stable 80% — most agent behavior may look repetitive, but it’s the weird new customer query or strange usage pattern that can wreck the business, so evals need to adapt as those edge cases emerge.
Telemetry-aware agents can start healing themselves — Koc points to harnesses that notice errors, cost spikes, or failures and self-correct, framing evals as living software or even agents themselves rather than frozen datasets.

Summary

From vomiting in early VR to loving janky systems

Vincent Koc opens with a story that tells you exactly who he is: he wore 2013-era VR goggles for three hours even though the warning label said five minutes, then spent three hours vomiting afterward. His point is that life on the edge of technology is always a little broken and weird — and measurement has to account for that, not pretend systems are clean and stable.

Why “evals are dead” is both a joke and a warning

At Comet, Koc works on eval research and benchmarking with universities and companies ranging from Uber to Netflix to UK banks, so he’s not anti-evals. But he says the industry’s fixation on static benchmarks has left a huge gap: unlike software engineering, where unit tests sit alongside observability and chaos engineering, AI still mostly relies on handcrafted question sets and offline checks.
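
To make the chaos engineering comparison concrete, here is a minimal Python sketch of fault injection around a single agent tool call. It is an illustration under stated assumptions, not something shown in the talk and not a Comet or OpenClaw API: `with_chaos`, `lookup_order`, `agent_step`, and `run_experiment` are hypothetical names, and the "agent" is a plain retry loop standing in for a real one.

```python
import random
from typing import Callable


class ToolFault(Exception):
    """Raised when the chaos wrapper injects a failure."""


def with_chaos(tool: Callable[[str], str], failure_rate: float,
               rng: random.Random) -> Callable[[str], str]:
    """Wrap a tool so that a fraction of calls fail on purpose."""
    def wrapped(arg: str) -> str:
        if rng.random() < failure_rate:
            raise ToolFault(f"injected failure for {arg!r}")
        return tool(arg)
    return wrapped


def lookup_order(order_id: str) -> str:
    # Hypothetical tool; a real agent would call an external service here.
    return f"order {order_id}: shipped"


def agent_step(tool: Callable[[str], str], order_id: str,
               max_retries: int = 2) -> str:
    """Stand-in for an agent: retry the tool, then degrade gracefully."""
    for _ in range(max_retries + 1):
        try:
            return tool(order_id)
        except ToolFault:
            continue
    return "fallback: order service unavailable"


def run_experiment(failure_rate: float, trials: int, seed: int = 0) -> dict:
    """Count how often the agent ends in its fallback under injected faults."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_tool = with_chaos(lookup_order, failure_rate, rng)
    results = [agent_step(flaky_tool, str(i)) for i in range(trials)]
    fallbacks = sum(r.startswith("fallback") for r in results)
    return {"trials": trials, "fallbacks": fallbacks}
```

Calling `run_experiment(0.0, 20)` should report zero fallbacks and `run_experiment(1.0, 20)` should report twenty. The interesting runs are the rates in between, where the question is whether the agent's end state stays acceptable as the tool degrades.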

The benchmark obsession misses how agents actually fail

He skewers the conference pattern everyone recognizes: endless benchmark papers that prove a model can do some narrow thing, without helping anyone understand a production agent. The result is giant datasets that feel reassuring until something goes wrong in the real world — and because AI apps are malleable, not static, failure is less an exception than an inevitability.

OpenClaw and the problem of software that rewrites its own harness

Koc references OpenClaw, which he contributes to, as proof that even the testing harness is now changing itself. If the harness adapts as skills and capabilities evolve, then traditional benchmarks can’t keep pace; the test itself has to become adaptive too, which is why he highlights emerging work on adaptive testing for LLM evals.
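
One simple way to picture an adaptive test is a staircase procedure: pick the next item closest to a moving difficulty target, then raise the target after a pass and lower it after a fail. The Python sketch below is a deliberately simplified stand-in and not the specific adaptive testing work the talk refers to. The item names, the difficulty scores in [0, 1], and the `passes` callback are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple


def adaptive_run(items: Dict[str, float],
                 passes: Callable[[str], bool],
                 start: float = 0.5,
                 step: float = 0.2,
                 budget: int = 3) -> List[Tuple[str, float, bool]]:
    """Run up to `budget` items, steering difficulty by the results so far.

    items  : hypothetical test names mapped to a difficulty in [0, 1]
    passes : runs one test against the system and reports pass or fail
    """
    remaining = dict(items)
    target = start
    history: List[Tuple[str, float, bool]] = []
    for _ in range(min(budget, len(items))):
        # Choose the unused item whose difficulty is nearest the target.
        name = min(remaining, key=lambda n: abs(remaining[n] - target))
        difficulty = remaining.pop(name)
        ok = passes(name)
        history.append((name, difficulty, ok))
        # Harder after a pass, easier after a fail, clamped to [0, 1].
        target = min(1.0, target + step) if ok else max(0.0, target - step)
    return history


# Hypothetical item bank and a fake system that handles difficulty <= 0.6.
ITEMS = {"trivial": 0.1, "easy": 0.3, "medium": 0.5, "hard": 0.7, "brutal": 0.9}
history = adaptive_run(ITEMS, passes=lambda name: ITEMS[name] <= 0.6)
```

With this item bank the run visits `medium`, then `hard`, then `easy`, spending its budget near the boundary of what the system can do instead of on items it trivially passes or fails.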

From prompt hacking to context engineering to intent engineering

He describes prompt engineering with real contempt and humor — “doom scroll, wordsmith instructions,” just smashing words into models and hoping the output improves, like accidentally discovering a painkiller while trying to treat liver disease. Context engineering made things more tractable because RAG, tool calling, and MCP-based decomposition let teams test pieces of a larger agent, but in 2025 he thinks the real shift is toward intent engineering, where machines adapt to what the user is actually trying to achieve.

Better models make evals harder, not easier

Part of the confusion, he says, is that many people still haven’t internalized how capable models have become. He points to optimization work and ARC-AGI-style puzzles as examples where models can pattern-match on problems that are genuinely difficult for humans, which means personalized, adaptive behavior is becoming normal — and evals now have to answer how your experience differs from mine and whether both are still “correct.”

The new eval stack: traces, rubrics, always-on optimization

His alternative is to move from static answer keys to intent-based outcomes: evaluate ambiguity, personality, and business goals with rubrics, and let traces from real usage generate new suites automatically. He wants online, always-on evals fed by telemetry, where agents notice what’s changing in customer behavior, what’s breaking, and what it’s costing — then update the tests and even self-correct.
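
A rough Python sketch of that idea follows: score each production trace against a weighted rubric over the end state, then promote low-scoring traces into the next test suite. This is an assumption-laden illustration, not the talk's or any vendor's implementation. The `Trace` fields, the `Criterion` type, the three example criteria, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trace:
    """One recorded interaction (hypothetical fields)."""
    user_query: str
    goal_reached: bool   # did the agent reach the intended end state?
    error: bool          # did any step raise or fail?
    cost_usd: float      # total spend for the interaction


@dataclass
class Criterion:
    name: str
    check: Callable[[Trace], bool]
    weight: float


def score(trace: Trace, rubric: List[Criterion]) -> float:
    """Weighted fraction of rubric criteria the trace satisfies."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(trace))
    return earned / total


def cases_from_traces(traces: List[Trace], rubric: List[Criterion],
                      threshold: float = 0.8) -> List[str]:
    """Promote queries from low-scoring traces into new eval cases."""
    return [t.user_query for t in traces if score(t, rubric) < threshold]


# Illustrative rubric: the outcome matters most, then reliability and cost.
RUBRIC = [
    Criterion("goal reached", lambda t: t.goal_reached, weight=2.0),
    Criterion("no errors", lambda t: not t.error, weight=1.0),
    Criterion("cost under 5 cents", lambda t: t.cost_usd <= 0.05, weight=1.0),
]

TRACES = [
    Trace("refund my last order", True, False, 0.01),           # scores 1.0
    Trace("refund all orders since 2019", True, False, 0.20),   # scores 0.75
    Trace("cancel but keep my discount", False, True, 0.02),    # scores 0.25
]

new_cases = cases_from_traces(TRACES, RUBRIC)
```

Here `new_cases` contains the second and third queries. Re-running this over fresh traces on a schedule is one simple way a suite could keep regenerating from telemetry instead of staying a fixed answer key.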

The real threat is the weird 20%

Koc closes on a memorable framing: maybe 80% of agent behavior is the stable, known part, but the 20% that keeps changing is what can blow up your business. His thesis is that evals should stop being treated as frozen datasets and start being treated as code, software, or even living agents — self-optimizing systems defined by the end state you want, not a static set of examples from the past.
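
As a toy illustration of separating the stable share from the changing share, the Python sketch below flags incoming queries that look unlike anything already covered. It uses word-overlap (Jaccard) similarity purely to stay dependency-free; a production system would more plausibly use embeddings. The function names, the 0.3 threshold, and the sample queries are assumptions, and the 80/20 split in the talk is a framing, not a measured ratio.

```python
from typing import List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens; crude on purpose."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_novel(query: str, known: List[str], threshold: float = 0.3) -> bool:
    """True when the query is dissimilar to every known query."""
    q = tokens(query)
    return all(jaccard(q, tokens(k)) < threshold for k in known)


def split_traffic(queries: List[str], known: List[str],
                  threshold: float = 0.3) -> Tuple[List[str], List[str]]:
    """Partition queries into (stable, novel) relative to known coverage."""
    stable: List[str] = []
    novel: List[str] = []
    for q in queries:
        (novel if is_novel(q, known, threshold) else stable).append(q)
    return stable, novel


# Hypothetical coverage and traffic.
KNOWN = ["where is my order", "cancel my subscription"]
INCOMING = ["where is my order please", "can your bot write my tax return"]
stable, novel = split_traffic(INCOMING, KNOWN)
```

The first incoming query lands in `stable` and the second in `novel`. The novel bucket is the natural place to look first when deciding which new cases the evals should absorb.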
