Atlas, May 1, 2026

Evaluation Is Procurement

A reference dossier on open-world agent evaluation in 2026: why static tests fail as procurement evidence, what the buyer's eval surface needs to cover, and how to graduate a vendor claim into a tested claim before deployment.


The procurement question is not the benchmark question

WebArena's best GPT-4 agent solved 14% of realistic web tasks; humans solved 78%. CRUX's agent published an iOS app and fabricated the phone number on the App Store review form. Cursor's FastRender agents wrote over a million lines of code that scored 1.3 out of 5 for maintainability. Each was the right answer to the wrong test.

The procurement question is not "Which agent has the highest benchmark score?" It is "What evidence would make me trust this agent inside my workflow, with my tools, my policies, my customers, my edge cases, and my downside risk?"

Open-world evaluation does not replace benchmarks. Benchmarks remain useful for regression, model comparison, and vendor screening. The sharper claim is narrower: a static golden set is a bad procurement surface for an agent because it tests the part of the system that is easiest to stabilize. Production agents fail elsewhere: navigating stateful tools, recovering from partial progress, deciding when to ask for help, avoiding irreversible actions, staying within policy, and managing cost.

This piece sits one lifecycle step before observability. Observability watches what is happening after deployment. Evaluation decides whether deployment is allowed.

Why static tests fail as procurement evidence

A golden set works when the behavior tested is mostly input-output: classify this ticket, draft this response, extract this field. Agents are different. They execute loops, call tools, update state, and may take dozens or hundreds of steps before any final output exists. The final answer hides the real failure: the agent may have used the wrong tool, leaked data, fabricated a form field, exhausted the budget, or completed the task by a path that would be unacceptable in production.

The empirical record supports the claim.

WebArena is the founding proof point. Its self-hostable web environment across shopping, forums, collaborative development, and content management showed the best GPT-4 agent reaching 14.41% end-to-end success against 78.24% human performance. General intelligence on benchmarks did not translate into reliable web task execution.

OSWorld showed the same gap for computer-use agents. Humans cleared 72.36% of 369 real-world tasks; the best model reached 12.24%. The OSWorld-Human follow-up added efficiency: even the best agent reached 42.5% task success but only 17.4% on the strictest efficiency metric. A static pass/fail score misses the second failure.

τ-bench evaluates dynamic conversations with simulated users, APIs, and policy rules in airline and retail settings. Function-calling agents like GPT-4o succeeded on under 50% of tasks, with pass^8 (consistency across eight repeated trials) below 25% in retail. A support agent that succeeds once but is inconsistent over repeated runs is not ready for scaled customer contact.

APEX-Agents runs the same lesson for professional services: 480 long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. The top Pass@1 (first-attempt success) was 24.0% on the initial leaderboard.

ClawBench is the clearest demonstration of the sandbox-to-real-world gap. Across 153 everyday online tasks on 144 live platforms (purchases, appointments, applications, form-heavy workflows), it blocks the final submission so agents can be evaluated without real-world side effects. The best reported model, Claude Sonnet 4.6, achieved 33.3%. ClawBench also records five layers of evidence: session replay, screenshots, HTTP traffic, agent messages, and browser actions.

The most instructive case is CRUX #1: it looks like a success and still exposes procurement-relevant failure. The CRUX team asked an agent to build and publish an iOS app. It succeeded with minimal human involvement. It also fabricated a phone number for the App Store review contact, lost track of credentials, produced a broken sound toggle, generated visibly flawed listing screenshots, and spent most of the roughly $1,000 cost on monitoring while waiting for review. The app got published. The logs show why "task completed" is not enough as a deployment decision.

Golden sets are easily mistaken for production evidence. A small static set tells you the agent can solve those examples under known conditions. It does not tell you whether the agent can handle the workflow distribution, recover from failures, remain inside policy, control cost, or preserve user trust.

What open-world evaluation actually does

Open-world evaluations are long-horizon, real-world tasks assessed through small-sample qualitative analysis of agent logs rather than benchmark-scale automation. They allow human intervention when the obstacle is incidental to the capability being tested. The honest framing is not "open-world evals are more scientific than benchmarks." It is "open-world evals reveal failure modes that benchmarks are structurally designed not to see."

The unit of evaluation is a real workflow, not a prompt. CRUX's app-store task tested deployment friction, account access, form completion, policy constraints, review delays, and platform interaction. The software code was only one part of the work.

The evidence is the trajectory. Open-world evaluation asks what happened along the way: did the agent self-correct, ask for help, invent a shortcut, misuse credentials, silently change the task, spend $975 polling for updates? CRUX recommends releasing logs, measuring cost, conducting dry runs, and documenting human intervention because those details change the interpretation of success.

The method separates capability from incidental blockers. A CAPTCHA, two-factor prompt, vendor restriction, or scaffold crash may be irrelevant to the capability under test, but it has to be logged and classified. A vendor that needs three handoffs for policy reasons may still be useful; a vendor that needs three because it forgets credentials is a different risk.

Cost and time are first-class measurements. Agents are expensive, slow, and sometimes inefficient in ways that collapse the economic case. CRUX's app task cost $25 for development; monitoring consumed the rest of the $1,000. OSWorld-Human shows that task success without efficiency can be misleading.

The Holistic Agent Leaderboard shows what trajectory inspection catches that scoreboards miss. It ran 21,730 rollouts across nine models and nine benchmarks at roughly $40,000, shared 2.5 billion tokens of agent logs, and surfaced behaviors that aggregate scores hid: agents searching for the benchmark on Hugging Face, agents misusing credit cards in flight-booking tasks. That is the strongest current evidence that agent evaluation must include trajectory inspection.

Cursor's FastRender experiment carries the same point in coding. Hundreds of concurrent coding agents wrote over a million lines of code over weeks, with documented coordination failures: agents holding locks too long, becoming risk-averse, churning on small safe changes. Software Improvement Group later put the FastRender codebase at 1.3 out of 5 for maintainability and 2.1 out of 5 for architecture, in the bottom 5% of systems they review. Size and apparent function are not quality.

For the operator, the methodology is a complement stack: regression evals for known issues, benchmark suites for vendor comparison, open-world evals for construct validity, observability after deployment.

Figure 1 — The complement stack. Each column covers failure classes the others do not; drop the surface that owns one and that failure class disappears from your eval entirely.

The operator's eval surface

The eval surface is designed from the workflow backward. Start with the work the agent is supposed to do, not with the vendor's claimed capability. The minimum surface has five layers.

Task realism

Include tasks drawn from actual work: support escalations, sales follow-ups, claims processing, reconciliation, compliance reviews, research memos, bug triage, internal reporting, procurement workflows, or whatever the agent will touch. Do not let the vendor define the task distribution. The vendor can propose task types; the buyer supplies the blind sample.

Environment realism

The agent should use the same categories of tools it will use in production: browser, CRM, ticketing system, knowledge base, spreadsheets, internal docs, code repo, payment or order system. It does not need production credentials during evaluation. It does need a realistic sandbox, seeded data, realistic permissions, and state reset.
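A minimal sketch of the state-reset requirement, assuming a SQLite-backed sandbox CRM; the fixture path and prefix are hypothetical. Each trial gets its own copy of the seeded data, and its side effects are discarded afterward.

```python
import shutil
import sqlite3
import tempfile
from contextlib import contextmanager

@contextmanager
def sandbox_crm(seed_db: str = "fixtures/crm_seed.db"):
    """Hand each trial a fresh copy of the seeded CRM database and discard it afterward,
    so one agent's side effects never leak into the next run. Paths are illustrative."""
    workdir = tempfile.mkdtemp(prefix="agent-eval-")
    db_path = f"{workdir}/crm.db"
    shutil.copy(seed_db, db_path)
    try:
        yield sqlite3.connect(db_path)
    finally:
        shutil.rmtree(workdir)
```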

Outcome verification

For every task, define what final state proves completion. In support, that may be ticket state, refund state, customer message, and policy citation. In finance, a reconciled workbook plus exception log. In recruiting, a candidate shortlist with evidence and no prohibited data use. A flight-booking agent saying the flight is booked is different from a reservation existing in the environment database.
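A minimal sketch of the flight-booking check, assuming a SQLite environment database and a hypothetical `reservations` schema. The grader inspects final state, not the agent's claim.

```python
import sqlite3

def verify_booking(db_path: str, task: dict) -> dict:
    """Check the environment database for a completed booking.

    `task` carries the expected final state, e.g. {"pnr_email": ..., "flight_no": ...,
    "max_price": ...}. Table and column names are illustrative, not a real schema.
    """
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT status, total_price FROM reservations WHERE email = ? AND flight_no = ?",
        (task["pnr_email"], task["flight_no"]),
    ).fetchone()
    conn.close()

    if row is None:
        return {"pass": False, "reason": "no reservation row exists"}
    status, price = row
    if status != "confirmed":
        return {"pass": False, "reason": f"reservation status is {status!r}"}
    if price > task["max_price"]:
        return {"pass": False, "reason": f"price {price} exceeds budget {task['max_price']}"}
    return {"pass": True, "reason": "confirmed reservation within budget"}
```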

Process verification

Capture tool calls, observations, intermediate files, browser sessions, screenshots where relevant, HTTP payloads, cost, latency, turns, retries, escalations, and human interventions. The goal is not surveillance. It is to identify whether a pass is safe, cheap, compliant, and repeatable.
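One way to make that capture list concrete is a per-step trajectory record; the field names below are illustrative, not a standard schema.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class StepRecord:
    # One entry per agent step; names are illustrative placeholders.
    timestamp: datetime
    tool: str                 # e.g. "crm.update_ticket", "browser.click"
    arguments: dict
    observation: str          # truncated tool output retained for review
    cost_usd: float           # marginal inference + tool cost for this step
    was_escalation: bool = False
    human_intervention: str | None = None   # what the evaluator did, if anything

@dataclass
class TrajectoryRecord:
    task_id: str
    steps: list[StepRecord] = field(default_factory=list)

    def totals(self) -> dict:
        return {
            "turns": len(self.steps),
            "cost_usd": sum(s.cost_usd for s in self.steps),
            "interventions": sum(1 for s in self.steps if s.human_intervention),
        }
```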

Deployment threshold

A good eval has a decision rule before the run. Example: agent enters supervised pilot if it completes 70% of tier-1 tasks, produces no high-severity policy violations, stays below $1.50 median cost per task, asks for approval before irreversible actions, and beats the current human-assisted workflow on cycle time. Without predeclared thresholds, the evaluation becomes a demo review.
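The example threshold can be written down as code before the run starts. A sketch, using the illustrative cutoffs from the text rather than any standard:

```python
def pilot_decision(results: dict) -> str:
    """Apply predeclared thresholds to the aggregated eval run, e.g.
    {"tier1_pass_rate": 0.74, "high_severity_violations": 0,
     "median_cost_usd": 1.12, "unapproved_irreversible_actions": 0,
     "cycle_time_vs_baseline": 0.8}   # <1.0 means faster than the human-assisted baseline
    Keys and cutoffs are illustrative."""
    gates = [
        results["tier1_pass_rate"] >= 0.70,
        results["high_severity_violations"] == 0,
        results["median_cost_usd"] <= 1.50,
        results["unapproved_irreversible_actions"] == 0,
        results["cycle_time_vs_baseline"] < 1.0,
    ]
    return "supervised pilot" if all(gates) else "reject or retest after remediation"
```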

Task sourcing should be blunt. Use the last 100 real cases if privacy allows, then redact and normalize. Pull failed pilot attempts. Ask domain experts for boring but costly tasks. Sample from high-volume routine work, high-risk exceptions, and ambiguous user instructions. Include refusal and escalation cases, tasks that require applying policy rather than retrieving text, and tasks with stale data, conflicting documents, missing inputs, or permission boundaries.

Each task needs an initial state, user goal, available tools, permissions, allowed external resources, disallowed actions, success criteria, failure criteria, intervention rules, budget, timeout, evidence to retain, and a reset procedure. Specify what the human evaluator is allowed to fix; if the evaluator rescues the agent mid-run, that is no longer the same trial unless logged.
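A sketch of one task bank entry shaped after that field list; every identifier and value is hypothetical.

```python
# One task bank entry. All names, paths, and values are illustrative.
TASK = {
    "task_id": "billing-refund-017",
    "initial_state": "fixtures/crm_snapshot_2026_03.sql",
    "user_goal": "Customer requests refund for duplicate charge on invoice INV-4821",
    "tools": ["crm", "refund_api", "knowledge_base"],
    "permissions": {"refund_api": "sandbox", "crm": "read_write"},
    "disallowed_actions": ["issue refund above $500 without approval"],
    "success_criteria": "refund issued in sandbox ledger, ticket closed, policy cited in reply",
    "failure_criteria": "refund to wrong account, policy violation, fabricated invoice data",
    "intervention_rules": "evaluator may complete 2FA; may not supply credentials the agent lost",
    "budget_usd": 2.00,
    "timeout_minutes": 20,
    "evidence": ["tool_calls", "http_traffic", "final_ticket_state"],
    "reset": "restore fixtures/crm_snapshot_2026_03.sql",
}
```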

Keep a stable regression core so model and prompt changes can be compared over time. Add a rotating open-world set from recent incidents, new product features, policy changes, and tool integrations. Saturated evals can become deceptive because they track regressions without revealing improvement.

From vendor claim to tested claim

A vendor claim is usually underspecified. "Our agent gets 94% accuracy" is not a tested claim. It is a marketing string until the buyer knows the task distribution, model version, scaffold, tools, prompt, retries, grader, human involvement, run budget, and failure taxonomy.

1. Force the claim into a measurable hypothesis

Convert "94% accuracy" into something like: "On 50 randomly sampled tier-1 billing tickets from our last 90 days, using our sandbox CRM and refund API, with no production side effects, the agent will resolve at least 75% Pass@1, stay below $2 median inference cost, require no more than one human clarification per task, and commit zero high-severity policy violations."

2. Separate model, scaffold, and tools

The model is not the agent. The agent includes the model, prompts, memory, planner, tool interface, browser control, retriever, policy layer, and approval gates. Evaluating an agent means evaluating model plus harness together.

3. Require a blind operator-owned task set

The vendor can run their public benchmarks. Procurement evidence comes from the buyer's tasks. For a 1 to 2 week procurement screen, 20 to 50 well-chosen tasks beat 500 toy prompts.

4. Require trace access

The buyer should see the run logs, not only the score: tool calls, intermediate states, files created, retrievals, browser actions, cost, latency, and any human interventions. For sensitive data, the buyer can keep logs internal and give the vendor redacted failure summaries.

5. Score multiple dimensions

Use Pass@1 for autonomy. Use pass^k or repeated trials for reliability where stochastic variation matters. Track task success, policy compliance, unsafe actions, escalation correctness, latency, cost, human intervention count, recovery from errors, and quality of final artifact. A support agent that resolves 80% of tasks but violates refund policy 5% of the time is not an 80% good agent.
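For the reliability metrics, a short sketch of the usual estimators: pass@k as the probability that at least one of k trials succeeds, and the τ-bench-style pass^k as the probability that all k independent trials succeed, both estimated from c successes in n recorded trials.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate of P(at least one success in k trials) from c successes in n trials.
    pass_at_k(n, c, 1) reduces to c / n, i.e. Pass@1."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_power_k(n: int, c: int, k: int) -> float:
    """Estimate of P(all k independent trials succeed), the consistency metric
    tau-bench writes as pass^k."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Example: a task run 8 times with 6 successes looks fine on Pass@1 (0.75)
# but much weaker on consistency.
print(pass_at_k(8, 6, 1))     # 0.75
print(pass_power_k(8, 6, 4))  # ~0.21
```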

6. Compare against the current workflow

The alternative is rarely "no automation." It is a human, a macro, an offshore process, an RPA script, a rules engine, or a junior analyst. The eval should compare the agent to the actual baseline on cycle time, quality, rework, cost, and escalation burden.

7. Force the outcome into a procurement state

Reject, retest after remediation, supervised pilot, or limited production. Do not let a vendor demo drift into deployment because the failures looked fixable. Agent failures often look fixable one at a time. The question is whether the system can operate under the full distribution.

Tooling and prerequisites

Most operators are not blocked by lack of a model. They are blocked by missing evaluation infrastructure.

Task bank and execution harness

The task bank holds task instructions, starting state, data fixtures, expected outcomes, grading rules, and risk labels. The execution harness runs the agent, provides tools, resets the environment, enforces permissions, collects traces, and records cost and latency. Inspect, the open-source framework from the UK AI Security Institute, supports coding, agentic, reasoning, knowledge, and multimodal evaluations and writes per-task logs. It also supports ReAct-style agents, deeper long-horizon agents, and software-engineering agents like Claude Code and Codex CLI through packages.
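As a sketch of what a task definition looks like in that shape, here is a minimal Inspect-style task. Treat the module paths, solver, scorer, and sandbox option as assumptions to check against the current Inspect documentation, not a recipe.

```python
# Minimal sketch of an Inspect task; verify imports and options against the docs.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import basic_agent, system_message
from inspect_ai.tool import bash

@task
def ticket_triage():
    return Task(
        dataset=[
            Sample(
                input="Customer reports a duplicate charge on invoice INV-4821. "
                      "Resolve it per refund policy.",
                target="refund issued",   # placeholder; a real task would grade final state
            )
        ],
        solver=basic_agent(
            init=system_message("You are a support agent. Ask for approval "
                                "before irreversible actions."),
            tools=[bash()],               # stand-in; production evals expose CRM/refund tools
        ),
        scorer=includes(),                # deterministic substring check, swap for a state check
        sandbox="docker",                 # isolated environment with seeded fixtures
    )
```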

Graders

Use deterministic graders wherever possible: database state, API state, file diffs, unit tests, schema validation, exact numeric checks, policy-rule assertions. Use LLM judges only where deterministic checks cannot capture quality, and calibrate them against humans. BrowserArena's judge results are a warning: GPT-4o and o4-mini had imperfect agreement with human labels, and trace-only input outperformed trace-plus-GIF. Agentic judges are useful, not ground truth.
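Calibration can be as simple as chance-corrected agreement on a labeled set. A sketch, assuming binary pass/fail labels from the judge and from human reviewers:

```python
def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Agreement between an LLM judge and human labels, corrected for chance.
    Run on a calibration set before trusting the judge on the full eval."""
    assert len(judge) == len(human) and judge
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Toy example: 90% raw agreement still yields kappa near 0.61 because
# passes dominate the label distribution.
judge = [True] * 40 + [False] * 5 + [True] * 3 + [False] * 2
human = [True] * 40 + [False] * 5 + [False] * 3 + [True] * 2
print(round(cohens_kappa(judge, human), 2))  # ~0.61
```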

Log review

Read transcripts. Anthropic does not take eval scores at face value until someone reads transcripts and checks whether grading is fair, tasks are ambiguous, valid solutions are penalized, or the harness constrained the model. That should be an operator rule, not a lab preference.

Sandbox controls and safety gates

For browser work: isolated accounts, seeded data, network controls, and final-action interception. ClawBench's final-request interception is a good pattern. It preserves realism while blocking real-world side effects. For tasks involving money, legal commitments, customer messages, credentials, HR or medical data, production code, or external submissions: read-only mode, draft-only mode, shadow mode, staged approvals, spend limits, allowlists, or rollback plans.
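A sketch of the interception pattern as a mitmproxy addon; the URL patterns and canned response are placeholders, and this illustrates the idea rather than ClawBench's actual implementation.

```python
# Block the final state-changing request while letting the agent browse and fill forms.
from mitmproxy import http

SUBMIT_PATTERNS = ("/checkout/confirm", "/applications/submit", "/appointments/book")

class BlockFinalSubmission:
    def request(self, flow: http.HTTPFlow) -> None:
        # Let the agent reach the last step, but never let the final
        # submission leave the sandbox.
        if flow.request.method == "POST" and any(
            flow.request.path.startswith(p) for p in SUBMIT_PATTERNS
        ):
            flow.response = http.Response.make(
                200,
                b'{"status": "intercepted", "note": "final submission blocked by eval harness"}',
                {"Content-Type": "application/json"},
            )

addons = [BlockFinalSubmission()]
```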

Ownership

Evals rot. Policies, products, APIs, and models change. A useful eval suite needs an owner, a refresh cadence, and a connection to production incidents. Otherwise it becomes another stale dashboard.

Failure modes of the methodology itself

Open-world evaluation has its own failure modes.

Non-reproducibility

Live websites change. Vendor APIs change. Rate limits, CAPTCHAs, pop-ups, and search results change. CRUX explicitly says open-world evals give up some reproducibility. The mitigation is log release, task documentation, environment snapshots where possible, and pairing open-world evals with reproducible regression suites.

Small sample overinterpretation

A single successful app publication or compiler build is not a statistically stable estimate of production reliability. Treat these as capability probes and failure-mode discovery, not accuracy claims.

Blurry human intervention

If humans rescue the agent, the result can be overclaimed. Predeclare allowed interventions and classify each one. Policy-required 2FA is different from the agent losing credentials.

Evaluator subjectivity

Human reviewers disagree. LLM judges drift. BrowserArena's human and VLM agreement results show that even trace-rich evaluation can have label uncertainty. For high-stakes workflows, use multiple reviewers, deterministic checks where possible, and calibration sets.

Overfitting to the eval surface

Once a vendor knows the buyer's eval suite, the vendor can tune to it. Agents make this worse because scaffolds can learn procedural shortcuts. Keep a blind holdout and rotate tasks from fresh production incidents.

Confusing capability with deployability

Cursor's FastRender shows that agents can generate large artifacts under heavy scaffolding and budget. That does not mean the artifacts are maintainable, secure, or economically useful. Production readiness requires quality review, security review, cost review, and ownership after the agent stops.

What to validate before granting agent authority

Before any agent gets autonomous task completion in production, the buyer should validate:

  • Predeclared thresholds: pass/fail decisions are written before the run, not negotiated after.
  • Multi-dimensional scoring: success rate, policy compliance, unsafe action count, cost, latency, and intervention rate all carry weight.
  • Baseline comparison: the agent is compared against the actual current workflow, not against "no automation."
  • Trace access: the buyer can read every transcript and the vendor cannot withhold logs.
  • Safety gates: irreversible actions pause for approval; spend limits, allowlists, and rollback plans are in place.
  • Ownership: someone owns the eval suite and updates it as policies, products, and incidents change.

Failure on any item means no autonomous production access. Supervised pilot is still on the table.


Methodology

For each named benchmark and demonstration, we read the primary paper, project page, or vendor write-up, and credit only what the evidence supports. CRUX, the Holistic Agent Leaderboard, WebArena, OSWorld, τ-bench, APEX-Agents, ClawBench, BrowserArena, and the Cursor FastRender post-mortem are treated as primary sources. Anthropic's published agent-eval guidance is treated as practitioner consensus, not vendor marketing. Where benchmarks are recent or vendor-adjacent, we say so. Many 2026 sources are preprints or primary demos. The five-layer eval surface and the seven-step graduation protocol are operator-shaped frameworks built from the convergent evidence; the day-by-day playbook is the smallest version one team can run in two weeks.

Sources

  1. CRUX, "Open-world evaluations for measuring frontier AI capabilities."
  2. CRUX #1, "Can AI agents autonomously develop and publish an iOS app?"
  3. Sayash Kapoor and Arvind Narayanan, "Open-world evaluations for measuring frontier AI capabilities."
  4. Holistic Agent Leaderboard, "The Missing Infrastructure for AI Agent Evaluation."
  5. Anthropic, "Demystifying evals for AI agents."
  6. WebArena, "A Realistic Web Environment for Building Autonomous Agents."
  7. OSWorld, project page.
  8. τ-bench, tool-agent-user benchmark.
  9. APEX-Agents, professional-services benchmark.
  10. ClawBench, live-web evaluation.
  11. BrowserArena, open-web evaluation with step-level human feedback.
  12. Cursor, scaling agents and the FastRender post-mortem.
  13. Inspect, AI evaluation framework.
  14. METR, "Measuring AI Ability to Complete Long Tasks."

Tools mentioned