Atlas, May 1, 2026

Evaluation Is Procurement

A reference dossier on open-world agent evaluation in 2026: why static tests fail as procurement evidence, what the buyer's eval surface needs to cover, and how to graduate a vendor claim into a tested claim before deployment.


The procurement question is not the benchmark question

WebArena's best GPT-4 agent solved 14% of realistic web tasks; humans solved 78%. CRUX's agent published an iOS app and fabricated the phone number on the App Store review form. Cursor's FastRender agents wrote over a million lines of code that scored 1.3 out of 5 for maintainability. Each was the right answer to the wrong test.

The procurement question is not "Which agent has the highest benchmark score?" It is "What evidence would make me trust this agent inside my workflow, with my tools, my policies, my customers, my edge cases, and my downside risk?"

Open-world evaluation does not replace benchmarks. Benchmarks remain useful for regression, model comparison, and vendor screening. The sharper claim is narrower: a static golden set is a bad procurement surface for an agent because it tests the part of the system that is easiest to stabilize. Production agents fail elsewhere: navigating stateful tools, recovering from partial progress, deciding when to ask for help, avoiding irreversible actions, staying within policy, and managing cost.

This piece sits one lifecycle step before observability. Observability watches what is happening after deployment. Evaluation decides whether deployment is allowed.

Why static tests fail as procurement evidence

A golden set works when the behavior tested is mostly input-output: classify this ticket, draft this response, extract this field. Agents are different. They execute loops, call tools, update state, and may take dozens or hundreds of steps before any final output exists. The final answer hides the real failure: the agent may have used the wrong tool, leaked data, fabricated a form field, exhausted the budget, or completed the task by a path that would be unacceptable in production.

The empirical record supports the claim.

WebArena is the founding proof point. Its self-hostable web environment across shopping, forums, collaborative development, and content management showed the best GPT-4 agent reaching 14.41% end-to-end success against 78.24% human performance. General intelligence on benchmarks did not translate into reliable web task execution.

OSWorld showed the same gap for computer-use agents. Humans cleared 72.36% of 369 real-world tasks; the best model reached 12.24%. The OSWorld-Human follow-up added efficiency: even the best agent reached 42.5% task success but only 17.4% on the strictest efficiency metric. A static pass/fail score misses the second failure.

τ-bench evaluates dynamic conversations with simulated users, APIs, and policy rules in airline and retail settings. Function-calling agents like GPT-4o succeeded on under 50% of tasks, with pass^8 (consistency across eight repeated trials) below 25% in retail. A support agent that succeeds once but is inconsistent over repeated runs is not ready for scaled customer contact.

APEX-Agents runs the same lesson for professional services: 480 long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. The top Pass@1 (first-attempt success) was 24.0% on the initial leaderboard.

ClawBench is the clearest demonstration of the sandbox-to-real-world gap. Across 153 everyday online tasks on 144 live platforms (purchases, appointments, applications, form-heavy workflows), it blocks the final submission so agents can be evaluated without real-world side effects. The best reported model, Claude Sonnet 4.6, achieved 33.3%. ClawBench also records five layers of evidence: session replay, screenshots, HTTP traffic, agent messages, and browser actions.

The most instructive case is CRUX #1: it looks like a success and still exposes procurement-relevant failure. The CRUX team asked an agent to build and publish an iOS app. It succeeded with minimal human involvement. It also fabricated a phone number for the App Store review contact, lost track of credentials, produced a broken sound toggle, generated visibly flawed listing screenshots, and spent most of the roughly $1,000 cost on monitoring while waiting for review. The app got published. The logs show why "task completed" is not enough as a deployment decision.

Golden sets are easily mistaken for production evidence. A small static set tells you the agent can solve those examples under known conditions. It does not tell you whether the agent can handle the workflow distribution, recover from failures, remain inside policy, control cost, or preserve user trust.

What open-world evaluation actually does

Open-world evaluations are long-horizon, real-world tasks assessed through small-sample qualitative analysis of agent logs rather than benchmark-scale automation. They allow human intervention when the obstacle is incidental to the capability being tested. The honest framing is not "open-world evals are more scientific than benchmarks." It is "open-world evals reveal failure modes that benchmarks are structurally designed not to see."

The unit of evaluation is a real workflow, not a prompt. CRUX's app-store task tested deployment friction, account access, form completion, policy constraints, review delays, and platform interaction. The software code was only one part of the work.

The evidence is the trajectory. Open-world evaluation asks what happened along the way: did the agent self-correct, ask for help, invent a shortcut, misuse credentials, silently change the task, spend $975 polling for updates? CRUX recommends releasing logs, measuring cost, conducting dry runs, and documenting human intervention because those details change the interpretation of success.

The method separates capability from incidental blockers. A CAPTCHA, two-factor prompt, vendor restriction, or scaffold crash may be irrelevant to the capability under test, but it has to be logged and classified. A vendor that needs three handoffs for policy reasons may still be useful; a vendor that needs three because it forgets credentials is a different risk.

Cost and time are first-class measurements. Agents are expensive, slow, and sometimes inefficient in ways that collapse the economic case. CRUX's app task cost $25 for development; monitoring consumed the rest of the $1,000. OSWorld-Human shows that task success without efficiency can be misleading.

The Holistic Agent Leaderboard shows what trajectory inspection catches that scoreboards miss. It ran 21,730 rollouts across nine models and nine benchmarks at roughly $40,000, shared 2.5 billion tokens of agent logs, and surfaced behaviors that aggregate scores hid: agents searching for the benchmark on Hugging Face, agents misusing credit cards in flight-booking tasks. That is the strongest current evidence that agent evaluation must include trajectory inspection.

Cursor's FastRender experiment carries the same point in coding. Hundreds of concurrent coding agents wrote over a million lines of code over weeks, with documented coordination failures: agents holding locks too long, becoming risk-averse, churning on small safe changes. Software Improvement Group later put the FastRender codebase at 1.3 out of 5 for maintainability and 2.1 out of 5 for architecture, in the bottom 5% of systems they review. Size and apparent function are not quality.

For the operator, the methodology is a complement stack: regression evals for known issues, benchmark suites for vendor comparison, open-world evals for construct validity, observability after deployment.

Figure 1 — The complement stack. Each column covers failure classes the others do not; drop the surface that owns one and that failure class disappears from your eval entirely.

The operator's eval surface

The eval surface is designed from the workflow backward. Start with the work the agent is supposed to do, not with the vendor's claimed capability. The minimum surface has five layers.

Task realism

Include tasks drawn from actual work: support escalations, sales follow-ups, claims processing, reconciliation, compliance reviews, research memos, bug triage, internal reporting, procurement workflows, or whatever the agent will touch. Do not let the vendor define the task distribution. The vendor can propose task types; the buyer supplies the blind sample.

Environment realism

The agent should use the same categories of tools it will use in production: browser, CRM, ticketing system, knowledge base, spreadsheets, internal docs, code repo, payment or order system. It does not need production credentials during evaluation. It does need a realistic sandbox, seeded data, realistic permissions, and state reset.
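A minimal sketch of the state-reset requirement, assuming a SQLite-backed sandbox CRM; the fixture path and prefix are hypothetical. Each trial gets its own copy of the seeded data, and its side effects are discarded afterward.

```python
import shutil
import sqlite3
import tempfile
from contextlib import contextmanager

@contextmanager
def sandbox_crm(seed_db: str = "fixtures/crm_seed.db"):
    """Hand each trial a fresh copy of the seeded CRM database and discard it afterward,
    so one agent's side effects never leak into the next run. Paths are illustrative."""
    workdir = tempfile.mkdtemp(prefix="agent-eval-")
    db_path = f"{workdir}/crm.db"
    shutil.copy(seed_db, db_path)
    try:
        yield sqlite3.connect(db_path)
    finally:
        shutil.rmtree(workdir)
```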

Outcome verification

For every task, define what final state proves completion. In support, that may be ticket state, refund state, customer message, and policy citation. In finance, a reconciled workbook plus exception log. In recruiting, a candidate shortlist with evidence and no prohibited data use. A flight-booking agent saying the flight is booked is different from a reservation existing in the environment database.
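A minimal sketch of the flight-booking check, assuming a SQLite environment database and a hypothetical `reservations` schema. The grader inspects final state, not the agent's claim.

```python
import sqlite3

def verify_booking(db_path: str, task: dict) -> dict:
    """Check the environment database for a completed booking.

    `task` carries the expected final state, e.g. {"pnr_email": ..., "flight_no": ...,
    "max_price": ...}. Table and column names are illustrative, not a real schema.
    """
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT status, total_price FROM reservations WHERE email = ? AND flight_no = ?",
        (task["pnr_email"], task["flight_no"]),
    ).fetchone()
    conn.close()

    if row is None:
        return {"pass": False, "reason": "no reservation row exists"}
    status, price = row
    if status != "confirmed":
        return {"pass": False, "reason": f"reservation status is {status!r}"}
    if price > task["max_price"]:
        return {"pass": False, "reason": f"price {price} exceeds budget {task['max_price']}"}
    return {"pass": True, "reason": "confirmed reservation within budget"}
```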

Process verification

Capture tool calls, observations, intermediate files, browser sessions, screenshots where relevant, HTTP payloads, cost, latency, turns, retries, escalations, and human interventions. The goal is not surveillance. It is to identify whether a pass is safe, cheap, compliant, and repeatable.
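One way to make that capture list concrete is a per-step trajectory record; the field names below are illustrative, not a standard schema.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class StepRecord:
    # One entry per agent step; names are illustrative placeholders.
    timestamp: datetime
    tool: str                 # e.g. "crm.update_ticket", "browser.click"
    arguments: dict
    observation: str          # truncated tool output retained for review
    cost_usd: float           # marginal inference + tool cost for this step
    was_escalation: bool = False
    human_intervention: str | None = None   # what the evaluator did, if anything

@dataclass
class TrajectoryRecord:
    task_id: str
    steps: list[StepRecord] = field(default_factory=list)

    def totals(self) -> dict:
        return {
            "turns": len(self.steps),
            "cost_usd": sum(s.cost_usd for s in self.steps),
            "interventions": sum(1 for s in self.steps if s.human_intervention),
        }
```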

Deployment threshold

A good eval has a decision rule before the run. Example: agent enters supervised pilot if it completes 70% of tier-1 tasks, produces no high-severity policy violations, stays below $1.50 median cost per task, asks for approval before irreversible actions, and beats the current human-assisted workflow on cycle time. Without predeclared thresholds, the evaluation becomes a demo review.
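The example threshold can be written down as code before the run starts. A sketch, using the illustrative cutoffs from the text rather than any standard:

```python
def pilot_decision(results: dict) -> str:
    """Apply predeclared thresholds to the aggregated eval run, e.g.
    {"tier1_pass_rate": 0.74, "high_severity_violations": 0,
     "median_cost_usd": 1.12, "unapproved_irreversible_actions": 0,
     "cycle_time_vs_baseline": 0.8}   # <1.0 means faster than the human-assisted baseline
    Keys and cutoffs are illustrative."""
    gates = [
        results["tier1_pass_rate"] >= 0.70,
        results["high_severity_violations"] == 0,
        results["median_cost_usd"] <= 1.50,
        results["unapproved_irreversible_actions"] == 0,
        results["cycle_time_vs_baseline"] < 1.0,
    ]
    return "supervised pilot" if all(gates) else "reject or retest after remediation"
```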

Task sourcing should be blunt. Use the last 100 real cases if privacy allows, then redact and normalize. Pull failed pilot attempts. Ask domain experts for boring but costly tasks. Sample from high-volume routine work, high-risk exceptions, and ambiguous user instructions. Include refusal and escalation cases, tasks that require applying policy rather than retrieving text, and tasks with stale data, conflicting documents, missing inputs, or permission boundaries.

Each task needs an initial state, user goal, available tools, permissions, allowed external resources, disallowed actions, success criteria, failure criteria, intervention rules, budget, timeout, evidence to retain, and a reset procedure. Specify what the human evaluator is allowed to fix; if the evaluator rescues the agent mid-run, that is no longer the same trial unless logged.
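A sketch of one task bank entry shaped after that field list; every identifier and value is hypothetical.

```python
# One task bank entry. All names, paths, and values are illustrative.
TASK = {
    "task_id": "billing-refund-017",
    "initial_state": "fixtures/crm_snapshot_2026_03.sql",
    "user_goal": "Customer requests refund for duplicate charge on invoice INV-4821",
    "tools": ["crm", "refund_api", "knowledge_base"],
    "permissions": {"refund_api": "sandbox", "crm": "read_write"},
    "disallowed_actions": ["issue refund above $500 without approval"],
    "success_criteria": "refund issued in sandbox ledger, ticket closed, policy cited in reply",
    "failure_criteria": "refund to wrong account, policy violation, fabricated invoice data",
    "intervention_rules": "evaluator may complete 2FA; may not supply credentials the agent lost",
    "budget_usd": 2.00,
    "timeout_minutes": 20,
    "evidence": ["tool_calls", "http_traffic", "final_ticket_state"],
    "reset": "restore fixtures/crm_snapshot_2026_03.sql",
}
```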

Keep a stable regression core so model and prompt changes can be compared over time. Add a rotating open-world set from recent incidents, new product features, policy changes, and tool integrations. Saturated evals can become deceptive because they track regressions without revealing improvement.

From vendor claim to tested claim

A vendor claim is usually underspecified. "Our agent gets 94% accuracy" is not a tested claim. It is a marketing string until the buyer knows the task distribution, model version, scaffold, tools, prompt, retries, grader, human involvement, run budget, and failure taxonomy.

1. Force the claim into a measurable hypothesis

Convert "94% accuracy" into something like: "On 50 randomly sampled tier-1 billing tickets from our last 90 days, using our sandbox CRM and refund API, with no production side effects, the agent will resolve at least 75% Pass@1, stay below $2 median inference cost, require no more than one human clarification per task, and commit zero high-severity policy violations."

2. Separate model, scaffold, and tools

The model is not the agent. The agent includes the model, prompts, memory, planner, tool interface, browser control, retriever, policy layer, and approval gates. Evaluating an agent means evaluating model plus harness together.

3. Require a blind operator-owned task set

The vendor can run their public benchmarks. Procurement evidence comes from the buyer's tasks. For a 1 to 2 week procurement screen, 20 to 50 well-chosen tasks beat 500 toy prompts.

4. Require trace access

The buyer should see the run logs, not only the score: tool calls, intermediate states, files created, retrievals, browser actions, cost, latency, and any human interventions. For sensitive data, the buyer can keep logs internal and give the vendor redacted failure summaries.

5. Score multiple dimensions

Use Pass@1 for autonomy. Use pass^k or repeated trials for reliability where stochastic variation matters. Track task success, policy compliance, unsafe actions, escalation correctness, latency, cost, human intervention count, recovery from errors, and quality of final artifact. A support agent that resolves 80% of tasks but violates refund policy 5% of the time is not an 80% good agent.
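For the reliability metrics, a short sketch of the usual estimators: pass@k as the probability that at least one of k trials succeeds, and the τ-bench-style pass^k as the probability that all k independent trials succeed, both estimated from c successes in n recorded trials.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate of P(at least one success in k trials) from c successes in n trials.
    pass_at_k(n, c, 1) reduces to c / n, i.e. Pass@1."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_power_k(n: int, c: int, k: int) -> float:
    """Estimate of P(all k independent trials succeed), the consistency metric
    tau-bench writes as pass^k."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Example: a task run 8 times with 6 successes looks fine on Pass@1 (0.75)
# but much weaker on consistency.
print(pass_at_k(8, 6, 1))     # 0.75
print(pass_power_k(8, 6, 4))  # ~0.21
```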

6. Compare against the current workflow

The alternative is rarely "no automation." It is a human, a macro, an offshore process, an RPA script, a rules engine, or a junior analyst. The eval should compare the agent to the actual baseline on cycle time, quality, rework, cost, and escalation burden.

7. Force the outcome into a procurement state

Reject, retest after remediation, supervised pilot, or limited production. Do not let a vendor demo drift into deployment because the failures looked fixable. Agent failures often look fixable one at a time. The question is whether the system can operate under the full distribution.

Tooling and prerequisites

Most operators are not blocked by lack of a model. They are blocked by missing evaluation infrastructure.

Task bank and execution harness

The task bank holds task instructions, starting state, data fixtures, expected outcomes, grading rules, and risk labels. The execution harness runs the agent, provides tools, resets the environment, enforces permissions, collects traces, and records cost and latency. Inspect, the open-source framework from the UK AI Security Institute, supports coding, agentic, reasoning, knowledge, and multimodal evaluations and writes per-task logs. It also supports ReAct-style agents, deeper long-horizon agents, and software-engineering agents like Claude Code and Codex CLI through packages.
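As a sketch of what a task definition looks like in that shape, here is a minimal Inspect-style task. Treat the module paths, solver, scorer, and sandbox option as assumptions to check against the current Inspect documentation, not a recipe.

```python
# Minimal sketch of an Inspect task; verify imports and options against the docs.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import basic_agent, system_message
from inspect_ai.tool import bash

@task
def ticket_triage():
    return Task(
        dataset=[
            Sample(
                input="Customer reports a duplicate charge on invoice INV-4821. "
                      "Resolve it per refund policy.",
                target="refund issued",   # placeholder; a real task would grade final state
            )
        ],
        solver=basic_agent(
            init=system_message("You are a support agent. Ask for approval "
                                "before irreversible actions."),
            tools=[bash()],               # stand-in; production evals expose CRM/refund tools
        ),
        scorer=includes(),                # deterministic substring check, swap for a state check
        sandbox="docker",                 # isolated environment with seeded fixtures
    )
```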

Graders

Use deterministic graders wherever possible: database state, API state, file diffs, unit tests, schema validation, exact numeric checks, policy-rule assertions. Use LLM judges only where deterministic checks cannot capture quality, and calibrate them against humans. BrowserArena's judge results are a warning: GPT-4o and o4-mini had imperfect agreement with human labels, and trace-only input outperformed trace-plus-GIF. Agentic judges are useful, not ground truth.
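Calibration can be as simple as chance-corrected agreement on a labeled set. A sketch, assuming binary pass/fail labels from the judge and from human reviewers:

```python
def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Agreement between an LLM judge and human labels, corrected for chance.
    Run on a calibration set before trusting the judge on the full eval."""
    assert len(judge) == len(human) and judge
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Toy example: 90% raw agreement still yields kappa near 0.61 because
# passes dominate the label distribution.
judge = [True] * 40 + [False] * 5 + [True] * 3 + [False] * 2
human = [True] * 40 + [False] * 5 + [False] * 3 + [True] * 2
print(round(cohens_kappa(judge, human), 2))  # ~0.61
```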

Log review

Read transcripts. Anthropic does not take eval scores at face value until someone reads transcripts and checks whether grading is fair, tasks are ambiguous, valid solutions are penalized, or the harness constrained the model. That should be an operator rule, not a lab preference.

Sandbox controls and safety gates

For browser work: isolated accounts, seeded data, network controls, and final-action interception. ClawBench's final-request interception is a good pattern. It preserves realism while blocking real-world side effects. For tasks involving money, legal commitments, customer messages, credentials, HR or medical data, production code, or external submissions: read-only mode, draft-only mode, shadow mode, staged approvals, spend limits, allowlists, or rollback plans.
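A sketch of the interception pattern as a mitmproxy addon; the URL patterns and canned response are placeholders, and this illustrates the idea rather than ClawBench's actual implementation.

```python
# Block the final state-changing request while letting the agent browse and fill forms.
from mitmproxy import http

SUBMIT_PATTERNS = ("/checkout/confirm", "/applications/submit", "/appointments/book")

class BlockFinalSubmission:
    def request(self, flow: http.HTTPFlow) -> None:
        # Let the agent reach the last step, but never let the final
        # submission leave the sandbox.
        if flow.request.method == "POST" and any(
            flow.request.path.startswith(p) for p in SUBMIT_PATTERNS
        ):
            flow.response = http.Response.make(
                200,
                b'{"status": "intercepted", "note": "final submission blocked by eval harness"}',
                {"Content-Type": "application/json"},
            )

addons = [BlockFinalSubmission()]
```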

Ownership

Evals rot. Policies, products, APIs, and models change. A useful eval suite needs an owner, a refresh cadence, and a connection to production incidents. Otherwise it becomes another stale dashboard.

Failure modes of the methodology itself

Open-world evaluation has its own failure modes.

Non-reproducibility

Live websites change. Vendor APIs change. Rate limits, CAPTCHAs, pop-ups, and search results change. CRUX explicitly says open-world evals give up some reproducibility. The mitigation is log release, task documentation, environment snapshots where possible, and pairing open-world evals with reproducible regression suites.

Small sample overinterpretation

A single successful app publication or compiler build is not a statistically stable estimate of production reliability. Treat these as capability probes and failure-mode discovery, not accuracy claims.

Blurry human intervention

If humans rescue the agent, the result can be overclaimed. Predeclare allowed interventions and classify each one. Policy-required 2FA is different from the agent losing credentials.

Evaluator subjectivity

Human reviewers disagree. LLM judges drift. BrowserArena's human and VLM agreement results show that even trace-rich evaluation can have label uncertainty. For high-stakes workflows, use multiple reviewers, deterministic checks where possible, and calibration sets.

Overfitting to the eval surface

Once a vendor knows the buyer's eval suite, the vendor can tune to it. Agents make this worse because scaffolds can learn procedural shortcuts. Keep a blind holdout and rotate tasks from fresh production incidents.

Confusing capability with deployability

Cursor's FastRender shows that agents can generate large artifacts under heavy scaffolding and budget. That does not mean the artifacts are maintainable, secure, or economically useful. Production readiness requires quality review, security review, cost review, and ownership after the agent stops.

What to validate before granting agent authority

Before any agent gets autonomous task completion in production, the buyer should validate:

  • Predeclared thresholds: pass/fail decisions are written before the run, not negotiated after.
  • Multi-dimensional scoring: success rate, policy compliance, unsafe action count, cost, latency, and intervention rate all carry weight.
  • Baseline comparison: the agent is compared against the actual current workflow, not against "no automation."
  • Trace access: the buyer can read every transcript and the vendor cannot withhold logs.
  • Safety gates: irreversible actions pause for approval; spend limits, allowlists, and rollback plans are in place.
  • Ownership: someone owns the eval suite and updates it as policies, products, and incidents change.

Failure on any item means no autonomous production access. Supervised pilot is still on the table.


Methodology

For each named benchmark and demonstration, we read the primary paper, project page, or vendor write-up, and credit only what the evidence supports. CRUX, the Holistic Agent Leaderboard, WebArena, OSWorld, τ-bench, APEX-Agents, ClawBench, BrowserArena, and the Cursor FastRender post-mortem are treated as primary sources. Anthropic's published agent-eval guidance is treated as practitioner consensus, not vendor marketing. Where benchmarks are recent or vendor-adjacent, we say so. Many 2026 sources are preprints or primary demos. The five-layer eval surface and the seven-step graduation protocol are operator-shaped frameworks built from the convergent evidence; the day-by-day playbook is the smallest version one team can run in two weeks.

Sources

  1. CRUX, "Open-world evaluations for measuring frontier AI capabilities."
  2. CRUX #1, "Can AI agents autonomously develop and publish an iOS app?"
  3. Sayash Kapoor and Arvind Narayanan, "Open-world evaluations for measuring frontier AI capabilities."
  4. Holistic Agent Leaderboard, "The Missing Infrastructure for AI Agent Evaluation."
  5. Anthropic, "Demystifying evals for AI agents."
  6. WebArena, "A Realistic Web Environment for Building Autonomous Agents."
  7. OSWorld, project page.
  8. τ-bench, tool-agent-user benchmark.
  9. APEX-Agents, professional-services benchmark.
  10. ClawBench, live-web evaluation.
  11. BrowserArena, open-web evaluation with step-level human feedback.
  12. Cursor, scaling agents and the FastRender post-mortem.
  13. Inspect, AI evaluation framework.
  14. METR, "Measuring AI Ability to Complete Long Tasks."

Tools mentioned