Build an Ungameable Eval
A reference dossier for the AI buyer, head of platform, or applied-AI lead about to sign a contract with an AI vendor (model API, agent platform, copilot, support bot, coding assistant): the eight components of an eval set that decides procurement, the tooling that supports each one, and the line between a private eval and the AI vendor's marketing.

TL;DR
The decision isn't which AI vendor scored highest on a benchmark or which one demoed best on the team's sample prompts. That decision sits in the AI vendor's sales motion. The decision here is whether the team owns an ungameable eval, meaning one the AI vendor can't see, can't train against, can't score, or wait out. Ungameable isn't a claim about vendor honesty; it's a structural property of the eval itself. A vendor that wants to play straight is still bound by what its model trained on, what its judge family knows, and what's changed since the team last refreshed the set.
Evaluation is procurement. An AI vendor that wrote the eval sold the team the answer; a public benchmark is theater the moment it's famous enough to win on; a private eval built once and left unchanged is a museum piece by the second model release. AI vendor benchmarks are adversarial evidence the same way a pitch deck is: useful for forming hypotheses, never enough for a decision.
Eight components make an eval ungameable. They cover dataset, scoring, taxonomy, cadence, baselines, reporting, adversarial coverage, and the procurement lock that ties eval results to contract clauses. Skip any of the eight and the AI vendor gets to grade itself on the one that's missing.
A starting tool stack pairs one eval-of-record platform with a specialist judge layer and an adversarial layer the buyer controls:
- Braintrust or LangSmith for dataset management, scoring orchestration, and regression tracking
- Inspect AI for high-sensitivity adversarial and agentic evaluation in sandboxed environments
- Ragas for retrieval-grounded answer evaluation when the system reads from a corpus
- Label Studio or Argilla for human labeling and inter-annotator agreement
- Garak or Promptfoo red-team for prompt-injection and jailbreak coverage
Tooling cost runs from free (Inspect AI, Promptfoo CLI, Helicone Hobby) to $39 per seat per month (LangSmith Plus) and $249 per month (Braintrust Pro), with enterprise plans for hosting on the buyer's own infrastructure when sensitivity requires it. The number worth tracking is the cost of a procurement decision made against a gameable eval.
Buying the eval is the team's job, not the AI vendor's. A platform can store the dataset and run the scorers; it can't decide which failures break the deal.
Three AI Vendors That Sold the Same Eval
A 40-person product team ran a side-by-side bake-off across three AI vendors using a public benchmark suite the team had quietly assembled from open datasets. Every AI vendor scored within two points of the others on the headline number, and the team picked the cheapest. Six months later the chosen AI vendor's tone, refusal posture, and citation quality were visibly different from a competitor the team had ruled out, and the rollback cost a quarter of platform work. The benchmark didn't lie; it just answered a different question. OpenAI's GPT-4 Technical Report reports about 25 percent overlap between GPT-4 pretraining data and HumanEval, with a roughly 2.12 percentage-point degradation after removing contaminated examples. Once a public benchmark is famous enough to win on, it's also famous enough to leak into training data and prompt-engineering folklore. The failure moment wasn't a hallucination; it was the line item that said "evaluated against industry-standard benchmarks." Call it the public-benchmark plateau.
A 200-person SaaS company built an internal eval and ran it across three AI vendors competing for a support-bot replacement. Vendor A scored 87, Vendor B scored 82, Vendor C scored 76. Procurement picked A and started rollout. The Spanish-speaking accounts started complaining inside the first month: tone violations, missed escalations on refund requests, three reported cases of the bot inventing a policy clause that didn't exist. The team rebuilt the eval with a stratified slice: 30 percent English, 25 percent Spanish, 20 percent code-mixed, 15 percent transcripts with audio-to-text errors, 10 percent prompt-injection cases. On the stratified slice, A scored 71, B scored 79, and C scored 81. The team had signed for the wrong AI vendor because the original eval averaged the only failures that mattered into a number that looked fine. Notion's eval program, built on Braintrust with about 70 engineers aligned on evaluation workflows, names the same pattern: the team catches multilingual behavior issues only after building targeted failure datasets for APAC use cases. Call it the missing slice.
A 30-engineer infrastructure team locked in an AI vendor in late 2025 after a clean pilot, scored against an internal eval set the team was proud of. The eval ran twice during procurement and then went into the wiki. Over the next six months the AI vendor shipped two model updates, the team rewrote the system prompt, and the retrieval corpus grew by 40 percent. By spring 2026 the support team's escalation rate was climbing and product was getting customer-success tickets the bot used to handle. Nobody re-ran the eval until a board-prep review. The numbers had dropped 14 points across every category, and the regression had built over months without a single alert. Call it the stale snapshot.
Three teams, three failures, one category mistake. They ran an eval, declared it done, and let the AI vendor own the next move. None of those failures needed a cheating AI vendor; the benchmark leaked, the slice was wrong, and time did the rest.
The lifecycle anchor is the Promptfoo acquisition. OpenAI announced in 2026 that it would acquire Promptfoo, one of the most-used eval and red-team platforms in the buyer market. The technology may be better off inside OpenAI; the buyers who used Promptfoo as their independent governance layer against OpenAI now own a conflict to write down. Choosing an eval platform without thinking about the AI vendor map is its own failure mode.
Action Plan
Days 1 to 7: Write the failure taxonomy before you write the eval. Pull a week of production interactions and tag each one against a draft taxonomy of failure classes for the task. For a support bot, that's wrong policy, invented refund, missed escalation, privacy breach, tone violation, over-refusal, format breakage. For RAG, that's unsupported answer, missing citation, wrong citation, retrieval miss, hallucinated source. Assign severity weights with the function owner, and name the zero-tolerance classes before any AI vendor sees the test. Resist building the dataset this week. The week-one deliverable is a one-page taxonomy with severity weights and named zero-tolerance failures, signed off by the team that owns the budget for failure cleanup.
Days 8 to 14: Build the dataset and the splits. Pull 200 to 500 examples from production, stratified by task class, customer segment, language, data source, and known edge case. Tag each example with task class, risk category, expected answer, severity weight, and whether it's evergreen or drift-sensitive. Split three ways: 40 percent shareable development examples the vendor may see, 25 percent internal validation for team iteration, 35 percent held-out acceptance the AI vendor never sees. Add a smoke set of 25 cases for change-by-change checks. The Friday deliverable is a dataset version 1 in the eval platform, tagged and split, with the held-out acceptance set stored in a project the AI vendor's account can't touch.
Days 15 to 30: Run the bake-off and write the contract. Score the current system, the candidate vendor, and one credible alternative against the held-out set. Calibrate every LLM judge against a human-labeled sample before allowing it into the acceptance run. Report per-category, per-severity, per-example. Map the results into contract schedules: pass criteria, regression definition, cure period, rollback rights, exit terms. Don't sign the contract until the eval results and the contract clauses are in the same review meeting. If the AI vendor refuses private evaluation or refuses to tie regression to remediation, the AI vendor is failing procurement.
The Eight Components and the Rules That Make Each Ungameable
Each sub-case below follows the same template: the component, what ships clean, the ceiling, the rule that makes it ungameable, and the Friday action line. A short table of production examples closes each one.
Dataset Construction and Privacy Isolation
The work is choosing the examples that decide procurement. Where they come from, how they're stratified, how PII is handled, and how the held-out set stays out of the AI vendor's training and optimization loop.
The slot belongs to a buyer-controlled platform with self-hosting available for sensitive workloads. Braintrust's self-hosted setup keeps logs, datasets, prompts, model outputs, human review scores, and judge keys in the buyer's own cloud account (AWS, GCP, or Azure). LangSmith Enterprise offers hybrid and self-host options so data stays inside the buyer's VPC. Weights & Biases Weave runs on Dedicated Cloud or Self-Managed for residency and isolation. Inspect AI is open-source and runs locally, which makes it the default for the highest-sensitivity slices.
What ships clean:
- Examples drawn from real production inputs, stratified by task class, failure risk, segment, language, and known edge case
- Three splits with hard boundaries: development the vendor sees, validation the team uses, acceptance the AI vendor never sees
- Metadata per example: source date, product surface, expected answer, allowed tools, risk category, severity weight, drift-sensitivity flag
- A continuous pipeline that converts production failures into new held-out cases on a weekly cadence
The ceiling appears at synthetic-only or public-only datasets. Synthetic examples cover the edges the production set won't reach but can't replace it as procurement evidence, and public benchmarks are scouting tools that have leaked into too many training corpora to decide on. Once a public set is famous enough to win on, it's already famous enough to have been trained against. The named failure mode is the public-benchmark plateau.
The eval is ungameable when the AI vendor sees only the bounded development slice, the acceptance set stays in storage the AI vendor's account can't reach, and weekly production failures continuously refresh the hidden pool.
If you start this week, pull 200 production examples, tag each with task class and risk, store the held-out 100 in a project the AI vendor's seat can't access, and write a one-paragraph privacy posture into the procurement file by Friday.
Examples of what this looks like in production:
| Use case | Dataset shape | Stack |
|---|---|---|
| B2B SaaS support bot | 200 production tickets, 100 held-out, weekly drift refresh | Braintrust self-hosted |
| Regulated healthcare workflow | 500 redacted production examples, air-gapped acceptance set | Inspect AI on buyer infrastructure |
| Multilingual consumer support | 1,000 stratified tickets across five languages with native-speaker review | LangSmith Enterprise |
Scoring Methodology and Judge Architecture
The work is deciding how each output is judged: rule-based checks, code checks, human review, LLM-as-judge, pairwise comparison, mixed scoring.
The slot belongs to a layered architecture. Deterministic rules check anything a machine can verify: schema, refusal policy, citation count, latency, and cost. LLM judges, drawn from a different vendor family than the candidate, handle the dimensions that scale beyond human review, once the team has calibrated them against human-labeled examples. Human reviewers handle the high-stakes calls: tone, legal correctness, and ambiguous factuality. Braintrust, LangSmith, Promptfoo, Weave, and Inspect AI all support layered scoring. The structural choice isn't the platform; it's the judge map.
What ships clean:
- Deterministic checks for every machine-checkable property (valid JSON, required fields, citation presence, latency, cost)
- LLM judges drawn from a vendor family that isn't the model under test, with documented calibration against human-labeled samples
- Human review on a 25 percent double-labeled slice for inter-annotator agreement
- Pairwise comparison where the question is relative preference, not absolute correctness
The ceiling appears when the model under test grades itself, when the judge comes from the same vendor family as the candidate, or when the rubric is vague enough that every output passes. MT-Bench and Chatbot Arena research identifies position bias, verbosity bias, and self-enhancement bias in LLM judges. OpenAI's own eval guidance says model grading has an error rate, should be validated with human evaluation, and ideally uses a different grading model than the completion model. Promptfoo's default judge picks the same model family as the API key you give it: GPT if you give it an OpenAI key, Claude if you give it an Anthropic key. That's fine when engineers are iterating. It's dangerous in procurement, because the AI vendor you're evaluating ends up grading itself. The named failure mode is judge contamination.
The eval is ungameable when three things hold: the AI vendor can't see the judge setup, every automated judge has been calibrated against human-labeled examples, and the judge model comes from a different vendor family than the candidate.
If you start this week, pick one task class, draft the rule-based checks, write the rubric for the LLM judge, label a 50-example calibration set by hand, and reject any judge that doesn't agree with the human labels above the threshold the team agreed to.
Examples of what this looks like in production:
| Task | Scoring layer | Stack |
|---|---|---|
| Support intent classification | Deterministic schema check plus LLM judge calibrated against human labels | Braintrust plus Label Studio |
| Refund eligibility reasoning | Human review on every Severity-1, LLM judge on the bulk | LangSmith plus Argilla |
| Multilingual tone scoring | Pairwise comparison and native-speaker review | Ragas plus Surge AI |
Failure-Mode Taxonomy and Severity Policy
The work is naming what "wrong" means for the task. A generic pass rate hides the only failures that matter to the business.
The slot belongs to the function owner who pays for failure cleanup, not the eval engineer. The taxonomy lives next to the dataset in the eval platform, with severity weights assigned before any candidate sees the test. Braintrust, LangSmith, Promptfoo, Weave, and Inspect AI all support per-example metadata. Label Studio and Argilla help when the taxonomy needs human consensus before it locks.
What ships clean:
- Every example tagged with one or more failure categories before scoring
- Severity weights agreed in writing with the function owner before the bake-off starts
- Zero-tolerance failure classes named explicitly, with the rule that one Severity-1 failure blocks acceptance regardless of aggregate score
- A misroute review that adds new categories when production surfaces a failure mode the taxonomy didn't anticipate
The ceiling appears when the team reports one aggregate quality number. Aggregates make weak AI vendors look acceptable because catastrophic failures get averaged into the mean. A 92 percent pass rate is useless if the 8 percent includes data leakage, unauthorized refunds, privilege escalation, or invented legal advice. The named failure mode is dashboard theater.
The eval is ungameable when failure categories and severity weights are buyer-owned, hidden from the AI vendor during acceptance, and tied to hard gates the AI vendor can't argue with after the fact.
If you start this week, run a tagging workshop with the function owner and engineering, assign severity weights to the top 12 failure categories, name the zero-tolerance set, and write the policy into the procurement file before any AI vendor runs the eval.
Examples of what this looks like in production:
| Task | Failure categories tagged | Stack |
|---|---|---|
| Support bot | Wrong policy, invented refund, missed escalation, tone violation | Internal taxonomy doc plus Label Studio |
| Coding agent | Syntax error, security bug, wrong file edited, dependency breakage | Internal taxonomy doc plus Inspect AI |
| RAG application | Unsupported answer, wrong citation, retrieval miss, hallucinated source | Internal taxonomy doc plus Argilla |
Cadence, Freshness, and Drift Capture
The work is keeping the eval current with the system it scores. Production data drifts as customers and the product evolve, the AI vendor ships new model versions between renewals, and engineering keeps rewriting the system prompt against whatever the current model does best. A frozen eval becomes a launch snapshot that goes blind to all of it.
The slot belongs to platforms that pull production traces back into the eval dataset. Braintrust feeds production traces into datasets for refresh. LangSmith converts online issues into offline test cases and supports dataset versioning with tagged versions for specific runs. Weave connects evaluation with tracing, scorers, CI automations, and alerts. Helicone is useful upstream as observability and trace export, less so as eval-of-record after the Experiments feature was removed on September 1, 2025.
What ships clean:
- A smoke set of 25 cases on every change to model, prompt, retrieval, tool, or policy
- A full regression nightly or before merge for high-risk systems
- A production-sampled drift set weekly, with new failures auto-promoted to candidate eval cases
- A procurement acceptance set rerun before renewal, AI vendor expansion, or any material model upgrade
The ceiling appears when the eval freezes. A frozen eval is easy to pass, easy to forget, and overfit to the data the team had when the eval was built. AI vendor releases ship faster than annual reviews. The named failure mode is the stale snapshot.
The eval is ungameable when refresh cadence is shorter than the AI vendor's release cycle, the AI vendor can't predict the refresh slice, and production failures automatically create candidate cases on a schedule the AI vendor doesn't see.
If you start this week, set up a weekly job that samples 50 production interactions, runs them through the current scoring layer, surfaces failures to a human reviewer, and pushes any confirmed failure into the held-out pool.
Examples of what this looks like in production:
| Cadence | Trigger | Stack |
|---|---|---|
| Smoke set (25 cases) | Every prompt, model, retrieval, or tool change | Braintrust in CI |
| Nightly regression | High-risk system pre-merge | LangSmith |
| Weekly drift sample (50 cases) | Auto-promote confirmed failures into the held-out pool | Braintrust plus LangSmith |
Comparative Baseline Design
The work is choosing the comparisons that answer the right questions. A single AI vendor scored in isolation tells the team almost nothing about whether to buy.
The slot belongs to platforms that keep permanent records of each run and support side-by-side comparison. Braintrust experiments are locked once they run, which is the right shape for comparing model, prompt, retrieval, and tool variants without anyone editing the history. LangSmith regression tests highlight regressions and improvements relative to a baseline. Weave supports comparing model objects against the same dataset. Promptfoo handles matrix comparisons across prompts, providers, and assertions. Inspect AI supports benchmark-shaped comparisons across models and agents.
What ships clean:
- A current-system baseline that answers "does this beat what we already have"
- A competing-AI-vendor baseline that answers "is this the best buyable option"
- A human baseline that answers "where does automation become unsafe or uneconomic"
- A prior-AI-vendor-version baseline that answers "did the upgrade regress"
The ceiling appears when the buyer compares only against the AI vendor's chosen baseline. A vendor will pick a weak baseline, a public set, a stale model, or a metric where it shines, because that's what selling looks like when the buyer hasn't done the choosing. The named failure mode is the seller-picked baseline.
The eval is ungameable when baselines are buyer-selected, run blind where possible, frozen before any candidate runs, and rerun on the same hidden acceptance set after a candidate change.
If you start this week, pick the three baselines that matter for the decision in front of the team, freeze them on the current eval set, and reject any AI vendor pitch that proposes a different baseline mid-pilot.
Examples of what this looks like in production:
| Baseline | Question it answers | Stack |
|---|---|---|
| Current production system | Does the candidate beat what we already ship? | Braintrust experiments |
| Competing AI vendor | Is this the best buyable option? | Promptfoo matrix comparison |
| Prior version of same AI vendor | Did the upgrade regress against the same held-out set? | LangSmith regression test |
Granularity and Reporting
The work is structuring the output of the eval so the team can act on it. A single summary number tells the steering committee something while telling the team responsible for failures almost nothing.
The slot belongs to platforms that combine experiments, traces, comments, human review, and example-level inspection. Braintrust and LangSmith are strongest for cross-functional review. Inspect AI is strongest for engineering-grade reproducibility, weaker for business-user review unless the team builds reporting on top. Weave fits teams already in the W&B stack. Promptfoo is strong for CLI and CI reports.
What ships clean:
- Overall pass rate, category pass rates, severity-weighted score, p95 latency, cost per call, refusal rate, escalation rate, tool-call success, citation correctness, and critical-failure count, all on the same page
- Example-level failure inspection so the team can read what actually broke
- A procurement-shaped summary that opens with the decision, not the score
- A diagnostic layer for the team that owns remediation, separate from the executive view
The ceiling appears at single-number reporting. A green aggregate makes the buyer feel informed while hiding the failure class that triggers a board escalation later. The named failure mode is dashboard theater.
The eval is ungameable when the AI vendor can't choose the summary number, hide rare failures, or substitute a metric that looks better than the buyer's decision rule.
If you start this week, draft the procurement summary template for the next eval run, name the categories and severities that appear on page one, and reject any AI vendor report that arrives as a single number with no per-category breakdown.
Examples of what this looks like in production:
| Audience | Report shape | Stack |
|---|---|---|
| Function owner | Procurement summary with category passes, severities, costs, latencies | Braintrust |
| Remediation team | Per-example failures with input, output, judge rationale | LangSmith |
| Steering committee | Decision-first one-pager backed by per-category drilldown | Custom report on top of Braintrust API |
Adversarial and Safety Coverage
The work is testing the surfaces AI vendors don't show in demos: prompt injection, jailbreaks, unsafe requests, data leakage, policy bypass, harmful content, tool misuse, and adversarial edge cases that hit production but skip pitch decks.
The slot belongs to specialist red-team tooling stacked on a sandboxed runner. Garak is a vulnerability scanner that probes hallucination, prompt injection, data leakage, misinformation, toxicity, and jailbreaks. Promptmap sends attack prompts and uses a controller model to evaluate success. Promptfoo's red-team plugins cover prompt injection, jailbreak, data leakage, and tool misuse, with the caveat that OpenAI's acquisition introduces a governance footnote when the buyer is evaluating OpenAI. Inspect AI supports tool calling, agent evals, sandboxes, tool approval, and custom scorers, which makes it the default for controlled red-team work. HELM Safety and MLCommons AILuminate are useful public references, not substitutes for buyer-owned cases.
What ships clean:
- Adversarial cases drawn from the buyer's actual attack surface (real ticket text, real documents, real tool descriptions)
- Hidden canary cases that test prompt-injection compliance and data leakage without telling the AI vendor what's being tested
- A refresh process that adds new incidents and exploit patterns to the adversarial pool as they appear
- A separate severity track for safety failures, with zero-tolerance defaults
The ceiling appears when adversarial coverage is the AI vendor's safety card or a generic jailbreak leaderboard. Public safety benchmarks are reference points, not proof that the AI vendor handles the team's tool access, documents, and escalation rules. The named failure mode is theater coverage.
The eval is ungameable when adversarial cases come from the buyer's real attack surface, the AI vendor doesn't see the canary set, and the refresh process feeds real incidents into the pool.
If you start this week, pick the top three attack surfaces for the system being procured, write 20 adversarial cases per surface from real or realistic inputs, and add a five-case canary subset the AI vendor will never see.
Examples of what this looks like in production:
| Surface | Attack pool | Stack |
|---|---|---|
| Support bot | Prompt injection in ticket text, attached docs, and email threads | Garak plus Inspect AI |
| Browser agent | Credential, payment, destructive-action, and instruction-hierarchy cases | Promptfoo red-team plus Inspect AI |
| RAG application | Malicious documents that try to override the system prompt | Garak plus Promptfoo |
Procurement Lock: Decision Rules, Contracts, and Exit Rights
The work is converting eval results into business consequences. A beautiful eval report with no contract leverage produces leverage for the AI vendor, not the buyer.
The slot belongs to procurement and legal, working from the eval results before signature. Public guidance supports the building blocks. The Society for Computers and Law's AI clauses project says AI requirements are often better expressed as benchmarked measurable outcomes than as generic standards. The Forum for Cooperation on AI's procurement framework lists clauses covering purpose, data rights and retention, system performance, monitoring, incident-management SLAs, KPIs, upgrade approvals, lock-in management, and periodic audits. Shumaker's benchmark-clause article makes the procurement point directly: benchmark requirements convert AI promises into leverage for remediation, service credits, or exit rights.
What ships clean:
- Acceptance thresholds in a contract schedule: overall pass rate, category pass rates, zero-tolerance failures, p95 latency, cost ceiling, data-handling controls
- A regression definition with named thresholds and cure periods
- Data non-use clauses that prevent the AI vendor from training on buyer inputs, outputs, feedback, labels, or evaluation materials
- Audit and isolation rights including subprocessor disclosure, annotation workflows, and model-change notice
- Termination and exit clauses with prorated refunds, data export, and transition assistance when the eval regresses past the cure threshold
The ceiling appears when the eval produces a memo and the contract ignores it, leaving the buyer with the analysis and no leverage. The named failure mode is the eval that didn't make it to the contract.
The eval is ungameable when the AI vendor knows the decision rule but not the hidden eval set, and the contract gives the buyer remediation rights tied to private-eval regression.
If you start this week, pull the team's draft procurement clauses and add a Buyer Private Evaluation Set provision, a Model Change and Regression provision, and a Data Non-Use provision before any AI vendor sees the next round of negotiation.
Examples of what this looks like in production:
| Clause | What it does | Owner |
|---|---|---|
| Buyer Private Evaluation Set | AI vendor cannot see, train against, or score the held-out set | Procurement plus Legal |
| Model Change and Regression | Material regression triggers notice, cure period, and rollback rights | Legal |
| Data Non-Use | Inputs, outputs, evaluation materials, and labels excluded from training | Legal plus Security |
What the Human Owns Regardless of Tool
The function owner owns authorial intent. A platform can store examples; it can't know which failures the business can't tolerate, and it can't grade the ambiguity sitting in a customer's complaint.
Engineering owns dataset construction. A platform can store rows; it can't pull the right slice from production, redact sensitive fields, or build the stratification the team's actual customers require.
The applied-AI lead owns the taxonomy and the rubric. A model can judge against a rubric only after the rubric is written down, and writing the rubric is the work of the team that owns the budget for failure cleanup.
The applied-AI lead also owns the refresh cadence. A platform can schedule jobs; it can't decide when the market, the product, the policy, or the AI vendor have changed enough to invalidate the previous test.
Procurement and legal own the decision rule. A dashboard can show a score; the buyer has to decide whether that score means buy, reject, cure, renew, downgrade, or exit, and what the contract has to say to make any of those moves real.
Security owns the adversarial surface. A red-team plugin runs the attacks the security lead chose; the lead names which attacks the system actually faces in production, which canaries stay hidden, and which incidents flow back into the pool.
A team that hands taxonomy, rubric, cadence, and decision rule to the AI vendor stops owning procurement. The vendor sells the eval it can grade itself on, and the buyer is the only entity left that can decide what it actually buys.
Cost Calculus and Coexistence
Cheap on the platform line isn't cheap in procurement. Whether the team picks a free CLI tool with no human review and no procurement lock or a premium eval-of-record platform with no buyer-owned hidden set, the eval produces clean reports and an exposed buyer. The math below assumes a buyer running between 200 and 5,000 acceptance examples per procurement decision; enterprise programs push every number up.
Pricing sampled June 1, 2026:
| Platform | Entry tier | Mid tier | Enterprise |
|---|---|---|---|
| LangSmith | $0 (Developer, 1 seat, 5,000 traces) | $39 per seat per month (Plus) | Custom, self-host available |
| Braintrust | $0 (Starter, $10 credits, 1 GB) | $249 per month (Pro) | Custom, self-host data plane |
| Weave (W&B) | $0 (Free, evals + tracing) | Custom | Custom, Dedicated Cloud or Self-Managed |
| Promptfoo | $0 (Community CLI, 10,000 probes per month) | Custom (Enterprise SaaS) | Custom On-Prem |
| Inspect AI | $0 (open-source) | N/A | N/A (buyer-hosted) |
| Helicone | $0 (Hobby, 10,000 requests) | Usage-based (Pro, Team) | Custom |
| OpenAI Evals | Usage-based (API plus judge inference) | N/A | Platform-tied |
Then count what the platform line hides:
- Judge inference, which dominates the bill when LLM-as-judge runs at procurement scale
- Human review time, especially during calibration and double-labeling
- Self-hosting operations cost when the buyer runs a data plane internally
- Annotation vendor cost when outsourced labeling is required, with the leakage risk priced in
- Adversarial tooling time (Garak, Promptmap, Promptfoo red-team plugins) plus the security review they require
- Legal time for procurement-clause drafting and AI vendor negotiation
- The cost of being wrong, which is what the eval is supposed to prevent and rarely appears in the budget review until it has already happened
Five coexistence patterns capture most procurement setups:
- Single-platform engineering-only: best for small teams running one product surface. Pick Braintrust or LangSmith, run dataset, scoring, regression in one place. Spend lands at $0 to $300 a month; the human-review discipline lives in process, not tooling.
- Eval-of-record plus specialist red-team: best for most production teams with security exposure. Braintrust or LangSmith plus Promptfoo CLI or Garak in CI. Spend lands at $250 to $1,000 a month; adversarial coverage gets a separate review track.
- Eval-of-record plus human-labeling platform: best for teams with serious regulated workloads. Braintrust or LangSmith plus Label Studio or Argilla for inter-annotator agreement. Spend lands at $300 to $2,000 a month plus annotation labor.
- Self-hosted full stack: best for security-sensitive buyers and air-gapped environments. Inspect AI plus Label Studio plus Garak plus a self-hosted Braintrust or LangSmith data plane. Spend lands in operations time more than license fees; the savings is data residency, not dollars.
- Procurement-only acceptance program: best for buyers running annual AI vendor reviews more than continuous evaluation. Lightweight Promptfoo or Inspect AI for the bake-off, Label Studio for human review, and a contract template that ties acceptance to the eval results. Spend lands under $500 a month with the procurement load front-loaded.
Two platforms earn their seats when one carries the engineering eval workflow and the other carries the adversarial or human-labeling layer the first can't handle cleanly. They don't earn their seats when the second platform exists because nobody's owned the eval and "more tooling" feels safer than picking the discipline. A second platform can't decide which failures the business can't tolerate; it can only run the wrong eval with a different AI vendor logo.
Pitfalls and Anti-Patterns
Using a Public Benchmark as Procurement Proof
Treating MMLU, HumanEval, GSM8K, or a leaked AI vendor leaderboard as decision-grade evidence. Public benchmarks are scouting tools. Once they're famous enough to win on, they're famous enough to leak into training data and prompt-engineering folklore. The fix is the held-out acceptance set built on the buyer's production data.
Using the AI Vendor's Eval Set
Accepting the AI vendor's "internal benchmark" as a fair comparison. Even when the AI vendor is honest about the methodology, the set is vendor-authored evidence. A serious buyer treats it like the demo: useful for forming a hypothesis, never enough for signature.
Using an LLM Judge From the Same Vendor Family as the Model Under Test
Running OpenAI as the judge while evaluating OpenAI's GPT models, or Anthropic as the judge while evaluating Claude. MT-Bench research and OpenAI's own eval guidance both warn against the same-family judge pattern. The fix is to draw the judge from a different vendor family and calibrate against human-labeled samples.
Eval Set Too Small to Support the Decision
Running 30 examples and treating the result as procurement evidence. Eugene Yan's guidance uses the math: with 200 samples and a 3 percent defect rate, the 95 percent confidence interval is roughly 3 percent plus or minus 2.4 percentage points. 50 samples won't tell the team what it needs to know about the rare, catastrophic failures. The fix is sample sizing that matches the decision's blast radius.
Aggregate-Only Reporting
Reporting a single quality number and burying the per-category and per-severity detail. Aggregates make weak AI vendors look acceptable because catastrophic failures get averaged into the mean. The fix is procurement-shaped reporting that opens with the decision and preserves per-example failures underneath.
Leaking the Eval Through Shared Tools
Sharing the held-out set with the AI vendor's support team during a bug investigation. Sending the rubric to an outsourced annotation contractor without isolation. Uploading the prompt library to a shared cloud project. Sending the red-team report to the AI vendor's security desk. Each leakage channel turns the next eval into a less reliable test. The fix is named isolation boundaries and a written non-use clause in the vendor contract.
Letting the AI Vendor Tune Against the Acceptance Set
Treating "internal eval" and "private acceptance set" as the same thing. A development split the vendor sees is fine. An acceptance split the AI vendor never sees is structural. The fix is the three-way dataset split with hard storage boundaries.
Treating Eval as Post-Signing QA
Running the eval the first time after the contract is signed. By then the leverage is gone. The AI vendor's incentive shifts from winning the deal to managing the relationship. The fix is to make the eval the procurement document, not the post-procurement audit.
What to Validate Before Paying for the Stack
The pilot below tests the procurement program against a real AI vendor decision, not against an AI vendor demo. It produces measurable pass-fail gates and a defensible decision.
Before day one. Write the failure taxonomy with severity weights and signed-off zero-tolerance classes. Build the dataset and the three-way split. Decide which eval-of-record platform the team is willing to depend on, and confirm the held-out set lives in storage no AI vendor account can reach. Draft the procurement clauses in advance.
Week one: bake-off, scored honest. Run the current system, the candidate vendor, and one credible alternative against the held-out set. Calibrate every LLM judge against a human-labeled sample before allowing it into the acceptance run. Score per category, per severity, per example. Produce a procurement summary the function owner can read and decide on.
Week two: contract and decide. Map the eval results into the procurement clauses. Negotiate the acceptance threshold, the regression definition, the cure period, the data non-use language, and the exit rights. Don't sign until the contract reflects the eval. If the AI vendor refuses private evaluation, the AI vendor is failing procurement.
Buy only if the pilot wins. The pilot passes only when these gates hold:
- The held-out acceptance set scores meet the acceptance threshold for the candidate, with no Severity-1 failures
- The judge calibration cleared the agreement threshold against human labels
- The refresh cadence is set up and the production drift pipeline is running
- The contract reflects the acceptance threshold, the regression definition, the cure period, the data non-use clause, and the exit rights
- The team has a documented rollback path that returns to the prior system within the cure period
Fail the pilot if the eval platform can't:
- Show per-example inputs, outputs, judge rationale, and human labels for any failure
- Store the held-out set in isolation from the AI vendor's account
- Run the same eval against multiple candidates with identical scoring
- Surface regressions against a frozen baseline
- Export results into a procurement summary the function owner can act on without an AI vendor present
Methodology
Declared frame: evaluation is procurement, and the eval set is the buyer's evidence file. The dossier maps eight components of an ungameable eval against the tooling that supports each one, layers in pricing and isolation posture sampled June 1, 2026, and treats AI vendor benchmarks as adversarial evidence rather than friendly evidence. Sources consulted: vendor documentation and pricing pages for Braintrust, LangSmith, Weave, Promptfoo, Inspect AI, OpenAI Evals, and Helicone; LLM-as-judge research (MT-Bench, G-Eval, GPTScore, Prometheus); RAG-specific eval framework documentation (Ragas, DeepEval); human-annotation tooling documentation (Label Studio, Argilla, Snorkel); adversarial and safety frameworks (Garak, Promptmap, Promptfoo red-team, HELM Safety, MLCommons AILuminate); customer case studies (Dropbox on Braintrust, Notion on Braintrust, Surge AI on Anthropic); academic literature on benchmark contamination (OpenAI's GPT-4 Technical Report, HumanEval analyses); public procurement-clause guidance (Society for Computers and Law, Forum for Cooperation on AI, Shumaker benchmark-clause article); ARC Prize evaluation design as the public reference for semi-private and private acceptance scoring. In scope: procurement-grade evaluation programs for AI vendors covering models, agent platforms, copilots, support bots, and coding assistants where the buyer runs between 200 and 5,000 acceptance examples per decision and refreshes continuously from production. Out of scope: model-development evaluation for research labs, alignment evaluations against catastrophic risk thresholds (covered separately in lab responsible-scaling frameworks), and consumer-facing crowd-sourced rating systems.
Sources
- Braintrust — How Dropbox built an evaluation pipeline for AI search
- Braintrust — How Notion uses Braintrust to ship AI features faster
- Braintrust — Experiments and evals
- Braintrust — AI evaluations guide
- Braintrust — Self-hosting architecture
- Braintrust — Human review
- Braintrust — Pricing
- LangSmith — How to evaluate agentic applications
- LangSmith — Manage datasets in the LangSmith UI
- LangSmith — Dataset versioning
- LangSmith — Evaluation concepts
- LangSmith — Regression testing
- LangSmith — Pricing
- Promptfoo — Assertions and metrics
- Promptfoo — LLM rubric assertion
- Promptfoo — Red team configuration
- Promptfoo — Sharing eval results
- Promptfoo — Deployment options
- Promptfoo — Pricing
- OpenAI — OpenAI to acquire Promptfoo
- Inspect AI — Inspect AI documentation
- OpenAI — Evals
- OpenAI Cookbook — Evaluation guide
- OpenAI Help Center — Sharing feedback, evaluation and fine-tuning data, and API inputs and outputs with OpenAI
- OpenAI — GPT-4 Technical Report
- Weights & Biases — Weave evaluations
- Weights & Biases — Self-Managed deployments
- Weights & Biases — Pricing
- Helicone — Experiments
- Helicone — Open-source observability and integrations
- Helicone — Pricing
- Ragas — Available metrics
- Ragas — Align LLM as judge with expert judgments
- DeepEval — Introduction and metrics
- NVIDIA — Garak LLM vulnerability scanner
- Promptmap — Prompt injection and jailbreak testing
- MLCommons — AILuminate benchmark
- Stanford CRFM — HELM Safety
- Label Studio — Human consensus and inter-annotator agreement workflows
- Argilla — Argilla for human and AI feedback
- Snorkel AI — Programmatic labeling and labeling functions
- Surge AI — Anthropic RLHF case study
- Reuters — Google, Scale AI's largest customer, plans split after Meta deal, sources say
- Zheng et al. — Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
- Liu et al. — G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
- Fu et al. — GPTScore: Evaluate as You Desire
- Kim et al. — Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
- Zhang et al. — Examining Coding Performance Mismatch on HumanEval
- Society for Computers and Law — AI Group launches Artificial Intelligence Contractual Clauses
- Forum for Cooperation on AI — Risk Management Framework for Procuring AI Systems
- Shumaker — The Artificial Intelligence Benchmark: The Most Important Clause You've Never Used, Part 1
- Colin S. Levy — Contracting with AI Vendors: A Practical Guide for Lawyers
- Eugene Yan — Product evals for LLM applications
- ARC Prize — ARC-AGI private and semi-private evaluation design
Tools Mentioned
- Braintrust — Eval-of-record platform with datasets, custom and LLM-judge scorers, structured human review, immutable experiments, and self-hosted data plane for sensitive workloads. Starter free, Pro $249 per month, Enterprise customBraintrust
- LangSmith — LangChain's eval and observability platform with offline and online evals, dataset versioning, regression testing, and hybrid or self-host options on Enterprise. Developer free, Plus $39 per seat per month, Enterprise customLangSmith
- Weights & Biases Weave — Evaluation, tracing, and scoring inside the broader W&B stack with Dedicated Cloud and Self-Managed deployments. Free tier, Enterprise customWeights & Biases Weave
- Promptfoo — Config-driven eval and red-team tool with broad assertion library, CI integration, and On-Prem deployment. Community CLI free with limits, Enterprise SaaS custom, On-Prem custom. Acquired by OpenAI in 2026Promptfoo
- Inspect AI — UK AISI's open-source eval framework for code-defined tasks, agent evals, tool calling, sandboxes, and tool approval. Open-sourceInspect AI
- OpenAI Evals — OpenAI-hosted evals via dashboard and API, plus the open-source Evals repository. Usage-priced via OpenAI APIOpenAI Evals
- Helicone — Observability, gateway, and cost tracking platform; Experiments feature removed September 1, 2025. Hobby free, paid tiers usage-basedHelicone
- Ragas — RAG-specific eval framework covering context precision, context recall, faithfulness, response relevancy, tool-call accuracy, and agent goal accuracy. Open-sourceRagas
- DeepEval — Pytest-style LLM eval framework with G-Eval, DAG, QAG, and custom metrics. Open-sourceDeepEval
- Label Studio — Human-labeling platform with consensus workflows, inter-annotator agreement, and ground-truth export. Open-source core, Enterprise customLabel Studio
- Argilla — Feedback-dataset platform for domain experts, AI engineers, and annotators across NLP, LLM, RAG, and preference workflows. Open-source coreArgilla
- Snorkel AI — Programmatic labeling and weak supervision with labeling functions. EnterpriseSnorkel AI
- Garak — NVIDIA's open-source LLM vulnerability scanner covering hallucination, prompt injection, data leakage, misinformation, toxicity, and jailbreaks. Open-sourceGarak
- Promptmap — Open-source prompt-injection and jailbreak testing tool with controller-LLM scoring. Open-sourcePromptmap
- MLCommons AILuminate — Public safety benchmark covering 12 hazard categories across text and multimodal tasks. ReferenceMLCommons AILuminate
- HELM Safety — Stanford CRFM's public safety benchmark covering violence, fraud, discrimination, sexual content, harassment, and deception. ReferenceHELM Safety
- Surge AI / Scale AI — Outsourced human-annotation vendors for RLHF, evaluation labeling, and rubric calibration. Pricing customSurge AI / Scale AI
- Prometheus 2 — Open judge model for evaluating other language models, with task-specific calibration. Open-sourcePrometheus 2
Share


