[ATLAS] May 21, 2026 · 20 min read

AEO Vendor Selection by Falsifiability

A reference dossier for the CMO, growth lead, or founder running an RFP across three or four AEO vendors: the six artifacts a measurement frame requires, the vendors that can produce each artifact today, and where the buyer has to step back in regardless of which platform shows up on the demo.


TL;DR

AEO vendors aren't one category; they sell measurement, monitoring, prompt discovery, content optimization, technical SEO, and revenue attribution under the same label. The decision isn't "which AEO vendor is best," it's what specific claim you're asking the vendor to prove and which test frame would falsify the claim. Most vendors can support monitoring; only some can support a defensible prompt and model frame; none should be allowed to define the business claim. For a brand running a serious AEO pilot, the appropriate spend is roughly $200 to $2,500 a month for measurement, plus $5,000 to $20,000 a month if optimization or agency execution is in scope.

AEO Vendors Are Not One Category

A $90M ARR B2B SaaS company runs an RFP across four AEO vendors and signs the contract on the strength of a clean chart: "AI visibility up 38 percent in 30 days." After the deal closes, the team asks for the exact prompt list and run dates and discovers the list isn't static; the vendor says prompts are "dynamically generated to reflect the market." When the marketing team reruns the measurement against the company's own buyer-intent prompts, the lift disappears. The company didn't buy a measurement system; it bought a chart.

A skincare brand signs an $18,000-a-month optimization retainer. Over the next quarter the agency overhauls product category pages, layers FAQ schema onto the catalog, and publishes three head-to-head comparison articles. The dashboard climbs: AI citations are up 2.5x, and the ecommerce team is ready to renew. Then a junior analyst pulls the dashboard's own audit log and finds something unsettling. During the same window the vendor had added Gemini and Google AI Overviews to the measurement panel and expanded its prompt database into high-volume informational queries, while category-wide AI traffic had risen across every benchmark the agency tracks. Nobody had carved off a held-out group of untouched pages to compare against. The optimization work might have moved the chart; the dashboard couldn't say.

A vertical SaaS founder watches weekly citation rate climb from 6 percent to 19 percent and tells the board the new AEO content cluster is paying off. The board comes back with one question: why hasn't pipeline moved with the citations? The answer is in the CSV. Over the same window the dashboard provider had quietly broadened its measurement panel from mostly-ChatGPT to a mix that included Perplexity and Google AI Overviews, both of which cite source URLs far more readily. The brand's citation share didn't go up. The measurement floor did.

Three teams, three failures, one category mistake. AEO hasn't become one tool; it has become a stack of single-claim tools that share a measurement label.

A measurement frame that doesn't disclose its prompt set, its model mix, its timestamp, its control, and its claim shape isn't measurement; it's a chart. Monitoring citations doesn't automatically improve them, improving mentions doesn't automatically prove qualified traffic or conversion, and reporting AI-referred conversions doesn't automatically prove the vendor's actions caused them. The mistake isn't the vendor, it's the absence of a falsifiable test frame.

The right question isn't which AEO vendor is best, because the market sorts itself differently every quarter as vendors add surfaces, change pricing, and adjust their methodology pages. The right question is sharper: what claim am I asking this vendor to prove, what evidence would falsify it, and which category of vendor can actually produce that evidence today?

Six artifacts matter for a meaningful AEO test frame: the prompt set, the surface and model mix, the timestamp and refresh cadence, the control, the claim shape, and the bill of materials. Each has a vendor or vendor category that produces it cleanly today, a ceiling beyond which the public evidence stops being convincing, and a return point where the work belongs to the brand regardless of platform.

Figure 1 — Where each vendor passes the falsifiability tests today. Filled dots mark a test the vendor passes cleanly in its public evidence; hollow dots mark partial or conditional evidence; empty cells mean no public evidence on that test.

A Four-Week Pilot

The temptation when the AEO market moves is to sign with the vendor whose chart climbed the fastest, but every vendor produces a chart that climbs. The pilot below trades the demo chart for an evaluation that survives the contract.

Week 1: Lock the prompt set. Before any vendor is paid, the team writes a private prompt set of 70 to 100 entries (at minimum 30 buyer-intent queries, 20 category and comparison queries, 10 risk and pricing queries, and 10 brand-name queries), each tagged with source, intent, geography, and language. By Friday, the prompt set is frozen and the team can answer one question: "if a vendor disclosed nothing else, would this set let us judge their measurement?"

Week 2: Decompose the model mix. Each candidate vendor returns a model-mix ledger covering AI surface queried, capture path, country, language, run cadence, first run date, last run date, whether each surface is included in aggregate scores, and how panel changes are annotated. By Friday, the team has scored each vendor on whether their aggregate score can be decomposed by surface, and anything that can't is rejected for the measurement role.

Week 3: Hold a real control. The team selects 30 prompts from the frozen set to receive vendor action (content updates, schema, FAQ blocks, whatever the vendor proposes) and holds 30 to 70 of the remaining prompts as control. Both groups run on the same model mix on the same cadence, and the team logs the timestamp and the model panel for the whole window. By Friday, the team knows what would count as a passing result and what would count as market drift.

Week 4: Lock the claim shape. The team picks the claim that matters to the business (citation share inside the frozen set, conversion-relevant citation only, AI-referred traffic on monitored URLs, or pipeline lag) and requires the vendor to produce evidence against that specific claim shape, not against every chart it can generate. By Friday, the team decides whether the vendor moved the chosen claim on the action group beyond market drift in the control group.

Skipping a week is permitted. Skipping Week 1 isn't. The rest of this dossier walks through the six artifacts the pilot rests on, names the vendors that produce each one cleanly today, and inventories the cost pattern that earns its keep.

The Six Falsifiability Tests

The artifact decides the vendor, not the other way around. Each test below gets the same treatment: who owns the slot, what ships clean from the vendors that produce it, where the public evidence stops being convincing, and the action plan for week one of an RFP.

The Prompt Set

The slot belongs to the brand. The vendor can suggest prompts, mine prompts, or cluster prompts, but it can't be allowed to define the prompt set that proves the business claim. A clean prompt set carries exact text, prompt source, intent tag, brand or non-brand tag, geography, language, an inclusion rule, and a flag for whether prompts are frozen for the pilot or rotating.
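The field list above, combined with the Week 1 quotas from the pilot, can be checked mechanically before the set is frozen. The following is a minimal Python sketch under stated assumptions: the field names, the intent tag names, and the function `check_prompt_set` are all hypothetical, and each prompt is assumed to be a plain dictionary.

```python
from collections import Counter

# Minimum counts per intent, taken from the Week 1 description of the pilot.
# The tag names themselves are illustrative assumptions.
MIN_PER_INTENT = {
    "buyer_intent": 30,
    "category_comparison": 20,
    "risk_pricing": 10,
    "brand_name": 10,
}

# Fields a clean prompt record carries; the key names are assumptions.
REQUIRED_FIELDS = (
    "text", "source", "intent", "branded",
    "geography", "language", "inclusion_rule", "frozen",
)


def check_prompt_set(prompts):
    """Return a list of problems; an empty list means the set can be frozen."""
    problems = []
    if not 70 <= len(prompts) <= 100:
        problems.append(f"set has {len(prompts)} prompts, expected 70 to 100")
    for i, prompt in enumerate(prompts):
        # A field counts as missing when it is absent or blank (False is valid).
        missing = [f for f in REQUIRED_FIELDS if prompt.get(f) in (None, "")]
        if missing:
            problems.append(f"prompt {i} is missing: {', '.join(missing)}")
    counts = Counter(prompt.get("intent") for prompt in prompts)
    for intent, minimum in MIN_PER_INTENT.items():
        if counts[intent] < minimum:
            problems.append(f"{intent}: {counts[intent]} prompts, need {minimum}")
    return problems
```

An empty result is only a structural pass. Whether the prompts represent the business is still a judgment the team makes by reading them.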

What ships clean:

Otterly lets customers define a prompt library and runs it across engines.
Peec AI lets customers track prompts daily and segment by model, country IP, prompt tag, persona, and funnel stage.
Ahrefs Brand Radar supports custom prompts with operator-selected AI assistants, location, and refresh frequency.
Profound lets customers upload prompts and filter by prompts, platform, brand, region, language, and date.
Goodie AI recommends prompts at onboarding but lets the customer select, adjust, and swap them.
Semrush Position Tracking accepts a custom set of prompts alongside its database.

The ceiling appears at vendor-generated prompt databases. Semrush and Ahrefs each ship strong synthetic prompt sets that are useful for discovery, but they aren't automatically the brand's test frame. A prompt database can answer "where is the brand visible across a broad synthetic market?" It can't answer "did our AEO program improve buyer-relevant citation rate?" unless the buyer-relevant subset is frozen, labeled, and held separately from the discovery database.

If you start this week, write an RFP question that asks every candidate vendor for a CSV containing the prompt text, the source of each prompt, an intent tag, and a clear yes/no on whether the set will be held constant during the pilot. Walk away from any vendor whose answer is that prompts are proprietary, dynamic, or only visible inside the dashboard.
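Once a vendor returns that CSV, the disclosure can be verified rather than read. The sketch below assumes hypothetical column headers (`prompt_text`, `source`, `intent`, `frozen`) and a yes/no value in the `frozen` column; real exports will differ and the names should be adjusted. The fingerprint is an added convenience, not something any vendor is known to provide: hashing the sorted prompt texts lets the team confirm later that the set was in fact held constant.

```python
import csv
import hashlib
import io

# Assumed header names; adjust to whatever the vendor's export actually uses.
REQUIRED_COLUMNS = {"prompt_text", "source", "intent", "frozen"}


def load_vendor_prompts(csv_text):
    """Parse a vendor prompt export and fail loudly if disclosure is incomplete."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"export is missing columns: {sorted(missing)}")
    rows = list(reader)
    rotating = [r for r in rows if r["frozen"].strip().lower() != "yes"]
    if rotating:
        raise ValueError(f"{len(rotating)} prompts are not frozen for the pilot")
    return rows


def fingerprint(rows):
    """Hash the sorted prompt texts so any later edit to the set is detectable."""
    joined = "\n".join(sorted(r["prompt_text"].strip() for r in rows))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

Recording the fingerprint on the day the set is frozen and recomputing it from the final export gives a cheap check that no prompt was added, dropped, or reworded during the pilot.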

The Surface and Model Mix

The slot belongs to the vendor, but the decision belongs to the brand. The dozen-or-so answer engines an AEO platform might query (ChatGPT, Perplexity, Gemini, Claude, Google AI Overviews and AI Mode, Copilot, Grok, Meta AI, DeepSeek, Amazon Rufus, and the long tail) behave very differently. Some cite source URLs aggressively; some surface brand mentions without any link at all; some lean on retrieval and some don't. Rolling all of that into a single visibility number erases the mechanism the team is trying to read.
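A short worked example shows how a single blended number hides the mechanism. The figures below are invented for illustration and the surface labels are placeholders: every surface keeps exactly the same citation rate in both weeks, and only the share of runs per surface changes.

```python
def aggregate_rate(panel):
    """panel maps surface -> (runs, cited_runs); returns the blended citation rate."""
    runs = sum(r for r, _ in panel.values())
    cited = sum(c for _, c in panel.values())
    return cited / runs


def per_surface_rates(panel):
    """Citation rate for each surface on its own."""
    return {surface: cited / runs for surface, (runs, cited) in panel.items()}


# Hypothetical panels. ChatGPT cites at 5 percent and Perplexity at 30 percent
# in both weeks; the later panel simply runs far fewer ChatGPT queries.
week_1 = {"chatgpt": (900, 45), "perplexity": (100, 30)}
week_8 = {"chatgpt": (400, 20), "perplexity": (300, 90), "ai_overviews": (300, 75)}

if __name__ == "__main__":
    print(f"week 1 blended: {aggregate_rate(week_1):.1%}")
    print(f"week 8 blended: {aggregate_rate(week_8):.1%}")
    print(per_surface_rates(week_1))
    print(per_surface_rates(week_8))
```

With these numbers the blended rate rises from 7.5 percent to 18.5 percent while no individual surface moves at all, which is why an aggregate that can't be decomposed by surface can't be read.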

What ships clean:

Profound publishes a broad surface list (ChatGPT, Perplexity, Claude, Gemini, Google AI Overviews, Copilot, Grok, Amazon Rufus, Meta AI, and DeepSeek) and emphasizes that its capture path runs against the real user-facing answer engines rather than through API endpoints.
Peec AI exposes per-model selection and daily runs, with country-IP filtering as a first-class option.
Goodie AI tiers its model coverage explicitly. Lower tiers cover the major surfaces (ChatGPT, AI Overviews, Perplexity, Gemini, Copilot, Rufus), and higher tiers add Claude, AI Mode, Meta, DeepSeek, and Grok.
Ahrefs documents the indexed surface set behind Brand Radar and its custom-tracking surfaces.
Semrush AI Visibility lists ChatGPT, Gemini, Perplexity, Google AI Overviews, and AI Mode as the covered set.
Otterly's stated surfaces are ChatGPT, Google AI Overviews, Perplexity, and Copilot, with Gemini and AI Mode available as add-ons.

The ceiling appears at model version pinning. Public vendor docs disclose surfaces, not exact model versions, and that's partly unavoidable: consumer-facing answer engines don't expose stable version IDs. It isn't an excuse to hide panel changes; at minimum, the vendor must annotate when a surface was added, removed, or materially changed during the measurement period.

If you start this week, ask each candidate for a model-mix ledger covering surface, capture path, country, language, run cadence, first run date, last run date, and whether the surface is included in aggregate scores, and reject any aggregate score that can't be decomposed by surface.
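The ledger described above maps naturally onto one record per surface. The sketch below is a hypothetical structure, not any vendor's export format: `LedgerEntry` and `panel_changes` are illustrative names, and the check simply lists surfaces that feed the aggregate score but did not run for the full pilot window.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LedgerEntry:
    surface: str
    capture_path: str   # free text, recorded in the vendor's own wording
    country: str
    language: str
    cadence: str        # e.g. "daily" or "weekly"
    first_run: date
    last_run: date
    in_aggregate: bool  # whether this surface feeds the aggregate score


def panel_changes(ledger, window_start, window_end):
    """Surfaces that feed the aggregate score but did not run for the whole window."""
    return [
        entry.surface
        for entry in ledger
        if entry.in_aggregate
        and (entry.first_run > window_start or entry.last_run < window_end)
    ]
```

Any surface this returns either needs to be excluded from the pilot's aggregate or reported separately, since its arrival or departure moves the score independently of the brand.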

The Timestamp and Refresh Cadence

The slot belongs to the measurement platform. AEO visibility decays, and it also jumps when models update, retrieval behavior changes, crawlers re-index, or the vendor adds a surface. A screenshot without a timestamp is not evidence; a score without historical raw outputs is not an audit trail.

What ships clean:

Peec AI commits to a 24-hour refresh interval as part of its standard offering.
Profound positions Answer Engine Insights as a daily-tracking product with date-filtered views and raw CSV export.
Otterly markets daily tracking across its current paid tiers.
Ahrefs's custom prompt tracking offers daily, weekly, or monthly cadences depending on the package.
Semrush splits the cadence by product: Brand Performance refreshes weekly, while custom Position Tracking supports daily updates against the team's own prompts.
Goodie AI maps refresh cadence to its pricing tiers, with daily collection on the higher tiers.
seoClarity describes week-over-week visibility changes and a real-time citation monitor as part of its AI search surface.

The ceiling appears at vendor-side panel changes that aren't annotated. The brand's chart can jump 10 points overnight because the vendor added a new surface or recalibrated capture, and without an annotated change log, the dashboard becomes uninterpretable across vendor releases. The vendor must publish a change log of measurement-affecting updates.

If you start this week, ask for the vendor's measurement-affecting change log for the last twelve months plus a screenshot of how panel changes are surfaced inside the dashboard. If panel changes don't appear, the dashboard isn't reporting; it's painting.
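With a change log in hand, score jumps can be matched against it. A minimal sketch, assuming the team exports the score history as dated values and transcribes the change log as dated notes; `annotate_jumps` and the 10-point default threshold are illustrative choices, not a vendor feature.

```python
from datetime import date


def annotate_jumps(scores, change_log, threshold=10.0):
    """Match large score moves to measurement-affecting changes.

    scores: list of (date, score) sorted by date.
    change_log: list of (date, note) from the vendor's change log.
    Returns (date, delta, notes) for every move of at least `threshold` points;
    an empty notes list means the jump has no logged explanation.
    """
    flagged = []
    for (prev_day, prev_score), (day, score) in zip(scores, scores[1:]):
        delta = score - prev_score
        if abs(delta) >= threshold:
            notes = [note for when, note in change_log if prev_day < when <= day]
            flagged.append((day, delta, notes))
    return flagged
```

A jump with a matching note is a measurement event. A jump with no note is either real movement or an unannotated change, and the vendor should be asked which.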

The Control

The slot belongs to the brand. When the vendor claims "lift," the team needs to see the unmoved version. Without a frozen control group of prompts or pages that didn't receive vendor action during the same period, the chart is correlation, not causality.

What ships clean: very little, publicly. Most vendor case studies report before-and-after charts inside the customer's account without a frozen control:

Peec AI's public materials show clean methodology and customer outcomes but don't constitute causal proof.
Profound's published case studies show outcomes without independent control.
Goodie AI's customer growth charts are presented without control.
BrightEdge publishes research with explicit measurement windows and sample sizes, which is the closest thing to a public control posture in the enterprise SEO category.

The ceiling appears at the optimization layer. End-to-end AEO platforms that combine measurement with content and schema changes can rarely produce a clean control without restructuring the engagement: optimize half the prompt set, hold the other half. Most vendors aren't set up to deliver work this way, and most customers don't ask.

If you start this week, write the control into the engagement before signing. The contract should state that 30 prompts receive vendor action, 30 to 70 prompts are held as control, both groups run on the same model mix on the same cadence, raw outputs are stored on the brand's side, and the vendor produces a control-versus-action delta as the deliverable rather than a top-line chart.
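The control-versus-action delta named as the deliverable can be computed from the raw outputs stored on the brand's side. The sketch below makes two assumptions that go beyond the text: prompts are assigned to the action group at random with a fixed seed (one way to avoid cherry-picking, not a requirement stated above), and every prompt not in the action group is treated as control. The delta is a simple difference-in-differences on citation share, and all function names are hypothetical.

```python
import random


def split_prompts(prompt_ids, n_action=30, seed=2026):
    """Randomly assign n_action prompts to vendor action; the rest are control."""
    ids = sorted(prompt_ids)
    rng = random.Random(seed)  # fixed seed so the assignment can be re-derived
    rng.shuffle(ids)
    return ids[:n_action], ids[n_action:]


def citation_share(runs, group):
    """runs: iterable of (prompt_id, cited). Share of cited runs within the group."""
    members = set(group)
    hits = [bool(cited) for prompt_id, cited in runs if prompt_id in members]
    return sum(hits) / len(hits)


def control_vs_action_delta(before, after, action, control):
    """Action-group change in citation share minus control-group change."""
    action_change = citation_share(after, action) - citation_share(before, action)
    control_change = citation_share(after, control) - citation_share(before, control)
    return action_change - control_change
```

A delta near zero means the action group moved no more than the untouched prompts did, whatever the top-line chart shows.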

The Claim Shape

The slot belongs to the brand. AEO claims sound similar but aren't interchangeable: mention rate, citation rate, citation rank, share of voice, sentiment, AI-referred traffic, conversion, and causal lift each have different proof burdens, and a vendor that has proof of one is often selling proof of all of them.

What ships clean: vendors with narrow, declared claim shapes survive scrutiny better than vendors with broad ones.

Peec AI focuses on measurement and segmentation.
Profound focuses on Answer Engine Insights with Agent Analytics for AI-driven traffic capture.
Otterly focuses on daily tracking and operator-defined prompts.

These are measurement claims; they don't promise causal lift.

The ceiling appears at conversion and revenue attribution. Athena HQ's public pages claim ROI and lead generation outcomes, Goodie AI's customer growth charts imply attribution, and Daydream sells implementation against pipeline outcomes. These claim shapes are the highest-burden ones in the category and the least likely to be publicly defended, so treat any conversion-attribution chart as a hypothesis, not a result, until the vendor produces a frozen control and a versioned model mix.

If you start this week, write down the single claim the business needs proven, with concrete examples like "buyer-intent citation share inside our frozen set rises 25 percent over four weeks while the control set stays flat," or "AI-referred sessions on monitored URLs rise 40 percent and convert at parity with organic." Reject any chart that doesn't speak directly to that claim.

The Bill of Materials

The slot belongs to the vendor's disclosure. AEO vendors fall into three categories: measurement-first platforms (Profound, Peec AI, Otterly, Ahrefs Brand Radar, Semrush AI Visibility), enterprise SEO platforms with AEO surfaces (BrightEdge, seoClarity, Conductor, Semrush, Ahrefs), and end-to-end optimization and service layers (Athena HQ, Goodie AI, Daydream, Writesonic). Each requires a different test.

What ships clean: measurement-first platforms publish clear bill-of-materials sections.

Otterly lists prompt management, surface coverage, refresh cadence, and exports.
Peec AI lists prompt management, segmentation, model selection, exports, and integrations.
Profound publishes Answer Engine Insights, Conversation Explorer, Agent Analytics, and Citation Analyzer as discrete products.

The ceiling appears at suite-level abstraction. Enterprise SEO platforms (BrightEdge, seoClarity, Conductor) deliver AEO inside a larger toolset, and AEO can be buried inside dashboards designed for traditional SEO; the team has to insist the AEO surface be exported and audited separately. End-to-end service layers blur the line between measurement and implementation, and the deliverable becomes whatever the agency decides it is unless the contract names artifacts.

If you start this week, require a one-page bill of materials from every candidate covering measurement, monitoring, prompt discovery, content recommendations, content generation, technical SEO, schema work, crawler analytics, agency execution, and revenue attribution, with each line marked as "in tier," "add-on," or "not offered." Tier-mix surprises are the most common reason an RFP comparison fails after signing.
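The one-page bill of materials is easiest to compare across vendors when each response is reduced to the same ten lines and three statuses. A minimal sketch, assuming each vendor's answer is transcribed by hand into a dictionary; `validate_bom` and `tier_gaps` are hypothetical helper names.

```python
# The ten lines and three statuses come from the RFP requirement above.
CAPABILITIES = [
    "measurement", "monitoring", "prompt discovery", "content recommendations",
    "content generation", "technical SEO", "schema work", "crawler analytics",
    "agency execution", "revenue attribution",
]
STATUSES = {"in tier", "add-on", "not offered"}


def validate_bom(bom):
    """Return problems with one vendor's bill of materials (capability -> status)."""
    problems = [f"missing line: {c}" for c in CAPABILITIES if c not in bom]
    problems += [
        f"{c}: unrecognized status {s!r}" for c, s in bom.items() if s not in STATUSES
    ]
    return problems


def tier_gaps(bom, needed):
    """Capabilities the team needs that are not included in the quoted tier."""
    return {c: bom.get(c, "undisclosed") for c in needed if bom.get(c) != "in tier"}
```

Running `tier_gaps` with the same `needed` list against every candidate surfaces the add-ons before signing rather than after.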

Where the Operator Steps Back In Regardless of Vendor

The line isn't "when the vendor's chart looks wrong." Charts look wrong all the time, and the team catches it the same way it always has. The line is when the vendor is being asked to prove something only the brand can prove: that the prompts represent the business, that the claim represents the buying process, and that the lift represents customer behavior rather than market drift. AEO platforms produce default-plausible measurement across every surface, which is fine for discovery and category mapping but fails at the surfaces where the business claim or the budget defense is the differentiator.

The growth lead owns the prompt set. A vendor's discovery prompts answer a synthetic market question, not the brand's market question; that one requires a frozen, labeled, buyer-intent subset that someone inside the company wrote. Profound, Peec AI, Otterly, Ahrefs, and Semrush each let customers upload custom prompts, but whether the team uses that feature and whether the buyer-intent subset is held separate from discovery is an inside decision.

The team owns the model-mix decision. Profound captures from front ends, Peec AI captures via direct query, and Semrush and Ahrefs index aggregate surfaces, and each method produces different citation rates for the same brand on the same prompts. Aggregating them isn't a measurement; it's a choice that needs to be made explicitly and documented. The vendor lists the surfaces; the team decides which ones the business actually cares about.

The CMO owns the claim shape. "AI visibility" isn't a business claim; "buyer-intent citation share on prompts that map to current-quarter pipeline" is. The vendor will report the broadest claim the dashboard supports, and the business narrows it to the one that the renewal will be judged against. Athena HQ, Goodie AI, and Daydream each report multiple claim shapes, and only the buyer can pick which one the contract is judged by.

Trademark, factual accuracy, and reputational risk stay with the brand. AEO platforms surface what models are saying about the brand, but they don't fix incorrect, defamatory, or trademark-violating content; the brand team and legal team own takedown requests, source-page corrections, and engagement with answer-engine support channels. The platform reports; the brand responds.

Finance owns the budget defense. AEO spend has no industry-accepted unit economics, and a pilot that climbs a chart isn't automatically a pilot that earned its line item. The defense is the frozen control delta, the buyer-intent citation lift, the AI-referred conversion comparison, or the pipeline contribution evidence; it isn't the dashboard screenshot. Whoever signs the renewal owns that evidence, not the vendor that produced the chart.

Cost Calculus and Coexistence

The candidate paid stacks for an AEO program:

The minimal measurement stack: Otterly Pro at $189 per month or Peec AI base tier (roughly $200 a month equivalent), plus an in-house prompt set. Total around $200 a month, measurement only, with the team owning the prompt set and the claim shape. Right starting point for any brand running its first AEO pilot.
The serious measurement stack: Profound Enterprise (custom pricing, typically $2,500 to $7,500 a month at the enterprise tier), plus an in-house prompt set, plus an internal analyst owning the model-mix ledger. Right starting point when AEO measurement is treated as a strategic surface rather than a content-marketing add-on.
The integrated SEO stack: Semrush One Starter at $199 a month, Pro+ at $299, or Advanced at $549, or Ahrefs Brand Radar AI starting at $199 with custom prompt packages from $50 a month. Right starting point when AEO measurement must live inside the existing SEO operating system.
The enterprise platform stack: BrightEdge or seoClarity at custom enterprise pricing, typically starting at $2,500 a month for seoClarity and custom for BrightEdge. Right starting point when the brand already runs enterprise SEO and AEO is one workflow among many.
The optimization or agency add-on: Athena HQ from $295 a month self-serve, Goodie AI custom, Daydream service retainer commonly $15,000 a month and up. These are execution layers, not measurement layers; they don't replace the measurement stack, they sit on top of it. Buy only if the team has the bandwidth to monitor a control group and the contract names a controlled deliverable.

Don't pay an end-to-end optimization layer (Athena HQ, Goodie AI, Daydream) without a separate measurement-first vendor capturing the control group. The two have to be different vendors, or the same vendor has to commit in writing to a held-out control; otherwise the optimization layer is measuring its own work.

Pitfalls and Anti-Patterns

Buying on Demo Charts

Every vendor's demo chart climbs because the chart is built on the vendor's prompt set, often on prompts the vendor knows the brand has content for, against a panel of surfaces the vendor knows generate citations. Demand the same chart against your own prompt set on your own model mix before the contract is signed.

Mistaking Optimization for Causality

A vendor that rewrites pages, adds schema, and ships comparison content while the dashboard climbs has not proven its work caused the climb. A model update, a panel change, a category-wide AEO trend, or seasonal traffic patterns could each explain the move. Causal claims require a frozen control; treat optimization spend as a hypothesis until the control delta is reported.

Letting the Vendor's Model Mix Change Without Notice

The vendor adds Gemini coverage in March, recalibrates the Perplexity capture in May, and ships an AI Mode integration in July, and the dashboard score moves at each step. None of those moves are evidence of brand performance, and without a measurement-affecting change log, the dashboard is uninterpretable across releases.

Confusing Share of Voice With Conversion-Relevant Citation

Share of voice across a synthetic market is a discovery metric; conversion-relevant citation on buyer-intent prompts is a business metric. The vendor's flagship chart usually reports the first because the numerator is larger and easier to move, but the renewal conversation has to be about the second.

Letting the Vendor Define the Metric

If the vendor's dashboard names the metric, the dashboard's success criteria are baked into the metric. "AI Visibility Index" is a vendor-defined composite, not a business claim. Pick the business claim before the RFP, and force every vendor to report against that claim rather than its preferred composite.

Paying for a Suite When the Team Needs a Specialist (or Vice Versa)

A two-person growth team paying for BrightEdge enterprise wastes nine-tenths of the platform, while a 400-person SEO program running on Otterly Pro can't scale operations through the platform's workflow. Match the tier to the team and the surface area; tier-mix mismatch is the most common AEO procurement failure that survives the demo.

What to Validate Before Paying for the Stack

The pilot above tells the team what to run. The criteria below tell the team what to accept. Three pass/fail rubrics, one for each decision point.

Vendor-disclosure pass/fail. Each candidate vendor must produce five artifacts before any contract is signed: a CSV of the proposed prompt set with text, source, intent tag, and a frozen-versus-rotating flag; a model-mix ledger with surface, capture path, country, language, cadence, first run, last run, and aggregate-score inclusion; a sample raw-output export for one prompt across the disclosed surfaces; a measurement-affecting change log from the last twelve months; and a clear bill of materials marking each line item as in-tier, add-on, or not offered. Eliminate any vendor that withholds any of the five.

Pilot pass/fail. The action group's claim metric must beat the control group's by more than the surface-volatility floor, typically 10 to 15 percent on weekly citation share, during a window with no annotated panel change. A lift below that floor is market drift, so the vendor didn't move the metric. A lift above the floor during a window with an annotated panel change is contaminated, so the vendor moved with the market. A lift above the floor without an annotated panel change is the only outcome that counts.
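The three outcomes in the pilot rubric reduce to a small decision rule. The sketch below is one reading of it, with assumptions stated: changes are expressed as fractions in the same unit as the floor (0.25 for a 25 percent rise), the default floor of 0.10 is the low end of the range cited above, and `pilot_verdict` is a hypothetical name.

```python
def pilot_verdict(action_change, control_change, floor=0.10,
                  panel_change_annotated=False):
    """Classify a pilot result into the three outcomes of the pilot rubric.

    action_change, control_change: change in the claim metric for each group,
        in the same unit as `floor` (assumed here to be a fraction).
    floor: the surface-volatility floor the lift has to clear.
    panel_change_annotated: True if the vendor logged a panel change in the window.
    """
    lift = action_change - control_change
    if lift <= floor:
        return "drift"         # at or below the floor: market drift
    if panel_change_annotated:
        return "contaminated"  # above the floor, but the panel changed
    return "pass"              # above the floor with a stable panel


if __name__ == "__main__":
    print(pilot_verdict(0.25, 0.00))
    print(pilot_verdict(0.08, 0.02))
    print(pilot_verdict(0.30, 0.05, panel_change_annotated=True))
```

The floor itself should be fixed before the pilot starts, ideally from the control group's own week-to-week variation, so that it can't be tuned after the result is known.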

Renewal pass/fail. The renewal conversation runs on three artifacts: the four-week control-versus-action delta on your declared claim, the surface-decomposed score showing which engines drove the change, and the cost-per-pipeline-conversion or cost-per-claim-met evidence the team can defend to finance. Anything outside those artifacts is dashboard theater.

Key Takeaways

AEO vendors aren't one category; they're three, and the vendor type has to match the claim type.
The prompt set, the model mix, the claim shape, and the budget defense belong to the buyer regardless of vendor.
A measurement frame without a prompt set, a model mix, a timestamp, and a control isn't measurement; it's a chart.
Vendor-side panel changes can move a dashboard score without any change in brand performance, so demand a change log.
Optimization layers don't replace measurement layers; they sit on top of them and need a separate control.
"AI Visibility Index" and similar vendor composites aren't business claims, so pick the claim before the RFP.
A pilot that climbed a chart but didn't beat a control didn't pass, and the renewal conversation has to run on control deltas rather than dashboards.

Methodology

This dossier evaluates ten primary vendors (Profound, Athena HQ, Goodie AI, Otterly, Daydream, Peec AI, BrightEdge, seoClarity, Semrush AI Visibility, Ahrefs Brand Radar) plus four adjacent entries (Writesonic AI Visibility, Conductor, AIPRM, and the broader group of enterprise SEO platforms with AEO add-ons). Pricing was sourced from vendor pricing pages, public comparison guides, and tier documentation accessed 21 May 2026, and measurement posture was sourced from vendor methodology pages, public case studies, and security and compliance disclosures. Public evidence isn't the same as RFP evidence; a vendor that doesn't pass cleanly in this dossier may still pass in a private procurement process by producing artifacts under NDA that aren't posted publicly. The dossier doesn't run independent benchmarks; it reports the disclosure posture each vendor publishes and the buyer-evaluation tests required to confirm that posture inside a real pilot.
