Boundary map
May 9, 2026

The Router Is The Deployment

Where the line sits between premium frontier APIs, hosted open-weight endpoints, and self-hosted stacks. Token price is a procurement metric; cost per resolved task is the metric that actually moves the decision.

Why the boundary is the whole game

The most important model question for an operator in 2026 is not "which model is best." It is which workloads belong on which tier of the model layer.

The default of 2024 and 2025 was simple. Send everything to OpenAI, Anthropic, or Google. Wrap it in a retrieval layer. Build governance around prompts, tools, and access. Self-hosting was for residency-bound buyers, deep-infra teams, or unusual volume.

That default is now weaker for a reason that is easy to misread: it isn't that self-hosting got cheap, it's that hosted open-weight endpoints did. DeepSeek V4-Flash lists at $0.14 input and $0.28 output per million tokens. Groq lists Llama 4 Scout at $0.11 input and $0.34 output. GPT-OSS-120B is on Together at $0.15 input and $0.60 output. The cliff is real, and it's hosted, not DIY.

Figure 1 — The cliff in dollars per million output tokens. Listed output prices as of April 30, 2026. The cliff is between the cheapest premium tier and the priciest hosted open-weight model.

That makes the decision a partition problem. Premium frontier models still earn their price in difficult agentic work, hard reasoning, long-horizon tool use, customer-facing outputs, and regulated procurement. Open-weight models served by a third party have moved into "good enough for many production tasks" territory at one-fifth to one-fiftieth of the price per call. Self-hosting on rented GPUs sits underneath, but only earns its slot when utilization is high, evals are real, and someone owns incident response. Three tiers, not two. The operator's job is to send each workload to the tier that minimizes cost per resolved task, not cost per token.

Settled doctrine doesn't exist yet. Vendors converge on the operational pitch (cheap is good for batch, premium is good for agents) but disagree on where the line sits.

Cost-first operators move aggressively to hosted open-weight and discover the cliff was real on summarization and unreal on tool-heavy customer support. Control-first operators stay on premium APIs and pay a tax on workloads that would have run fine on Qwen3.5 or DeepSeek V4-Flash. Capability-first operators chase frontier benchmarks and end up with a bill that doesn't track any business outcome. Governance-first operators build self-hosted stacks because their procurement team wants control, then discover that observability and eval debt is bigger than the GPU bill.

A usable operator framework keeps all four pressures in view. The piece below names the three tiers, the four failure modes that show up when the line is drawn wrong, and a five-step methodology for partitioning workloads against your own evidence. The router is the deployment. The model is just what executes the call the router already made.

The three tiers of model access

The three tiers are not vendor categories. They are decisions the operator makes about each workload before the prompt is ever sent. Three checks, in order, usually resolve the partition: does residency or contract force the tier, is the failure cost bounded, and is the volume material. The figure below shows the flow. The three subsections that follow describe each tier.

Figure 2 — The router as three decisions. The partition is three checks, in order: residency first, failure cost second, volume third.
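
A minimal sketch of that flow in code. The field names and the volume threshold are illustrative, not from any vendor SDK; the point is the order of the checks:

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    SELF_HOST = "self-host"
    HOSTED_OPEN_WEIGHT = "hosted open-weight"
    PREMIUM_FRONTIER = "premium frontier"

@dataclass
class Workload:
    residency_blocked: bool     # check 1: does policy block third-party inference?
    failure_cost_bounded: bool  # check 2: is a wrong answer survivable?
    monthly_calls: int          # check 3: is the volume material?

def route(w: Workload) -> Tier:
    if w.residency_blocked:           # residency first: the tier is forced
        return Tier.SELF_HOST
    if not w.failure_cost_bounded:    # failure cost second: stay premium
        return Tier.PREMIUM_FRONTIER
    if w.monthly_calls >= 100_000:    # volume third: the cliff needs volume
        return Tier.HOSTED_OPEN_WEIGHT
    return Tier.PREMIUM_FRONTIER      # low-volume bounded work defaults up
```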

Premium frontier: keep the call here when failure is expensive

Principle: route to a premium frontier API when failure cost exceeds plausible savings, when the task is tool-heavy or agentic, when the output is customer-facing, when the reasoning is ambiguous, or when enterprise procurement requires controls hosted open-weight providers cannot yet match.

The current frontier set is GPT-5.5, GPT-5.5 Pro, Claude Opus 4.7, Claude Sonnet 4.6, and Gemini 3.1 Pro, with mid-tier managed siblings (GPT-5, GPT-5.4 mini, Claude Haiku 4.5, Gemini 3.1 Flash-Lite) sitting one rung below. Public list prices put Opus 4.7 at $5 input and $25 output per million tokens, GPT-5.5 at $5 and $30, and Gemini 3.1 Pro at $2 and $12 under 200K context. These are the rates the cheap-tier critics use to argue the frontier is overpriced. They are also the rates that buy you SOC 2, BAAs, zero-retention terms, regional residency, batch and flex tiers, prompt caching at 0.1x base read price, support contracts, abuse monitoring, and the integration surface that turns a model into a deployable product.

Workloads that earn this tier share a pattern. The model has to be right enough on the first or second attempt because retries, escalations, or bad tool calls cost more than the token spread. Complex agentic coding sits here: GPT-5.5 reports 58.6% on SWE-Bench Pro and 82.7% on Terminal-Bench 2.0; Opus 4.7 leads on SWE-Bench Pro and MCP Atlas; Gemini 3.1 Pro reports strong scores across Humanity's Last Exam, ARC-AGI-2, and APEX-Agents.

Customer-facing replies for sensitive or unbounded conversations sit here. Long-horizon agents that hold tool access sit here. Hard reasoning where you can't measure quality cheaply sits here. So does anything where the buyer is enterprise and the contract artifacts (DPA, BAA, ZDR, audit logs) are the deliverable, not just the answer.

The mistake to avoid is treating the frontier as the default for everything. A premium model is still a procurement choice. It earns its price on the workloads above and overpays on the workloads in the next two sections.

Hosted open-weight: the structural middle

Principle: route to a hosted open-weight endpoint when volume is material, the task is bounded, the output structure is stable, and quality can be measured. Hosted means a third party operates the GPUs, signs a privacy term, and exposes an OpenAI- or Anthropic-compatible API. You skip the GPU bill and the SRE bill in exchange for sub-frontier capability and a heavier diligence burden on governance.
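
Compatibility is what makes the tier cheap to adopt. A minimal sketch, assuming a Together-style OpenAI-compatible endpoint; the base URL and model id are illustrative, and the API key variable is whatever the provider issues:

```python
import os
from openai import OpenAI  # pip install openai

# Hosted means a different base_url behind the same client. The URL and
# model id below are illustrative; check the provider's docs for both.
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)

resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # illustrative hosted open-weight model id
    messages=[
        {"role": "system",
         "content": "Classify the ticket as one of: billing, bug, abuse, other."},
        {"role": "user", "content": "I was charged twice for the same invoice."},
    ],
    temperature=0,
)
print(resp.choices[0].message.content)
```

The call shape is identical to the frontier path, which is why per-workload routing can be an if-statement rather than a rewrite.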

This is the tier that reopened the buy-build-host conversation. DeepSeek V4-Flash at $0.14 input and $0.28 output is the structural price point that matters. V4-Pro's 75% promo through May 31, 2026 is real but temporary, and operators should not build long-term economics around a discount that will end.

Together hosts Qwen3.5-397B-A17B at $0.60 input and $3.60 output and GPT-OSS-120B at $0.15 input and $0.60 output. Fireworks hosts DeepSeek V4-Pro at $1.74 and $3.48. Groq lists Llama 4 Scout at $0.11 and $0.34 and Qwen3 32B at $0.29 and $0.59. None of these prices replace frontier capability. They replace frontier billing on workloads frontier capability didn't earn anyway.

Workloads that fit the hosted middle are familiar: classification, extraction, document summarization, RAG synthesis, internal search, batch back-office automation, draft code explanation, meeting-note processing, and first-pass routing on support tickets that hand off to a human or a frontier model when the bot can't resolve them. The four enabling conditions are stable: high volume so the cliff matters, measurable quality so you can detect regressions, bounded failure cost so a wrong answer doesn't blow up the relationship, and prompt and output structure stable enough that a cheaper model can hit the spec without bespoke prompting per call.

The diligence work that makes this tier safe is real. A cheap endpoint is not a compliance strategy. The provider's retention terms, abuse logging, region, SOC report, deletion path, subprocessor list, incident obligations, and policy on using customer outputs for provider evaluation all matter. The same nominal model can also behave differently on a first-party API, on a hosted third-party endpoint, and in a locally quantized deployment; latency, context behavior, tool formatting, refusal patterns, and even token counts can drift between them. Pick the endpoint per workload, not per vendor.

Self-host: only when something forces it

Principle: route to a self-hosted open-weight stack only when residency, contract, or air-gap policy blocks third-party inference, or when volume and utilization are high enough that the GPU economics beat hosted endpoints and your team can actually operate the runtime, the evals, and the incident response.

The serving layer has matured. vLLM, SGLang, and llama.cpp now expose OpenAI-compatible servers, Anthropic Messages API compatibility, continuous batching, prefix caching, paged attention, quantization, and tool-calling support. Runpod lists serverless H100 PRO at roughly $0.00116 per second flex and $0.00093 per second active. Lambda lists H100 instance prices from $3.29 to $4.29 per GPU hour. Together lists on-demand HGX H100 clusters at $3.49 per GPU hour. The infrastructure is no longer the gating problem.

The economics still are. A $3.50 per hour GPU sustaining 800 output tokens per second across batched traffic produces 2.88M output tokens per hour, or about $1.22 per million output tokens at full utilization. At 30% utilization that becomes $4.05 per million. If the model only sustains 200 tokens per second at longer context, the same GPU runs $4.86 per million at full utilization and $16.20 at 30%. Self-hosting can beat the frontier price; it does not automatically beat DeepSeek V4-Flash, Groq Llama 4 Scout, or hosted GPT-OSS-120B unless utilization, batching, and reserved-GPU pricing are all in your favor.
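
The arithmetic is worth keeping as a function rather than a slide. A sketch that reproduces the figures above; measured throughput and measured utilization are the two inputs that decide everything:

```python
def dollars_per_million_output(gpu_per_hour: float, tokens_per_second: float,
                               utilization: float) -> float:
    """Effective $/M output tokens for one rented GPU at a given utilization."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_per_hour / tokens_per_hour * 1_000_000

print(dollars_per_million_output(3.50, 800, 1.0))  # ~1.22, full utilization
print(dollars_per_million_output(3.50, 800, 0.3))  # ~4.05, 30% utilization
print(dollars_per_million_output(3.50, 200, 1.0))  # ~4.86, long-context throughput
print(dollars_per_million_output(3.50, 200, 0.3))  # ~16.20, both penalties at once
```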

The operations bill is the part that gets underweighted. Self-hosting buys you ownership of model serving, autoscaling, queue management, model loading, quantization, context configuration, tokenizer compatibility, model versioning, rollback, canary tests, health checks, rate limiting, abuse controls, secrets management, observability, evals, fallback routing, and incident response. The minimum eval suite alone (golden examples, adversarial cases, structured-output validity, tool-call correctness, retrieval correctness, latency and cost thresholds, refusal and over-answering tests, human spot checks) is its own product. The practical rule: if the company can't afford evals, it can't afford self-hosting. The rest of the stack just costs more.
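
As a floor, a sketch of the smallest slice of that suite, golden examples plus structured-output validity, assuming a call_model stand-in for whatever client wraps the serving stack; the cases and the deploy gate are illustrative:

```python
import json

GOLDEN = [
    {"prompt": "Extract the invoice total as JSON: 'Total due: $41.20'",
     "must_contain": "41.20", "must_be_json": True},
    # plus adversarial cases, refusal probes, tool-call and retrieval checks
]

def run_suite(call_model) -> float:
    passed = 0
    for case in GOLDEN:
        out = call_model(case["prompt"])
        ok = case["must_contain"] in out
        if case.get("must_be_json"):
            try:
                json.loads(out)        # structured-output validity
            except ValueError:
                ok = False
        passed += ok
    return passed / len(GOLDEN)

# Gate every checkpoint swap and runtime upgrade on the score:
# assert run_suite(my_client) >= 0.95
```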

The two clean cases for the tier remain. First, when residency, contract, or air-gap policy makes third-party inference impossible. Second, when steady-state volume is high enough and engineering capacity is mature enough that a dedicated stack pays back the operational tax. Outside those two, self-hosting tends to look cheap on a token table and expensive in the runbook.

Failure modes when the partition is drawn wrong

Four failure modes recur. Two live on the grid of risk versus tier choice, as off-diagonal cells; two sit off the grid entirely. The figure below maps the on-grid cases. Under-routing and misplaced self-hosting sit underneath because they don't live on the cost-tier grid.

Figure 3 — The diagonal of routing failures. Off-diagonal cells are the named failures. The two notes underneath are off-grid because they don't sit on the cost-tier plane.

Over-frontier

Over-frontier is the default failure. The team builds against OpenAI, Anthropic, or Google, ships, and never revisits the routing question. Six months in, the bill has a long tail of summarization, classification, internal search, and batch document work running on Opus 4.7 or GPT-5.5 because that's what the original prototype used. Every call is fine; the aggregate is a tax on workloads that didn't need the capability they bought.

The tax compounds with reasoning-mode inflation. DeepSeek's V4 modes (non-think, think high, think max) and Anthropic's effort controls and tokenizer warning are reminders that "per million tokens" can move 35% in either direction without a price-table change. A workload routed to a thinking-mode frontier model when the same workload runs cleanly on hosted Qwen3.5 or DeepSeek V4-Flash is paying twice: once for unused capability, once for tokens the cheaper model wouldn't have generated.

The fix is not "replace frontier." The fix is to stop sending all workloads to frontier by default. Build the router. Log every call by category, model, cost, and quality. Move bounded, high-volume categories to hosted open-weight first.

Under-routing

Under-routing is over-frontier's quieter sibling. The team agrees in principle that a router would help, then never builds it because the engineering cost looks bigger than the savings on this quarter's traffic. The router stays a slide. Token bills keep growing in line with usage instead of below it.

Under-routing also shows up as binary thinking: the team believes the choice is "stay on Anthropic" or "rip and replace with DeepSeek." Neither is correct. The router is incremental. A first version is one if-statement that sends bounded high-volume tasks to a hosted open-weight endpoint and everything else to the existing frontier model, plus logs that compare cost per resolved task across both paths. That ships in days, not quarters, and the data the first version produces is what funds the second version.
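
A sketch of that first version, with stub clients where the real wrappers go; the category set is illustrative and the log row is the part that funds version two:

```python
import time
from dataclasses import dataclass

@dataclass
class Completion:
    text: str
    cost_usd: float

def cheap_complete(prompt: str) -> Completion:     # hosted open-weight path
    raise NotImplementedError("wire to your hosted endpoint")

def frontier_complete(prompt: str) -> Completion:  # existing frontier path
    raise NotImplementedError("wire to your frontier API")

BOUNDED_HIGH_VOLUME = {"classification", "extraction", "summarization"}

def handle(category: str, prompt: str) -> Completion:
    use_cheap = category in BOUNDED_HIGH_VOLUME    # the one if-statement
    complete = cheap_complete if use_cheap else frontier_complete
    t0 = time.monotonic()
    result = complete(prompt)
    # Replace print with real logging; this row is what lets you compare
    # cost per resolved task across the two paths.
    print({"category": category,
           "path": "hosted-open-weight" if use_cheap else "premium-frontier",
           "latency_s": round(time.monotonic() - t0, 3),
           "cost_usd": result.cost_usd})
    return result
```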

Misplaced self-hosting

Misplaced self-hosting is the most expensive failure of the four. The team self-hosts an open-weight model on rented GPUs because the procurement deck said it would be cheaper, runs it at 20% utilization, ships without an eval harness, and discovers six weeks later that the per-token cost is higher than the hosted endpoint they replaced and the quality is worse because no one tuned the serving stack.

The pattern is recognizable in the runbook. Capacity is provisioned for peak and used at average. Tokenizer behavior changes between the model card and the production deployment and no one notices because there is no eval comparing checkpoints. A vLLM or SGLang upgrade alters latency or tool-call formatting and a tier of customer flows breaks silently. A safety filter starts suppressing legitimate answers and the support team blames the prompt, not the runtime. Each of these is a managed-API-style problem that the buyer absorbed without the managed-API support contract.

The honest test for self-hosting: would you self-host your database? If the answer is yes for residency or contract reasons, self-host the model the same way and budget the same operational discipline. If the answer is no, self-hosting the model layer for cost reasons is almost always misplaced.

Brittle cheap-tier

Brittle cheap-tier is the failure mode that cancels the cliff. The team correctly identifies a high-volume workload, routes it to a hosted open-weight endpoint, and treats the routing decision as final. The cheap model handles the easy cases well. It also handles a small percentage of customer-visible cases badly: one wrong refund quote, one bad tool call, one missed escalation, one fabricated policy line. The cost of those cases erases the token savings and then some.

The lesson is the one the support-bot boundary already taught: cheap first-pass models are fine; cheap final authority is not. Customer-visible, tool-executing, account-actioning, legally sensitive, medically sensitive, financially sensitive, and abuse-adjacent paths need an escalation gate to a frontier model or to a human. The router is not just "which model gets the call." It is also "what happens when this model says it's not sure."

A methodology for drawing your own line

1. Inventory workloads by failure cost, volume, and structure

Pull 30 to 90 days of model traffic. If volume is low, use all of it. If volume is high, sample at least 1,000 calls per major surface. Tag each call by surface, task type, customer-visibility, tool use, output structure, latency requirement, current model, input and output token counts, retry rate, and whether the call drove a downstream human review or escalation.

Calculate cost per resolved task per category, not cost per token. The first output is a workload map: which categories are high-volume and bounded, which are low-volume and high-stakes, which are tool-heavy, which are batchable. The second output is a baseline you can route against.
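
A sketch of the baseline computation, assuming call rows tagged with the step-1 fields; field names are illustrative, and dividing by resolutions rather than calls is the whole point:

```python
from collections import defaultdict

def cost_per_resolved_task(calls: list[dict]) -> dict[str, float]:
    cost = defaultdict(float)
    resolved = defaultdict(int)
    for c in calls:
        cost[c["category"]] += c["cost_usd"]   # retries and escalations included
        resolved[c["category"]] += c["resolved"]
    return {cat: cost[cat] / max(resolved[cat], 1) for cat in cost}

calls = [
    {"category": "summarization", "cost_usd": 0.004, "resolved": True},
    {"category": "summarization", "cost_usd": 0.004, "resolved": True},
    {"category": "support",       "cost_usd": 0.012, "resolved": False},  # retried
    {"category": "support",       "cost_usd": 0.012, "resolved": True},
]
print(cost_per_resolved_task(calls))
# {'summarization': 0.004, 'support': 0.024}: the failed call doubles support's true cost
```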

2. Map governance and residency constraints first

Before testing cheaper tiers, remove categories that have to stay where they are. PHI under HIPAA wants a BAA-ready provider or self-hosting. Customer contracts that prohibit external inference want approved cloud-region deployment or self-hosting. Sovereign or air-gapped environments want self-hosting. Categories with strict abuse-monitoring obligations want a provider with explicit abuse logging terms. This step prevents the most common operator mistake: treating the partition as a cost decision when half of it is a control decision.

3. Test candidate workloads on cheaper tiers with your own evals

For each category that is a candidate for the hosted open-weight or self-host tier, run representative historical calls through the cheaper tier in a sandbox. Use the same prompts, retrieval, tools, and policies you plan to ship. Score against your own task-level metrics, not vendor benchmarks.

For classification, measure precision, recall, and refusal behavior on adversarial cases. For extraction, measure field accuracy, hallucinated fields, and confidence calibration. For summarization, measure factual fidelity, omissions, and tone match. For coding drafts, measure test pass rate, patch quality, and whether the model touched files it should not. A category qualifies for migration only when the cheaper tier matches or beats the frontier on the metric the business actually pays for.
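
As one concrete instance, a sketch of the extraction scoring above: field accuracy against hand-labeled gold records plus a hallucinated-field check. Field names are illustrative:

```python
def score_extraction(predicted: dict, gold: dict) -> dict:
    correct = sum(predicted.get(k) == v for k, v in gold.items())
    hallucinated = [k for k in predicted if k not in gold]
    return {
        "field_accuracy": correct / len(gold),
        "hallucinated_fields": hallucinated,
    }

gold = {"invoice_no": "INV-104", "total": "41.20", "currency": "USD"}
pred = {"invoice_no": "INV-104", "total": "41.20", "po_number": "unknown"}
print(score_extraction(pred, gold))
# {'field_accuracy': 0.666..., 'hallucinated_fields': ['po_number']}
```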

4. Build the router as a policy artifact

The router is not a config flag. It is a product surface with a policy log, a fallback path, a kill switch, an escalation rule, and a measurement. At minimum, route by task category, customer-visibility, and risk class.

Log every call with model, prompt version, tool schema, tool-call result, retrieval source, latency, input and output tokens, cost, downstream outcome, and whether the call escalated. Wire fallback routing so that a hosted-open-weight path that times out, returns malformed structured output, or trips a confidence threshold falls forward to a frontier model rather than back to the user. Treat the router itself as a governance artifact: it is the audit surface for "which model touched which customer," not just a cost optimization.
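
A sketch of the fall-forward rule, assuming the two paths are passed in as callables and the cheap path returns a JSON object with an optional confidence field; the threshold is illustrative. The direction matters: failures go up to frontier, never back to the user:

```python
import json

def route_with_fallback(prompt: str, cheap, frontier, conf_threshold: float = 0.8):
    try:
        raw = cheap(prompt, timeout_s=5)
        parsed = json.loads(raw)                   # malformed output raises
        if parsed.get("confidence", 1.0) >= conf_threshold:
            return parsed, "hosted-open-weight"
    except (TimeoutError, ValueError):
        pass                                       # slow or malformed: escalate
    return frontier(prompt, timeout_s=30), "premium-frontier"
```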

5. Iterate from cost-per-resolved-task data, not token bills

Review the router weekly for the first month and at least monthly after. Track cost per resolved task by category, retry rate, escalation rate, fallback-fire rate, customer satisfaction by category, and human-edit distance for human-in-the-loop flows. Categories where the cheaper tier is matching frontier on quality and beating it on cost are candidates to widen. Categories where the cheaper tier is regressing on quality, generating retries, or driving downstream review are candidates to narrow or escalate. The router moves from evidence, not from vendor pricing pages.

Classification rubric

Use these workload archetypes to partition your own traffic by analogy. The default tier is the right starting point, not a permanent verdict.

  • High-volume classification, tagging, intent detection. Default tier: hosted open-weight. Bounded output, measurable quality, retry cost low.
  • Document extraction with stable schema. Default tier: hosted open-weight. Field accuracy is testable; cliff is large at volume.
  • Long-document summarization, transcript synthesis. Default tier: hosted open-weight. Long context plus high volume is where the cliff is biggest.
  • RAG synthesis on internal docs. Default tier: hosted open-weight. Retrieval carries most of the quality; model matters less.
  • Batch back-office automation. Default tier: hosted open-weight (batch tier). Latency-tolerant; batch discounts compound the cliff.
  • First-pass support reply with frontier escalation. Default tier: hosted open-weight. Cheap drafts plus a confidence-gated handoff beat single-tier on cost and quality.
  • Multilingual support across many locales. Default tier: hosted open-weight. Qwen-class models cover wide language support at fractional cost.
  • Code explanation, draft review, doc-string drafts. Default tier: hosted open-weight. Human edits the output; cheap is the right anchor.
  • Complex agentic coding with autonomous tool use. Default tier: premium frontier. Bad tool calls are expensive; benchmark headroom matters.
  • Customer-facing reply on regulated or sensitive topics. Default tier: premium frontier. Wrong answer cost exceeds any plausible token savings.
  • Hard reasoning, ambiguous-intent synthesis. Default tier: premium frontier. Quality gap on the long tail still favors frontier.
  • Long-horizon agent with binding actions. Default tier: premium frontier. Reliability premium is the procurement metric.
  • Enterprise-procured deployment requiring SOC, BAA, or ZDR. Default tier: premium frontier (enterprise). Compliance artifacts come with the contract, not the model.
  • PHI, financial, or legal data with residency or BAA requirement. Default tier: premium frontier (enterprise) or self-host. Provider terms or in-house control; cost is secondary.
  • Sovereign, air-gapped, or on-prem deployment. Default tier: self-host. Third-party inference is structurally blocked.
  • Steady-state high-utilization workload with mature SRE and evals. Default tier: self-host. Utilization makes the GPU bill beat hosted endpoints.
  • Edge or on-device inference. Default tier: self-host (small model). Locality, offline behavior, and latency are the deliverable.
  • Adversarial, safety-critical, or abuse-adjacent path. Default tier: premium frontier with human gate. Cheap final authority is the failure mode to avoid.

What to validate before drawing your own line

Confirm your regulatory posture, customer contract obligations, residency requirements, and abuse-monitoring duties. Categories that fail this check are off the table for hosted endpoints regardless of price. Measure your current traffic distribution and identify which categories are high-volume, bounded, and stable enough to migrate; cliffs only matter at volume.

Test candidate cheaper tiers explicitly on your hardest cases: adversarial inputs, edge formats, ambiguous intent, missing data, tool failures, long-context contradictions. A model that handles the easy 80% at one-fiftieth the price still loses if the hard 20% breaks the workflow. Confirm the hosted provider's data terms, retention, region, subprocessor list, deletion path, and incident obligations before any production traffic moves; a cheap endpoint is not a compliance strategy.

If you're considering self-host, confirm utilization projections, eval coverage, observability tooling, on-call rotation, and rollback path. If any one of those is missing, the tier isn't ready for you. Assign a human owner for weekly router review. If no one owns the iteration, the partition will decay as models, prices, and traffic shapes change.

The router is the deployment. The model is just what executes the call.

Methodology

The partition was built by triangulating four kinds of evidence. Public pricing pages from OpenAI, Anthropic, Google Cloud, DeepSeek, Together, Fireworks, Groq, Runpod, and Lambda established the dollar floor and the structural cliffs as of April 30, 2026; we read each page directly and recorded list rates plus stated batch, flex, prompt-cache, residency, and reserved-capacity multipliers rather than blended figures. Vendor model cards and benchmark posts from OpenAI, Anthropic, Google DeepMind, DeepSeek, Alibaba, and Meta supplied the capability signals; we treat these as shortlist inputs, not procurement evidence. Serving and observability documentation from vLLM, SGLang, llama.cpp, Langfuse, Phoenix, and the OpenTelemetry GenAI semantic conventions established what self-hosting actually buys you in operational obligations. Public coverage of benchmark reproducibility (the LMArena Llama 4 Maverick controversy) supplied the reminder that vendor scores are necessary and insufficient.

The premium-frontier tier is grounded in convergent evidence: agentic coding and hard-reasoning benchmarks, enterprise privacy and compliance terms, and the integration surface that buyers actually contract for. The hosted-open-weight tier is grounded in first-party DeepSeek pricing plus hosted prices on Together, Fireworks, and Groq, cross-checked against published benchmark and serving claims. The self-host tier is grounded in GPU rental rates, the published throughput characteristics of vLLM and SGLang, and the operational obligations documented by Langfuse and OpenTelemetry.

Specific dollar comparisons in this dossier ignore retries, cache hits, batch discounts, provider minimums, routing overhead, and tokenizer differences; that makes them useful for direction, not procurement. Final partition decisions belong to the operator's own evals on the operator's own task suite.

Sources

  1. OpenAI, API pricing (GPT-5.5, GPT-5, GPT-5.4 mini, batch, flex, residency, search)
  2. OpenAI, Introducing GPT-5.5 (release notes, benchmark tables, GPT-5.5 Pro pricing)
  3. OpenAI, GPT-5 model details (GPT-5, GPT-5 mini, GPT-5 nano)
  4. Anthropic, Claude pricing (Opus 4.7, Sonnet 4.6, Haiku 4.5, batch, prompt caching, regional, tokenizer warning, tool-use accounting)
  5. Anthropic, Claude Opus 4.7 announcement and model docs
  6. Google Cloud, Agent Platform pricing (Gemini 3.1 Pro, Flash-Lite, batch, flex, grounding)
  7. Google Cloud, Gemini 3.1 Pro documentation and benchmarks
  8. DeepSeek, V4 release and API pricing
  9. DeepSeek, V4-Pro Hugging Face model card
  10. Alibaba, Qwen3.5-397B-A17B Hugging Face model card
  11. NVIDIA, Qwen3.5 NVFP4 quantized model card
  12. Meta, Llama 4 model and benchmark page
  13. The Verge, Meta Llama 4 Maverick benchmark controversy
  14. Together AI, hosted model and GPU cluster pricing
  15. Fireworks, hosted model pricing
  16. Groq, hosted model pricing and throughput table
  17. Artificial Analysis, methodology notes on price, output speed, and provider comparisons
  18. Runpod, serverless and GPU pricing
  19. Lambda, AI Cloud pricing
  20. vLLM, documentation (serving features, OpenAI- and Anthropic-compatible APIs, batching, prefix caching, quantization)
  21. SGLang, documentation and production metrics
  22. llama.cpp, server documentation
  23. Langfuse, observability, tracing, and evaluation docs
  24. Arize Phoenix, tracing and evaluation docs
  25. OpenTelemetry, GenAI semantic conventions
  26. OpenAI, enterprise privacy and data-retention terms
