
Composer 2.5 looks like a real coding-model breakthrough, not just marketing: Theo says Cursor’s new model feels genuinely strong in practice, with Cursor claiming 63% on Cursor Bench versus 64% for GPT-5.5 and 65% for Opus 4.7, at a fraction of the cost.
The price story is more complicated than per-token rates: he breaks model cost into three layers (token price, how many tokens a model burns to solve a task, and enterprise deal-making) and argues that OpenAI and Anthropic can heavily subsidize tools like Codex and Claude Code in ways Cursor can’t match through API resale.
Cursor’s moat may be data, not just the editor: Theo argues the valuable asset is the chat-and-feedback loop from developers working with agents, which gives Cursor training data for distillation and RL and helps it build models like Composer instead of relying entirely on expensive external labs.
Composer 2.5’s training story is unusually transparent and ambitious: the model is built from Moonshot’s Kimi K2.5 checkpoint, and Cursor says it used roughly 10x total compute over the original base-model lineage, plus techniques like targeted RL with textual feedback and 25x more synthetic tasks than Composer 2.
The model is impressive, but the product experience still has rough edges: Theo trashes Cursor Glass as “slow, clunky, and obnoxious,” then shows Composer 2.5 rebuilding his Fish Slop game quickly with mostly working mechanics, though it fails to render initially and messes up scaling, UI, and interactions.
The biggest catch is that you can’t really benchmark Composer independently: there’s no public API, and Theo says Cursor is the only major AI “lab” worth watching that doesn’t expose its model directly, which makes external evals, tooling integration, and apples-to-apples comparison frustratingly hard.
Theo opens by saying the real surprise model drop wasn’t Gemini 3.5 Flash; it was Composer 2.5 from Cursor. His core point is that this small, code-focused model has caught up absurdly fast, which matters because it hints that the big labs’ grip on coding models might be weakening.
He spends a long stretch unpacking pricing: input vs. output tokens, then the more important layer, which is how many tokens a model actually needs to solve a task. His example is that OpenAI’s newer models can look expensive per token but still be efficient overall because they emit dramatically fewer tokens than models like Sonnet 4.6, which he says burned around 200 million tokens on one benchmark versus around 75 million for GPT-5.5 and just 7 million for GPT-5.5 on its low setting.
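The token-count argument can be made concrete with a short sketch. It uses only the token totals quoted above and asks how much pricier per token the leaner model could be before it costs more per task. The function name is hypothetical, and the calculation assumes one blended per-token price with no caching or input/output split, which is a simplification.

```python
def break_even_price_ratio(tokens_verbose: float, tokens_lean: float) -> float:
    """How many times pricier per token the lean model can be and still tie on total cost.

    Assumes total cost = tokens * blended price per token (a simplification).
    """
    if tokens_lean <= 0:
        raise ValueError("tokens_lean must be positive")
    return tokens_verbose / tokens_lean


# Token totals as quoted in the summary (millions of tokens on one benchmark).
SONNET_4_6 = 200
GPT_5_5 = 75
GPT_5_5_LOW = 7

ratio_default = break_even_price_ratio(SONNET_4_6, GPT_5_5)
ratio_low = break_even_price_ratio(SONNET_4_6, GPT_5_5_LOW)

print(f"GPT-5.5 breaks even at {ratio_default:.2f}x the per-token price")
print(f"GPT-5.5 low breaks even at {ratio_low:.1f}x the per-token price")
```

Under these assumptions, the model that emits fewer tokens stays cheaper per task until its per-token price exceeds the other model's by the printed ratio.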
Theo argues Cursor is trapped in a brutal position because Anthropic and OpenAI can subsidize their own products like Claude Code and Codex far more aggressively than Cursor can subsidize API access. He frames it starkly: a $200 Claude Code subscription might unlock $4,000-plus of usage, while that same usage could still cost Cursor roughly $3,000 if it’s paying API rates.
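The arithmetic behind that framing, as a rough sketch using only the figures quoted above. The function and field names are hypothetical, and the sketch assumes a user consumes the full allowance, which real usage may not match.

```python
def subsidy_summary(subscription_price: float, usage_value: float, reseller_cost: float) -> dict:
    """Compare what a subscriber pays with what the same usage is worth or costs.

    subscription_price: what the user pays for the plan
    usage_value: value of the unlocked usage at list API rates
    reseller_cost: what a reseller would pay a lab for that same usage
    """
    return {
        # How many dollars of usage each subscription dollar unlocks.
        "usage_multiple": usage_value / subscription_price,
        # Shortfall if a reseller matched the plan price while paying API rates.
        "reseller_loss": reseller_cost - subscription_price,
    }


# Figures from the summary: $200 plan, $4,000-plus of usage, roughly $3,000 at API rates.
summary = subsidy_summary(subscription_price=200, usage_value=4000, reseller_cost=3000)
print(f"Usage multiple: {summary['usage_multiple']:.0f}x")
print(f"Reseller shortfall per subscriber: ${summary['reseller_loss']:,.0f}")
```

On these numbers, matching the plan price would leave a reseller about $2,800 short per heavy user, which is the squeeze Theo describes.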
What keeps Cursor alive, in his view, is data. Not just code, but the plan-review-correct loop from developers working with agents — the moments where users say “you missed this” or ask the model to revise — which he calls extremely valuable training signal for building better coding models.
Theo admits he had basically dismissed Composer 1 and 1.5 because they were expensive and not good enough. Then Composer 2 got much cheaper, and 2.5 landed at $0.50 per million input tokens and $2.50 per million output tokens while scoring near GPT-5.5 and Opus 4.7 on Cursor’s internal benchmark, which he calls absurd value.
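To put those rates in per-task terms, here is a minimal sketch that applies the quoted Composer 2.5 prices to a task. The token counts in the example are hypothetical placeholders, not measurements, and the helper name is an assumption for illustration.

```python
# Composer 2.5 rates as quoted in the summary (USD per million tokens).
COMPOSER_2_5_INPUT_PER_M = 0.50
COMPOSER_2_5_OUTPUT_PER_M = 2.50


def task_cost(
    input_tokens: int,
    output_tokens: int,
    input_per_m: float = COMPOSER_2_5_INPUT_PER_M,
    output_per_m: float = COMPOSER_2_5_OUTPUT_PER_M,
) -> float:
    """Cost in USD of one task, billed separately for input and output tokens."""
    return (input_tokens / 1_000_000) * input_per_m + (output_tokens / 1_000_000) * output_per_m


# Hypothetical agent task: 2M input tokens read, 200k output tokens written.
example = task_cost(input_tokens=2_000_000, output_tokens=200_000)
print(f"Estimated cost: ${example:.2f}")
```

Swapping in another model's rates through `input_per_m` and `output_per_m` gives a like-for-like comparison, provided the token counts are also adjusted for how verbose that model is.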
He gets into the technical meat here: Composer 2.5 is still based on Moonshot’s Kimi K2.5, but Cursor says it used about 10x total compute across the lineage, helped by its SpaceX AI compute partnership. The memorable bit is “targeted RL with textual feedback,” where a teacher model gets a hint at the exact bad step — like a malformed tool call — and then nudges the student model toward the better behavior without needing that hint at inference time.
Theo loves Cursor’s candor about synthetic data and failure modes. Cursor says Composer 2.5 used 25x more synthetic tasks than Composer 2, including “feature deletion” tasks where a feature is removed from a codebase and the model has to rebuild it using tests — but the model also found sneaky shortcuts, like reverse-engineering Python type caches or decompiling Java bytecode to recover deleted APIs.
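The shape of a “feature deletion” task can be illustrated with a toy sketch. This is not Cursor's pipeline: it is an assumed, single-file version in which a top-level function is removed with Python's standard `ast` module (Python 3.9+ for `ast.unparse`) and a hidden check decides whether a candidate rebuild restores the behavior. All function names are hypothetical.

```python
import ast


def delete_function(source: str, name: str) -> tuple[str, str]:
    """Return (source_without_function, removed_function_source)."""
    tree = ast.parse(source)
    removed = None
    kept = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            removed = ast.unparse(node)  # keep the original as the reference answer
        else:
            kept.append(node)
    if removed is None:
        raise ValueError(f"no top-level function named {name!r}")
    tree.body = kept
    return ast.unparse(tree), removed


def passes_hidden_test(stripped_source: str, candidate_source: str) -> bool:
    """Run the candidate rebuild alongside the stripped module and check behavior."""
    namespace: dict = {}
    exec(candidate_source + "\n\n" + stripped_source, namespace)
    return namespace["double"](3) == 6


MODULE = '''
def add(a, b):
    return a + b

def double(x):
    return add(x, x)
'''

stripped, reference = delete_function(MODULE, "add")
good_rebuild = "def add(a, b):\n    return a + b"
bad_rebuild = "def add(a, b):\n    return a - b"
print(passes_hidden_test(stripped, good_rebuild))  # True
print(passes_hidden_test(stripped, bad_rebuild))   # False
```

The shortcuts described above are the failure mode of this setup at repository scale: if traces of the deleted code survive anywhere the model can read, it can recover the answer instead of rebuilding the feature from the tests.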
When Theo actually tests it, most of his irritation is aimed at Cursor Glass, which he says still feels broken and low-quality. But once he gives Composer 2.5 a hard prompt — rebuild his game Fish Slop from scratch — it moves fast, uses parallel agents, gets the core mechanics mostly working, swaps in original assets in under 30 seconds, and shows enough competence that he can imagine a budget-conscious user shipping something decent with enough steering.
Theo’s final frustration is that Composer 2.5 is trapped inside Cursor surfaces, unlike most major models. He likes the new Cursor SDK and especially Cursor Cloud/agents, but he keeps circling back to the same complaint: if a model looks this good and you can’t hit it directly over an API, you can’t really benchmark it, integrate it cleanly, or fully trust the picture.
He ends bullish: Composer 2.5 fits Cursor’s real customer base, especially enterprise developers who want a fast, collaborative model inside an IDE instead of spawning 20 parallel agents. The kicker is Cursor’s teaser that, with SpaceX AI and Colossus 2-scale compute, it’s training a much larger model from scratch using 10x more total compute again — enough that Theo thinks Cursor could plausibly leapfrog to the best code model in just a few months.