LIVE VIBE CHECK: OPUS 4.7 DROPS
TL;DR
Anthropic skipped early access for Opus 4.7, so Every tested it live from scratch — Dan Shipper opened by joking that Anthropic had "snubbed" them, then ran real workflows in public across Claude, Claude Code, Cursor, and co-work instead of relying on benchmark claims like the reported ~10% bump on SWE-bench Pro and ~7% on SWE-bench Verified.
The clearest early pattern: Opus 4.7 is more literal and less gap-filling than Opus 4.6 — Across writing, coding, and financial analysis, the team kept finding that 4.7 followed instructions more exactly but was worse at inferring unstated intent, which made prompts tuned for 4.6 produce weaker results unless rewritten with more explicit direction.
For structured business work, 4.7 looked strong and reliable — Dan gave it Every’s March 2026 P&L and asked for an investor update, and it correctly pulled the numbers, wrote in a direct operator-style tone, and produced something close to the recap he would actually have sent for a roughly $600k-revenue month.
For creative writing, Katie preferred Opus 4.6’s voice over 4.7’s cleaner but more systematic prose — In a side-by-side intro draft about using Margot to manage OKRs, 4.6 felt more unpredictable and human, while 4.7 sounded more regimented, even with Katie’s style guides and writing skills layered into Claude Code.
On the 'vibe slop benchmark,' Opus 4.7 could diagnose architecture problems but hesitated to do the scary rewrite — Dan’s test used an old, brittle version of Proof’s collaborative editor codebase, and 4.7 correctly identified the core issue ('no single authoritative model' for document state) yet kept nibbling around the edges instead of fully 'burning the ships' and refactoring from first principles.
Anthropic researcher Alex Albert said that behavior is partly intentional: 4.7 is tuned to be 'more on the mark,' and you need to ask for intensity — His practical advice was to use higher effort levels, be explicit about wanting more tool calls and more thoroughness, and even say things like 'I’m going to bed' to signal that the model has time to keep iterating on long-running tasks.
The Breakdown
A launch-day scramble instead of the usual polished preview
Dan Shipper kicked things off with Brandon, matching green shirts, and a little mock heartbreak: Anthropic hadn’t given Every early access to Opus 4.7, so the usual pre-baked vibe check was out. That changed the whole tone of the stream — less verdict, more live lab — as they pulled up Anthropic’s claims about better long-running tasks, more precise instruction-following, and a new habit of verifying its own output before responding.
First tests in co-work: finance, writing loops, and OpenClaw setup
Dan’s first real task was practical: feed Every’s messy P&L into Claude and have it draft a March 2026 investor update. At the same time, he tried a workflow where Claude sat inside Proof as a kind of live writing companion, using a scratchpad to keep suggesting ideas as he drafted a manifesto. Brandon, meanwhile, ran a quirky but revealing test, asking Opus 4.6 and 4.7 to build an "OpenClaw" version of him from Claude memories and usage patterns — and immediately noticed that 4.7 was slower and organized the output in a less useful way.
Katie’s writing test: less AI smell, or just less personality?
Katie jumped in from Claude Code to stress-test the writing side, especially whether 4.7 reduced the usual "AI smell." Her side-by-side article intros showed the tradeoff clearly: Opus 4.6 produced more surprising, voicey lines like "Cloudy with a chance of unemployment," while 4.7 was cleaner and more systematic but also flatter, less like her actual cadence. Her early read was blunt: if she had to go back to work right away, she’d default to 4.6 for writing.
Brandon’s OpenClaw and P&L verdict: accurate numbers, lazier analysis
Brandon’s strongest complaint wasn’t that 4.7 got things wrong — it mostly got the numbers right — but that it stopped too early. In monthly P&L analysis, 4.6 had previously surfaced non-obvious accounting issues like failed Mercury transactions inflating product costs, while 4.7 mostly described obvious month-over-month changes unless heavily pushed. He saw the same pattern in app-building tasks: 4.7 often completed the assignment, but in a shallower, "lazier" way than 4.6.
The 'vibe slop benchmark' exposed 4.7’s caution
Dan’s signature coding test used an old version of Proof, a collaborative editor he admits he originally vibe-coded into a production "slop machine" that kept going down. The benchmark asks whether a frontier model can look at that ugly codebase and think like a senior engineer: identify the true architectural flaw, then execute a brave rewrite. Opus 4.7 passed the diagnosis part beautifully — it spotted the missing single source of truth around document authority — but then got timid, choosing safe slices and migration-ish steps instead of actually tearing the system down and rebuilding it.
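To make that diagnosis concrete: in a collaborative editor with no single authoritative model, the UI, the sync layer, and the server can each hold their own copy of the document and quietly drift apart. Here is a minimal sketch of the pattern 4.7 flagged as missing, in illustrative TypeScript; this is not Proof's actual code, and every name in it is invented for the example:

```typescript
// Illustrative sketch only -- not Proof's real architecture.
// One store owns the canonical document; every edit flows through it,
// and views subscribe to changes instead of keeping private copies.
type Listener = (text: string, version: number) => void;

class DocumentStore {
  private text = "";                 // the single authoritative copy
  private version = 0;               // monotonic counter for ordering edits
  private listeners: Listener[] = [];

  // All writes, whether from the UI or the sync layer, go through apply().
  apply(edit: { insertAt: number; content: string }): number {
    this.text =
      this.text.slice(0, edit.insertAt) +
      edit.content +
      this.text.slice(edit.insertAt);
    this.version += 1;
    for (const fn of this.listeners) fn(this.text, this.version);
    return this.version;
  }

  subscribe(fn: Listener): void {
    this.listeners.push(fn);
  }
}

// Usage: both "views" render from the same source of truth,
// so they can never disagree about what the document says.
const doc = new DocumentStore();
doc.subscribe((text, v) => console.log(`editor pane v${v}:`, text));
doc.subscribe((text, v) => console.log(`preview pane v${v}:`, text));
doc.apply({ insertAt: 0, content: "Hello, Proof." });
```

The "brave rewrite" the benchmark asks for is essentially routing every existing code path through a store like this one, which is exactly the kind of teardown 4.7 kept avoiding.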
A surprisingly revealing toy test: the minimalist to-do app
Brandon also gave both 4.6 and 4.7 the same lightweight app task: build a minimalist to-do app, add one magical feature, then add one AI feature. Both worked, and 4.7 even made flashy confetti animations, but the AI feature told the deeper story: 4.6 transformed vague tasks into usable sub-tasks, while 4.7 merely "sharpened" the text and dumped multiple steps into a single to-do. It was a tiny example, but it echoed the broader theme that 4.7 often follows the letter of the prompt without making the extra leap into better UX.
Alex Albert from Anthropic reframed the whole vibe check
The most useful section came when Anthropic researcher Alex Albert joined and more or less confirmed what the team was feeling: Opus 4.7 is "more on the mark," meaning more literal, less likely to freestyle beyond the prompt, and better if you explicitly ask for thoroughness. He said Anthropic has been oscillating between models that do too much and models that need more direction, and 4.7 sits on the more controllable side. His practical tips were gold for builders: raise effort levels, ask for more tool calls and more changes, lean into long-running background tasks, and literally tell the model you’re "going to bed" if you want it to keep working.
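For builders who want to try that advice, here is a minimal sketch of what it might look like as a prompt, assuming Anthropic's TypeScript SDK (`@anthropic-ai/sdk`) and a hypothetical `claude-opus-4-7` model id; the instruction wording is illustrative, not Albert's exact phrasing:

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Hypothetical model id -- check Anthropic's model list for the real string.
const MODEL = "claude-opus-4-7";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function runThorough(task: string): Promise<void> {
  const message = await client.messages.create({
    model: MODEL,
    max_tokens: 8192,
    messages: [
      {
        role: "user",
        // Per Albert's advice: spell out the intensity you want instead of
        // hoping the model infers it from a terse prompt.
        content: [
          task,
          "Be thorough: make as many tool calls and changes as you need,",
          "verify your own output before answering, and keep iterating.",
          "I'm going to bed, so take your time.",
        ].join(" "),
      },
    ],
  });
  console.log(message.content);
}

runThorough("Review this repo's test suite and fix anything flaky.");
```

Where effort levels live (a product setting, an API parameter, or both) varies by surface, so the sketch only covers the prompt-side half of his advice: the thoroughness that 4.6 volunteered now has to be asked for explicitly.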
The final synthesis: probably stronger, but not with your old prompts
By the end, Dan, Katie, and Brandon all converged on the same provisional read: Opus 4.7 may be a more capable power tool, but it’s not a drop-in upgrade if your workflows were tuned for Opus 4.6’s more intuitive, gap-filling behavior. Dan liked 4.7 for structured, blended tasks like finance plus writing; Katie still preferred 4.6 for voice-heavy work; Brandon felt 4.7 was under-reaching unless prodded. The energy of the stream was basically: this model may be better, but only if you learn how to drive it.