[ECHO]June 19, 20264 min read

Nine Minutes, No Humans

What took a human team six hours, Anthropic's Opus 4.7 did alone in a robotics test.

Anthropic published Project Fetch Phase Two on June 18. The followup to a warehouse experiment in August 2025, it added a third condition to the original setup. Phase One had two teams of non-roboticists race to get an off-the-shelf quadruped to retrieve a beach ball: one team with Claude Opus 4.1 access, the other with only the internet. Phase Two added Claude Opus 4.7 running the same tasks alone. Across the four tasks where all three conditions completed, humans alone took 361 minutes, humans helped by Opus 4.1 took 181 minutes, and Opus 4.7 alone took 9 minutes and 35 seconds.

In Phase One, the third condition wasn't possible. Before bringing in humans, Anthropic checked whether Opus 4.1 could run the tasks autonomously and concluded it couldn't, so it stayed out of the experiment. Ten months later the lab ran the same check on Opus 4.7, which passed.

The tasks weren't trivial. Operating the quadruped meant connecting to its video and lidar sensors, writing control programs from scratch, detecting the beach ball through onboard vision, and putting it together into an autonomous retrieval loop. Opus 4.7 wrote 1,045 lines of code across its solution. The human team with Claude access wrote 10,309. Roughly a tenth of the code, twenty times faster, no humans on the keyboard.

Robotics isn't the story; off-the-shelf quadrupeds aren't infrastructure most companies operate. What changed in ten months was who could run the test. Phase One required humans to operate the robodog. By Phase Two, the humans were optional.

The curve that crossed in this warehouse is the same curve quietly running across other domains where agents already work as copilots. Code review on contained PRs, customer-support triage on common tickets, internal-tools development on well-scoped requests. Each of those is a Phase One configuration today, with its own Phase Two ahead.

Project Fetch's four-task timings, from humans alone to model alone in ten months.

Labs don't always announce the precise moment a workflow becomes capable of running autonomously. They run the experiment, write up the result, and ship it on a Thursday. June 18 is a Thursday. The threshold doesn't post itself on a calendar; you find out it crossed when you read the writeup.

Anthropic's own caveats hold. The lab states plainly that "this doesn't mean that LLMs have now solved robotics. Far from it." Opus 4.7 still struggled to precisely move the beach ball after picking it up. The model defaulted to an outdated object-detection algorithm on one trial. The tasks didn't require the low-level actuation control that's the actually hard part of physical robotics. None of this is a general claim that autonomous AI can replace human judgement on physical work. It's a narrow claim that the lab proved on a specific setup with a specific model on a specific Thursday.

What's worth seeing isn't the robodog. It's that the same kind of test runs on your laptop, in an hour, against whatever your team is shipping next month. Every humans-plus-AI workflow your team currently runs is one model upgrade away from a finding like this one. The labs run their experiments quietly. You have to run yours on the work that actually matters to your team.

What to Do With This

If you use Claude or another frontier model as a copilot, pick one task you do regularly where you currently start, the model contributes, and you finish. This week, try it again with the model running the whole loop and you reviewing the output, just to see what the answer is on the work that actually matters to you.

If you lead a team that uses AI across several workflows, name one workflow currently in the humans-plus-AI configuration. Write down the diff between the assist version and the autonomous version without deploying it. You'll need that diff when one of your vendors quietly closes the gap.

If you own a product roadmap or run a small company, name one process that currently depends on humans plus AI together. Three months from now, run the same check, and either outcome will tell you something you didn't know last quarter.

Want More Than This Newsletter?

Alcreon publishes a daily AI briefing, long-form dossiers, and an analysis feed for the teams actually shipping AI in production. This newsletter is one read out of the full library.

Read the daily feed or browse the editorials.

Also on the Radar

OpenAI's Autonomous Chemist Improves a Real Medicinal-Chemistry Reaction

OpenAI and Molecule.one published results on June 17 from a three-month project where GPT-5.4 connected to an autonomous wet-lab platform ran 10,080 reactions and improved the yield of a difficult Chan-Lam coupling from 16.6% to 25.2%, with reactions clearing 30% rising from 15.6% to 37.5%. Human chemists validated 14 substrate pairs; 11 saw at least twofold gains. The other model-ran-the-loop-alone result this week.

OpenAI Ships a Benchmark Its Own Flagship Scores 36.1% On

OpenAI introduced LifeSciBench on June 17, a 750-task benchmark for real life-science research authored by 173 PhD-level scientists. The lab's strongest entry on its own scoreboard, GPT-Rosalind, cleared just 36.1%, leaving roughly two-thirds of the benchmark beyond what the best public model can solve. A lab shipping a benchmark its flagship loses on is the inverse of the marketing-leaderboard pattern.

LinkedIn X Email