AI Engineer · May 6, 2026 · 18m

Missions: Multi-Agent Systems That Ship for Days — Luke Alvoeiro, Factory

TL;DR

The real bottleneck is human attention, not model intelligence — Luke Alvoeiro argues today’s models are already smart enough to tackle a backlog of 50 tasks, but humans can only supervise a few at a time, which is why Factory built a system that can keep shipping for hours or even days.
Factory’s “missions” package multi-agent patterns into one long-running workflow — instead of a single coding session, missions combine delegation, creator-verifier, broadcast, and negotiation across three roles: orchestrator, workers, and validators.
Validation is the whole game, and it starts before any code exists — missions write a “validation contract” during planning, sometimes with hundreds of assertions, so tests don’t just rubber-stamp implementation decisions after the fact.
Parallel agents sound fast but usually collide in software work — after trying it, Factory found the coordination overhead from conflicting changes and duplicated work outweighed the gains, so missions run features serially with targeted parallelism only for read-only tasks like search and code review.
The system’s longest mission ran 16 days, and most wall-clock time wasn’t spent on tokens — it was spent in behavioral validation, where QA-style agents actually launch the app, click through flows, fill forms, and verify end-to-end behavior.
Model choice becomes a new engineering skill Luke calls “droid whispering” — planning, implementation, and validation each want different model strengths, and Factory treats model-agnostic routing as a structural advantage, even using open-weight models successfully when the workflow scaffolding is strong.

Summary

From Goose to Factory’s bet on autonomous software work

Luke Alvoeiro opens with a blunt thesis: software engineering is no longer bottlenecked by intelligence, it’s bottlenecked by human attention. He frames missions as the answer to that mismatch: humans decide what to build, then an agent system keeps executing while they go do something else. He also grounds it in his own lineage, from dev tools at Block to Goose, the open-source coding agent later donated to the Agentic AI Foundation.

A simple map through the multi-agent chaos

He says the multi-agent landscape is “a bit of a mess,” then offers a cleaner taxonomy of five frontier patterns: delegation, creator-verifier, direct communication, negotiation, and broadcast. The useful distinction is that each solves a different coordination problem — from sub-agents doing discrete tasks to validators acting like fresh reviewers without the builder’s “cost bias.” Broadcast gets less hype, but he calls it essential for long-running coherence.

Missions: one goal, three roles, days of execution

Factory’s system combines four of those patterns into a single workflow it calls a mission. The architecture has an orchestrator for planning, workers for implementation, and validators for verification, with the orchestrator producing a plan, milestones, and a “validation contract” that defines what done means before any coding starts. That’s the key move: this isn’t one agent with a giant context window, it’s an ecosystem held together by structured handoffs and shared state.
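One way to picture the three-role loop is the minimal sketch below. It is not Factory's implementation. The names `MissionPlan`, `Milestone`, and `run_mission`, the `max_attempts` limit, and the "fix:" prefix for corrective work are all hypothetical, chosen only to illustrate the idea that the contract exists before any worker runs and that validator findings become the next round of work.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Milestone:
    name: str
    features: List[str]


@dataclass
class MissionPlan:
    goal: str
    milestones: List[Milestone]
    validation_contract: List[str]  # assertions written during planning


def run_mission(
    plan: MissionPlan,
    implement: Callable[[str], None],                   # stands in for a worker
    validate: Callable[[Milestone, List[str]], List[str]],  # stands in for validators
    max_attempts: int = 3,                              # hypothetical retry limit
) -> List[Tuple[str, int, List[str]]]:
    """Run milestones in order; validator failures become corrective work."""
    if not plan.validation_contract:
        raise ValueError("validation contract must exist before coding starts")
    log = []
    for milestone in plan.milestones:
        work = list(milestone.features)
        for attempt in range(1, max_attempts + 1):
            for item in work:
                implement(item)
            failures = validate(milestone, plan.validation_contract)
            log.append((milestone.name, attempt, failures))
            if not failures:
                break
            # The orchestrator scopes follow-up work from validator findings.
            work = [f"fix: {failure}" for failure in failures]
        else:
            raise RuntimeError(f"milestone {milestone.name!r} never passed validation")
    return log
```

In this sketch the workers and validators are plain callables, so the loop can be exercised with stubs before any real agent is wired in.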

Why post-hoc tests aren’t enough

Luke describes a familiar failure mode: an agent writes code, then writes tests that pass, but those tests just confirm the decisions the agent already made. His line is memorable: tests written after implementation don’t catch bugs, they confirm decisions. Missions try to break that loop by creating the validation contract up front, then running both a scrutiny validator — tests, lint, type checks, dedicated review agents — and a user-testing validator that actually boots the app and interacts with it like a QA engineer.
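The contract-first idea can be sketched as follows, under stated assumptions. The `Assertion` type, the two `kind` labels, and the rule that an assertion with no check counts as failing are illustrative choices, not details from the talk. The point is only that assertions are fixed before implementation, so later checks cannot quietly redefine what "done" means.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)  # frozen: assertions cannot be edited after planning
class Assertion:
    id: str
    description: str
    kind: str  # "scrutiny" or "user_testing" (hypothetical labels)


def validate(
    contract: Tuple[Assertion, ...],
    checks: Dict[str, Callable[[], bool]],
) -> Dict[str, List[str]]:
    """Return failing assertion ids, grouped by validator kind.

    An assertion with no registered check is treated as failing, so
    implementation cannot pass by simply not testing something.
    """
    failures: Dict[str, List[str]] = {"scrutiny": [], "user_testing": []}
    for assertion in contract:
        check = checks.get(assertion.id)
        passed = bool(check()) if check is not None else False
        if not passed:
            failures[assertion.kind].append(assertion.id)
    return failures
```

Here the scrutiny checks would wrap things like tests, lint, and type checks, while the user-testing checks would wrap an agent driving the running app. Both are reduced to boolean callables for the sake of the example.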

The handoff discipline that keeps a 16-day run from drifting

For long missions, memory can’t be trusted, so workers must write down exactly what happened: what they completed, what they skipped, what commands they ran, the exit codes, what issues they found, and whether they followed the orchestrator’s procedures. Luke says that’s how the system “self-heals” at milestone boundaries: it scopes corrective work from explicit records instead of hoping the next agent remembers the past. That structure is what let their longest mission run for 16 days, and the team believes 30 days is possible.
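A handoff record with those fields might look like the sketch below. The field names mirror the list above, but the `Handoff` and `CommandRun` types and the `scope_corrective_work` function are hypothetical, and the task wording is invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CommandRun:
    command: str
    exit_code: int


@dataclass
class Handoff:
    worker: str
    completed: List[str]
    skipped: List[str]
    commands: List[CommandRun]
    issues: List[str]
    followed_procedures: bool


def scope_corrective_work(handoffs: List[Handoff]) -> List[str]:
    """Derive follow-up tasks from written records, not from agent memory."""
    tasks: List[str] = []
    for h in handoffs:
        tasks += [f"finish skipped item: {s}" for s in h.skipped]
        tasks += [
            f"investigate failing command: {c.command} (exit {c.exit_code})"
            for c in h.commands
            if c.exit_code != 0
        ]
        tasks += [f"resolve issue: {i}" for i in h.issues]
        if not h.followed_procedures:
            tasks.append(f"re-check work from {h.worker}: procedures not followed")
    return tasks
```

Because every corrective task traces back to a field in a record, a fresh agent at a milestone boundary needs only the handoffs to know what is left to do.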

Why serial beats parallel in real codebases

The obvious idea is to throw 10 agents at the problem, but Luke says that fell apart in software development because agents step on each other’s changes, duplicate work, and make inconsistent architectural calls. Missions therefore run features serially, while only parallelizing read-only work like code search, API research, and validator code review. It looks slower on paper, but he says the lower error rate compounds over multi-day runs.
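The serial-with-targeted-parallelism shape can be sketched with Python's standard `concurrent.futures`. The function names and the `max_workers=4` setting are assumptions for illustration. Read-only tasks are modeled as zero-argument callables that return findings, and only the single `implement` call per feature is allowed to change anything.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List


def run_feature(
    feature: str,
    read_only_tasks: List[Callable[[], Any]],
    implement: Callable[[str, List[Any]], Any],
) -> Any:
    # Read-only work (search, research, review) cannot conflict, so fan it out.
    with ThreadPoolExecutor(max_workers=4) as pool:
        findings = list(pool.map(lambda task: task(), read_only_tasks))
    # Writes happen in exactly one place, after the parallel phase is done.
    return implement(feature, findings)


def run_features_serially(
    features: List[str],
    read_only_tasks_for: Callable[[str], List[Callable[[], Any]]],
    implement: Callable[[str, List[Any]], Any],
) -> List[Any]:
    """Features run one at a time, so no two writers touch the codebase at once."""
    return [
        run_feature(feature, read_only_tasks_for(feature), implement)
        for feature in features
    ]
```

`pool.map` returns results in the order the tasks were submitted, so the findings handed to `implement` are deterministic even though the tasks run concurrently.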

Mission Control, model routing, and a Slack clone as proof

Because chat UIs break down over days-long jobs, Factory built Mission Control so you can glance at progress, budget burn, active workers, validator findings, and course corrections — or just go hang out with friends. He also argues there’s no single best model for planning, implementation, and validation, calling the skill of assigning them “droid whispering,” and says using a different provider for validation can reduce shared-model bias. In their Slack clone example, validation never passes on the first try, about 60% of time and tokens go to implementation, roughly 50% of final lines are tests, and 90% of code ends up covered.
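Per-role model routing could be expressed as a small table plus a sanity check, as in this sketch. The `ModelChoice` type, the `check_routing` function, and the provider and model names used with it are placeholders, not real products or Factory's configuration. The check encodes only the one rule mentioned above: validating with the same provider that implemented the code risks shared-model bias.

```python
from dataclasses import dataclass
from typing import Dict, List

ROLES = ("planning", "implementation", "validation")


@dataclass(frozen=True)
class ModelChoice:
    provider: str  # placeholder provider name
    model: str     # placeholder model name


def check_routing(routing: Dict[str, ModelChoice]) -> List[str]:
    """Require a model for every role and warn about shared-provider validation."""
    missing = [role for role in ROLES if role not in routing]
    if missing:
        raise ValueError(f"no model assigned for: {', '.join(missing)}")
    warnings: List[str] = []
    if routing["validation"].provider == routing["implementation"].provider:
        warnings.append(
            "validation and implementation share a provider; "
            "shared-model bias is more likely"
        )
    return warnings
```

A routing table that sends validation to a different provider than implementation returns no warnings, while one that reuses the same provider gets flagged without being rejected.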

Built to improve with models, not get replaced by them

Luke closes with the “bitter lesson” anxiety every multi-agent builder has: what if the next model release obsoletes your architecture? Factory’s answer was to keep orchestration mostly in prompts and skills — about 700 lines of text, plus thin deterministic bookkeeping — so the system gets better as models do. His final economic claim is simple: if five engineers could once sustain 10 workstreams, missions might push that to 30, while humans stay focused on architecture and product decisions instead of babysitting execution.
