How Top Engineers Are Solving the Code Review Bottleneck
TL;DR
Code review is now the scaling problem: Florian Buetow points to Google's acknowledgment that review is a bottleneck. Google is reportedly at 50% AI-generated code in 2025 and pushing toward 75%, which shifts pressure onto senior engineers and downstream review systems.
The harness can matter more than the model: In Florian's experiments, the same frontier model performed differently depending on the harness, with Claude Code working best at one point and Codex later becoming stronger for implementation work.
Specs alone were not enough, but tests plus feedback worked: His spec-driven attempt failed because models drifted from intent, while a TDD-style setup with behavioral tests and automated stop-hook feedback finally produced reliable implementation in his project.
Guardrails should encode human review comments before code reaches GitHub: Florian recommends local, fast checks like formatters, linters, Semgrep rules, security checks, and architectural tests so agents can self-correct without waiting for a senior engineer in a PR.
Architecture remains firmly human work: He says engineers still need to decide what to build, sketch the system, define module boundaries, and lock interfaces, because losing architectural understanding is how teams slide into cognitive debt and cognitive surrender.
A simple first experiment is to turn repeated PR feedback into Semgrep rules: Examples he gives include banning Python default parameter values and forcing errors to be propagated, then measuring whether the agent needs less babysitting with those rules in place.
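To make the last point concrete, here is a minimal sketch of the two example rules as a fast local check. It uses Python's standard `ast` module as a stand-in rather than Semgrep itself, so it is not Florian's actual setup. The function name `find_violations`, the messages, and the reading of "errors must be propagated" as "every `except` block must contain a `raise`" are all assumptions made for illustration.

```python
import ast


def find_violations(source: str) -> list[tuple[int, str]]:
    """Return (line, message) pairs for two hypothetical team rules."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        # Rule 1 (assumed wording): no default parameter values in signatures.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = node.args
            has_defaults = bool(args.defaults) or any(
                default is not None for default in args.kw_defaults
            )
            if has_defaults:
                violations.append(
                    (node.lineno, f"{node.name}: default parameter values are banned")
                )
        # Rule 2 (assumed wording): an except block must re-raise somewhere.
        elif isinstance(node, ast.ExceptHandler):
            reraises = any(isinstance(child, ast.Raise) for child in ast.walk(node))
            if not reraises:
                violations.append(
                    (node.lineno, "except block swallows the error; propagate it")
                )
    return sorted(violations)


# Hypothetical agent-written snippet that breaks both rules.
SAMPLE = '''
def fetch(url, timeout=10):
    try:
        return download(url, timeout)
    except Exception:
        return None
'''
```

Calling `find_violations(SAMPLE)` reports one finding for the `def` line and one for the `except` line. Run before every commit or after every agent turn, a check like this gives the agent the same feedback a reviewer would otherwise leave in a PR, only seconds after the code is written.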
The Breakdown
The real bottleneck in AI software engineering is no longer writing code but reviewing the flood of it, and Florian Buetow argues the best teams are shrinking human code review by pushing feedback into the agent's environment with tests, guardrails, and architectural constraints. His blunt takeaway: the harness often matters more than the model, and engineers who can define architecture and encode their judgment as rules will have a huge advantage.
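An architectural constraint can be encoded in the same spirit. The sketch below is an assumption-laden illustration, not a description of Florian's tooling: the layer names (`domain`, `storage`, `api`), the `FORBIDDEN` table, and both helper functions are hypothetical. It checks that a module in one layer does not import from a layer it is supposed to stay independent of.

```python
import ast

# Hypothetical layering: top-level packages each layer may NOT import.
FORBIDDEN = {
    "domain": {"api", "storage"},
    "storage": {"api"},
}


def imported_packages(source: str) -> set[str]:
    """Collect the top-level package names a module imports absolutely."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found


def boundary_violations(layer: str, source: str) -> set[str]:
    """Return the forbidden packages that a module in `layer` imports."""
    return imported_packages(source) & FORBIDDEN.get(layer, set())
```

Wrapped in an ordinary unit test that walks the repository, a check like this fails as soon as an agent reaches across a module boundary that a human locked in, so the violation is caught locally rather than in review.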