Can LLMs generate Enterprise Quality Code? — Prasenjit Sarkar, Sonar
TL;DR
Benchmark wins do not equal enterprise quality: Sarkar says pass rates from HumanEval, MBPP, and SWE-bench measure functional correctness, but miss maintainability, security, architectural fit, and tech debt.
Sonar's dataset is big enough to expose real tradeoffs: The team evaluated 53-plus model variants on 4,444 distinct Java assignments using SonarQube Enterprise to track bugs, vulnerabilities, cyclomatic complexity, cognitive complexity, and lines of code.
High-scoring models can still be wildly verbose: GPT-5.4 and GPT-5.4 Pro High generated about 1.2 million lines of code for the 4,444 assignments, while older models like GPT-4.0 stayed under 250,000.
Security risk varies sharply across models: Gemini 3.1 Pro High led on SWE-bench pass rate at 84.17%, while Claude Sonnet 4.6 showed the highest security issue density in Sarkar's example, at about 300 issues per million lines of code.
LLM code quality problems come from both training data and model behavior: Sarkar points to mixed-quality open source training corpora, built-in security flaws, hidden logic bugs, limited organizational context, and the probabilistic nature of generation.
Sonar's answer is an agent-centric loop called ACDC: The Guide, Verify, Solve workflow uses context augmentation, pre-commit analysis in 1 to 5 seconds, and a remediation agent that fixes issues, recompiles, re-analyzes, and discards changes that would cause regressions.
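The remediation step described in the last point (fix, recompile, re-analyze, discard regressions) can be sketched roughly as follows. This is a minimal illustration and not Sonar's implementation: `Analysis`, `remediate`, `toy_analyze`, and `toy_fix` are hypothetical names, and the "analyzer" here is a toy that only looks for marker strings.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Analysis:
    """Hypothetical result of compiling and analyzing a piece of code."""
    compiles: bool
    issues: FrozenSet[str]


def remediate(code: str,
              analyze: Callable[[str], Analysis],
              propose_fix: Callable[[str, str], str]) -> str:
    """Try one fix per issue; keep it only if it causes no regression."""
    current = analyze(code)
    for issue in sorted(current.issues):
        if issue not in current.issues:
            continue  # already removed by an earlier accepted fix
        candidate = propose_fix(code, issue)
        result = analyze(candidate)  # stands in for recompile + re-analyze
        # Accept only if it still compiles and the issue set strictly shrinks.
        if result.compiles and result.issues < current.issues:
            code, current = candidate, result
        # Otherwise the candidate is discarded and the code stays unchanged.
    return code


# Toy stand-ins (assumptions for illustration only).
def toy_analyze(code: str) -> Analysis:
    found = frozenset(tag for tag in ("SQLI", "NPE") if tag in code)
    return Analysis(compiles="BROKEN" not in code, issues=found)


def toy_fix(code: str, issue: str) -> str:
    if issue == "NPE":
        # Simulate a bad fix that breaks the build.
        return code.replace("NPE", "BROKEN")
    return code.replace(issue, "ok")


fixed = remediate("query SQLI; deref NPE;", toy_analyze, toy_fix)
print(fixed)  # query ok; deref NPE;
```

In this toy run the fix for `SQLI` is kept, while the fix for `NPE` is thrown away because it would stop the code from compiling. The acceptance rule (strictly fewer issues, still compiles) is one reasonable reading of "discards changes that would cause regressions"; a real pipeline might also rerun tests.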
The Breakdown
Sonar ran 53-plus models across 4,444 Java assignments and found a gap the usual coding benchmarks miss: top models can pass tests while still producing huge amounts of verbose, bug-prone, and security-risky code. Prasenjit Sarkar argues enterprise-ready AI coding needs a second layer of evaluation and a workflow that guides, verifies, and fixes agent-generated code before it lands in production.
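To make the "issues per million lines of code" metric concrete, here is a small sketch of how such a density could be computed. The function name `issues_per_million_loc` and the two sample inputs are illustrative assumptions, not figures from Sonar's dataset; the point is only that a verbose model can have more raw issues yet a lower density.

```python
def issues_per_million_loc(issue_count: int, lines_of_code: int) -> float:
    """Normalize an issue count to issues per one million lines of code."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return issue_count * 1_000_000 / lines_of_code


# Hypothetical numbers, chosen only to illustrate the normalization.
concise_model = issues_per_million_loc(issue_count=75, lines_of_code=250_000)
verbose_model = issues_per_million_loc(issue_count=180, lines_of_code=1_200_000)

print(concise_model)  # 300.0
print(verbose_model)  # 150.0
```

Density and raw volume answer different questions: density says how risky each line is on average, while total lines say how much code a team has to review and maintain. Both matter when comparing models.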