AI Engineer · 19m

Beyond Code Coverage: Functionality Testing with Playwright — Marlene Mhangami, Microsoft

TL;DR

  • AI is exploding code volume, but not automatically productivity — Marlene Mhangami opens with GitHub’s scale jump from 1 billion commits in 2025 to a projected 14 billion in 2026, then argues the real question is whether AI-written code actually helps teams ship better software.

  • Clean codebases are the multiplier for AI gains — citing a Stanford study of 120,000 developers, she says AI helps most when teams already have strong tests, type coverage, documentation, and modularity; otherwise it just “amplifies entropy.”

  • Traditional TDD’s weakness is overfocusing on unit coverage instead of system behavior — she points to critiques from DHH and Ian Cooper, showing how brittle tests tied to implementation details can break on a rename while missing whether the product still works for users.

  • AI-generated tests can be deceptively green — Mhangami warns that models often create self-affirming unit tests, so a fully passing suite may still fail to validate actual behavior in the app.

  • Playwright gives teams a faster, behavior-first TDD loop — her recommended flow is to have agents generate failing end-to-end Playwright tests first, generate code to make them pass, then spend the most human attention in the refactor stage.

  • Her live demo shows Copilot CLI + Playwright testing a toy store like a user would — using a Tail Spin Toys example, the agent writes and runs tests for search (“Furby,” “Simon”), category filters, and price ranges, with Playwright clicking through the UI hands-free.

The Breakdown

The real AI question isn’t output — it’s whether teams get more productive

Marlene Mhangami starts with a striking GitHub stat: 2025 saw about 1 billion commits, the platform’s biggest year ever, and GitHub is now seeing roughly 275 million commits a week — a pace that could hit 14 billion by the end of 2026. A growing share of those commits are co-authored by AI agents, which tees up her core question: all this extra code is impressive, but does it actually make developers more productive?
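The 14 billion figure is consistent with simple arithmetic on the weekly pace quoted above. This is a back-of-the-envelope check that assumes the pace stays flat for 52 weeks; the summary does not say how the projection was actually derived.

```python
# Rough check of the commit projection from the weekly pace.
WEEKLY_COMMITS = 275_000_000  # weekly pace quoted in the talk summary
WEEKS_PER_YEAR = 52           # assumption: the pace holds flat all year

projected = WEEKLY_COMMITS * WEEKS_PER_YEAR
print(f"{projected / 1e9:.1f} billion commits per year")  # prints "14.3 billion commits per year"
```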

Stanford’s lesson: AI helps clean systems and wrecks messy ones faster

She brings in a Stanford study of 120,000 developers, via an earlier AI Engineer talk, to make the point that AI gains depend heavily on how teams use it. Her summary is sharp: clean codebases amplify AI productivity, while unchecked AI usage amplifies entropy. In the example she cites, a team shipped more PRs with AI, but quality dropped and rework ballooned so much that effective output improved by only about 1%.

Why she still cares about standards in a “just ship it” era

From there she argues that if teams want real leverage from AI, they need the boring fundamentals: good test coverage, type coverage, docs, and modularity. She acknowledges this is mildly controversial at a conference where some people prefer to “close their eyes and ship,” but her view is that standardizing clean-code practices is becoming more important, not less.

TDD isn’t dead — but the unit-test obsession is the problem

Mhangami walks through classic red-green-refactor TDD, including Simon Willison’s recent post on red-green TDD, and explains why the workflow fits agentic coding surprisingly well. But she also revisits why TDD got backlash: too many teams equated it with piling up unit tests and code coverage. Citing DHH and Ian Cooper’s “TDD: Where It All Went Wrong,” she argues that tests tied to implementation details are brittle — rename a method like calculateTotal, and the test fails even if the user-visible behavior is still correct.
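The rename problem can be sketched in a few lines of plain Python. Everything here is hypothetical and for illustration only: the Cart class, the helper names (calculate_total is written in snake_case to match Python style), and the prices are not from the talk.

```python
class Cart:
    """Toy cart; prices are integer cents to avoid float rounding."""

    def __init__(self):
        self._lines = []

    def add(self, name, price_cents, qty=1):
        self._lines.append((name, price_cents, qty))

    # Suppose this helper used to be named calculate_total and was renamed
    # during a refactor. What the user sees at checkout did not change.
    def sum_line_items(self):
        return sum(price * qty for _, price, qty in self._lines)

    def checkout(self):
        return {"total_cents": self.sum_line_items(), "items": len(self._lines)}


def check_calls_helper_by_name(cart):
    # Brittle: depends on the helper's old name, not on user-visible output.
    assert cart.calculate_total() == 2500


def check_checkout_total(cart):
    # Behavior-focused: only looks at the result a user would see.
    assert cart.checkout()["total_cents"] == 2500


def run(check, cart):
    """Return 'pass', or the name of the exception the check raised."""
    try:
        check(cart)
        return "pass"
    except Exception as exc:
        return type(exc).__name__


cart = Cart()
cart.add("Furby", 1000, qty=2)
cart.add("Simon", 500)

results = {c.__name__: run(c, cart) for c in (check_calls_helper_by_name, check_checkout_total)}
print(results)
```

The first check fails with an AttributeError even though the checkout total is still correct, while the behavior-level check keeps passing through the rename.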

The AI testing trap: green tests that prove nothing

One of her most practical warnings is that AI often generates self-affirming tests. That means teams can get a fully green suite while never actually validating whether the system behaves correctly. Her fix is to move up a level and test functionality — the observable behavior of the app — using Playwright.
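A minimal sketch of what a self-affirming test can look like, assuming a hypothetical price filter with a deliberate bug. The product data, function names, and bug are invented for illustration and do not come from the demo.

```python
PRODUCTS = [
    {"name": "Furby", "price": 30},  # hypothetical prices
    {"name": "Simon", "price": 20},
    {"name": "Yo-yo", "price": 5},
]


def filter_by_price(products, low, high):
    """Intended behavior: keep products with low <= price <= high.

    Deliberate bug for the example: the upper bound is exclusive.
    """
    return [p["name"] for p in products if low <= p["price"] < high]


def self_affirming_check():
    # The "expected" value is produced by the code under test, so the
    # comparison cannot fail no matter what the function returns.
    expected = filter_by_price(PRODUCTS, 5, 20)
    return filter_by_price(PRODUCTS, 5, 20) == expected


def independent_check():
    # The expected value comes from the requirement, not from the code.
    return filter_by_price(PRODUCTS, 5, 20) == ["Simon", "Yo-yo"]


print("self-affirming check passes:", self_affirming_check())  # True
print("independent check passes:", independent_check())        # False
```

The first check is green while the filter silently drops the 20-unit product; only the check with an independently stated expectation exposes the bug.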

Why Playwright fits AI-assisted, behavior-first development

She introduces Playwright as Microsoft’s open-source browser automation framework for end-to-end testing, with support for Python, TypeScript, C#, and more. The pitch is simple: instead of asserting internals, you script what a user does, such as opening a page, typing “Furby” into search, clicking filters, and checking what happens. In her AI-era version of TDD, the red and green phases get faster because agents can generate both behavioral tests and a rough implementation quickly, leaving humans to focus more on refactoring.
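A user-level search scenario might look roughly like the sketch below, written against Playwright's Python sync API. The placeholder text, the data-testid value, and the function names are assumptions made for illustration; the actual selectors in the demo app are not given in this summary. The Playwright import is deferred so the scenario function can also be exercised with a stand-in page object.

```python
def search_and_read_results(page, base_url, term):
    """Drive the page like a user: open it, search, return product names.

    `page` is expected to behave like a Playwright sync-API Page.
    """
    page.goto(base_url)
    box = page.get_by_placeholder("Search toys")  # hypothetical placeholder text
    box.fill(term)
    box.press("Enter")
    cards = page.get_by_test_id("product-name")   # hypothetical data-testid
    cards.first.wait_for(state="visible")         # times out if nothing matches
    return cards.all_inner_texts()


def run_in_browser(base_url, term="Furby"):
    """Run the scenario in Chromium (needs the playwright package and browsers)."""
    from playwright.sync_api import sync_playwright  # imported lazily

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            names = search_and_read_results(page, base_url, term)
            # Behavior-level assertion: some visible product mentions the term.
            assert any(term.lower() in n.lower() for n in names), names
        finally:
            browser.close()
```

In a red-green loop, a scenario like this would be written first and would fail (the wait times out) until the search feature exists. Running it would mean calling run_in_browser with the address of a locally served app.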

The live demo: Copilot turns a product email into working browser tests

Her demo centers on a fictional toy retailer, Tail Spin Toys, where the product team asks for a search bar plus category and price filters. Using GitHub Copilot CLI and Microsoft’s “Work IQ” to pull requirements from M365 into the terminal, she has the agent inspect the codebase, write failing Playwright tests, then run the app-level tests. The best part is the visual proof: Playwright opens the page, types “Furby” and “Simon,” clicks filter buttons, checks price ranges, and passes the suite while her hands stay off the keyboard.

Her practical playbook: screenshots, headless runs, and one feature per test

She closes with a few battle-tested habits: add Playwright screenshots to PRs, run headless when you don’t need the browser visible, and commit before letting the agent start changing code so it has a clean point of reference. She also recommends one feature, one test, and in Q&A notes that Playwright Agents can help with more complex state-heavy apps, while direct API testing is a good fallback. For now, she says, Playwright is browser-based — though it does handle desktop and mobile browser sizes.
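The screenshot, headless, and viewport habits could be combined along these lines. This is a sketch using Playwright's Python sync API; the viewport dimensions, output directory, and function names are assumptions chosen for illustration, not settings from the talk.

```python
# Hypothetical viewport sizes standing in for "desktop" and "mobile".
VIEWPORTS = {
    "desktop": {"width": 1280, "height": 720},
    "mobile": {"width": 390, "height": 844},
}


def screenshot_path(feature, viewport_name, out_dir="screenshots"):
    """One image per feature and viewport, easy to attach to a PR."""
    return f"{out_dir}/{feature}-{viewport_name}.png"


def capture_feature(base_url, feature, headless=True):
    """Open the page at each viewport size and save a full-page screenshot."""
    from playwright.sync_api import sync_playwright  # imported lazily

    paths = []
    with sync_playwright() as p:
        # headless=True skips the visible browser window; pass False to watch.
        browser = p.chromium.launch(headless=headless)
        try:
            for name, size in VIEWPORTS.items():
                page = browser.new_page(viewport=size)
                page.goto(base_url)
                path = screenshot_path(feature, name)
                page.screenshot(path=path, full_page=True)
                paths.append(path)
                page.close()
        finally:
            browser.close()
    return paths
```

Keeping one feature per capture mirrors the one feature, one test advice: each screenshot name maps back to a single behavior under review.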
