Alcreon
AI Engineer · 40m

Bending a Public MCP Server Without Breaking It — Nimrod Hauser, Baz

TL;DR

  • Vanilla MCP tools can tank agent performance — Nimrod Hauser showed a baseline run using Playwright’s public MCP server through LangChain’s load_mcp_tools, and the agent hallucinated a fake buzz.co/spec-reviewer page, returned a false failure, and even botched the screenshot.

  • Tool descriptions are not neutral; they steer behavior — Playwright’s generic descriptions like “Press a key on the keyboard” were too shallow for Baz’s spec reviewer, so Hauser wrapped tools with custom guidance like “prefer accessibility snapshot before click,” which materially improved navigation.

  • Curating the tool list reduces confusion — By cutting Playwright’s tool set from 21 to 16 and removing irrelevant actions like resize, drag, and arbitrary code execution, Baz gave the agent fewer bad options and a simpler context window.

  • Deterministic guardrails matter for security, not just quality — Hauser added path validation around screenshot saving so the agent could only write inside an approved screenshots directory, blocking rogue paths while returning a helpful retry message instead of crashing the whole flow.

  • You can compose new tools from old ones to create better affordances — Baz built an “evidence screenshot” tool on top of the regular screenshot function, with a separate description telling the agent to use it at the end of review flows and include the ticket number in the filename.

  • Some steps should leave the agentic loop entirely — Login was handled deterministically by injecting JWT tokens into browser local storage before handing control to the agent, because auth is mandatory, sensitive, and too brittle to trust to open-ended reasoning.

The Breakdown

The MCP Server Is “On Fire,” and That’s the Point

Nimrod Hauser opens with a joke that their public MCP server has “caught on fire,” which becomes the running metaphor for the whole talk: third-party tools are useful until they quietly wreck your workflow. His setup is grounded in Baz’s actual product work—AI-powered code review and spec review for R&D teams—so this isn’t a theoretical MCP rant.

What Third-Party Agent Tools Really Are

He strips the concept down nicely: tools are just callable functions plus descriptions, and that description is what tells the model when and how to use them. The key tension is that third-party tools, whether from an MCP server, a library, or copy-pasted code, are written for everyone—not for your architecture, your workflow, or your risks.
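Stripped to its essentials, a tool in this sense can be modeled as a small struct: a callable plus the description the model reads. A minimal sketch (the names and schema here are illustrative, not Playwright's or MCP's actual types):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str                 # what the model invokes
    description: str          # the text that steers when/how the model uses it
    func: Callable[..., str]  # the actual callable behind the tool

def press_key(key: str) -> str:
    # Stand-in for a real browser action.
    return f"pressed {key}"

# A generic, written-for-everyone description, like Playwright's defaults.
press_key_tool = Tool(
    name="press_key",
    description="Press a key on the keyboard",
    func=press_key,
)
```

The description field is the whole leverage point: the function body is identical whether the description is generic or tailored, but the model's behavior is not.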

The Spec Reviewer Demo: Ticket, Design, Browser, Verdict

The toy use case is Baz’s spec reviewer, which reads requirements from places like Jira or Linear, inspects Figma designs multimodally, then launches Playwright to compare the live implementation against the spec. In the demo, the agent is supposed to verify a configuration drawer inside Baz’s “agents” tab and capture evidence with screenshots.

Baseline Failure: 21 Playwright Tools and One Hallucinated Page

Version 0 is the pure out-of-the-box setup: LangChain loads Playwright’s MCP tools, all 21 of them, with generic descriptions like “Resize the browser window” or “Close the page.” The result is a miss—the agent hallucinates a nonexistent buzz.co/spec-reviewer route, returns a failed verdict, and points to a broken 404 screenshot, which is exactly the kind of “unpredictability at scale” Hauser warns about.

Fix #1 and #2: Curate the Tools, Then Rewrite Their Descriptions

First, Baz simply excludes tools the spec reviewer does not need, like browser resizing, dragging, and running arbitrary code in the page, shrinking the set from 21 tools to 16. Then Hauser wraps the remaining tools with Baz-specific descriptions, especially telling the agent to use Playwright’s accessibility snapshot tool before actions like click or hover, because that text-based structure gives the model a much better mental map of the page.
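Both fixes reduce to straightforward transformations over the tool set. A sketch of the pattern, with hypothetical tool names and guidance text (Playwright's real MCP tool names may differ):

```python
# Hypothetical tool registry: name -> description.
RAW_TOOLS = {
    "browser_click": "Click an element on the page.",
    "browser_snapshot": "Capture an accessibility snapshot of the page.",
    "browser_resize": "Resize the browser window.",
    "browser_drag": "Drag an element to a new position.",
}

# Fix #1: exclude tools the spec reviewer never needs.
EXCLUDED = {"browser_resize", "browser_drag"}

# Fix #2: append workflow-specific guidance to the descriptions that matter.
GUIDANCE = {
    "browser_click": "Prefer taking an accessibility snapshot first to map the page.",
}

def adapt(tools: dict[str, str]) -> dict[str, str]:
    """Curate the tool list, then rewrite the surviving descriptions."""
    curated = {n: d for n, d in tools.items() if n not in EXCLUDED}
    return {n: f"{d} {GUIDANCE[n]}" if n in GUIDANCE else d
            for n, d in curated.items()}
```

The agent only ever sees the output of `adapt`, so the upstream server's generic defaults never reach the context window unmodified.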

Fix #3: Add Hard Guardrails Where the Agent Can’t Be Trusted

The talk then shifts from context engineering into enforcement. Around screenshot saving, Baz validates that any path chosen by the agent stays inside an approved screenshots root, because in real systems an unconstrained tool can become a security issue—especially in multi-tenant setups where agents don’t understand all your data boundaries. Crucially, the wrapper surfaces an agent-friendly error message instead of blowing up the whole run, nudging the model to retry with a compliant path.
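The guardrail itself is plain deterministic code wrapped around the tool call. A minimal sketch, assuming a fixed approved root (the directory path and message wording here are illustrative, not Baz's implementation):

```python
from pathlib import Path

# Approved root for screenshot writes; the actual location is an assumption.
SCREENSHOTS_ROOT = Path("/app/screenshots")

def save_screenshot(path: str) -> str:
    candidate = Path(path).resolve()
    root = SCREENSHOTS_ROOT.resolve()
    # Deterministic check: the resolved path must live under the approved
    # root, so "../" tricks and absolute rogue paths are both rejected.
    if candidate != root and root not in candidate.parents:
        # Agent-friendly failure: return a retry hint instead of crashing
        # the whole run.
        return (f"Error: screenshots can only be saved under {root}. "
                f"Please retry with a path inside that directory.")
    # ... hand off to the real screenshot tool here ...
    return f"Saved screenshot to {candidate}"
```

Because the rejection comes back as an ordinary tool result, the model sees the constraint in-context and can self-correct on the next step rather than terminating the review.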

Fix #4 and #5: New Composite Tools and Deterministic Login

Baz also creates a new “evidence screenshot” tool on top of the regular screenshot function, using a separate description to teach the model when to choose it and how to name files with ticket IDs. And for login, Hauser takes the opposite route: no agentic flexibility at all. Baz directly calls Playwright functions, injects JWTs into local storage, clicks through auth, and only then hands the browser to the agent—because login is both mandatory and too fragile to improvise.
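The composite-tool idea is worth seeing in miniature: the new tool calls the same underlying function but carries its own description and naming convention. A sketch under assumed names (the description text and filename scheme are illustrative; the deterministic login step lives outside the agent loop entirely, so it would be plain code, not a tool at all):

```python
# The description is the affordance: it tells the model when to pick this
# tool over the plain screenshot (wording here is hypothetical).
EVIDENCE_DESCRIPTION = (
    "Use this at the end of a review flow to capture final evidence. "
    "The ticket number is included in the filename automatically."
)

def take_screenshot(path: str) -> str:
    # Stand-in for the regular screenshot tool.
    return f"saved {path}"

def evidence_screenshot(ticket_id: str) -> str:
    """Composite tool: same underlying function, new affordance."""
    return take_screenshot(f"evidence_{ticket_id}.png")
```

Nothing new is implemented at the browser layer; the composition exists purely to give the model a clearer choice and to bake the naming rule into code instead of hoping the agent remembers it.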

The Final Run Works—and the Bigger Lesson Lands

With all five changes in place, the agent successfully logs in, finds the configuration drawer, returns a pass verdict, and saves a correctly named screenshot as evidence. Hauser’s wrap-up is refreshingly practical: there is no one-size-fits-all recipe here, just a series of levers—curation, wrapping, guardrails, composition, and deterministic escape hatches—that let you bend public MCP servers toward your use case without letting them break your app.