Full Workshop: Build Your Own Deep Research Agents - Louis-François Bouchard, Paul Iusztin, Samridhi
TL;DR
They built deep research because generic AI writing was unusable — Louis-François Bouchard opens by roasting LinkedIn-style “AI slop” like “delve into the intricacies” and outdated claims like GPT-4 being state of the art, then frames the real goal: automate rigorous research and technical drafting without losing the human storytelling layer.
The core design choice is simple: research should be agentic, writing should not — the team split the system into an exploratory research agent and a constrained writing workflow because research needs to pivot across the web, while writing needs tone, structure, and anti-slop guardrails rather than more autonomy.
Their deep research agent is a thin MCP server with three tools, backed mostly by Gemini — Samridhi shows a FastMCP setup exposing a web-grounded research tool, a YouTube analysis tool, and a compile-research tool, with outputs written to a memory folder and assembled into a final research.md file.
Gemini won on practical workflow advantages, not just model quality — they switched from Perplexity and earlier setups to Gemini because grounding works well, Gemini can directly analyze YouTube URLs without manual transcript extraction, and the free tier matters for students.
The writing system gets its quality from control layers, not one magic prompt — Paul Iusztin uses a structured guideline file, static writing profiles, 3 few-shot LinkedIn examples, and a reviewer-editor loop run 3-4 times to turn research into posts that pass slop detectors and stay close to a specific voice.
They treat LLM evaluation like classic ML, including train/dev/test splits and F1 score — for the writing judge, Paul builds a labeled dataset from 20 real LinkedIn posts, calibrates an LLM-as-judge on dev, checks for overfitting on test, and uses observability in Opik to inspect token cost, traces, and failure modes.
The Breakdown
Why “AI slop” was the starting problem
Louis-François Bouchard kicks off by putting a painfully generic LinkedIn-style AI post on screen and pointing out everything wrong with it: stale facts, vague claims like “most teams,” tired phrases like “rapidly evolving,” and the bigger sin — it says almost nothing. That becomes the workshop’s thesis: if you want useful AI-generated content, you need actual research and much tighter writing constraints.
The autonomy slider: most “agents” should really be workflows
He frames AI engineering as a set of tradeoffs around cost, latency, quality, and privacy, then introduces an “autonomy slider” from prompting to workflows to agents. His point is blunt: clients often ask for flashy multi-agent systems, but most real business needs are better served by simpler workflows with fixed steps, because more autonomy usually means more cost and less control.
Two client stories that shaped their design instincts
Louis contrasts a support-ticket pipeline, which always followed the same classification-routing-drafting-validation sequence, with a Canadian CRM chatbot project that initially asked for a multi-agent setup. In that second case, the team pushed back and used one agent with specialized tools instead, arguing that splitting tightly coupled context across many agents would just introduce reliability problems and more room for errors.
Context rot, delegation, and when multi-agent systems actually help
From there he gets into context windows: not just hard limits, but “context rot,” where performance starts degrading long before the million-token ceiling, often around 200,000 tokens. That leads to the practical takeaway behind multi-agent systems: use them when context gets too crowded, tool count gets too high, or compliance boundaries force separation — not because “CrewAI” sounds impressive.
Their real product: a research agent feeding a deterministic writer
For Towards AI, the actual use case was generating technical course content from a topic, complete with code, images, and strong factual grounding. So they split the system cleanly: a flexible research agent that can search, inspect, pivot, and synthesize, and a more deterministic writer that follows structure and tone — because, as Louis keeps stressing, writing is where humans still matter most.
Samridhi’s MCP build: three tools, one memory folder, lots of practical glue
Samridhi walks through the implementation: a FastMCP server exposes tools, prompts, and resources, while Claude Code acts as the agent harness. The research side is intentionally narrow — one grounded web search tool, one YouTube analysis tool, and one report compiler — with each tool writing artifacts into a memory folder so the final compiler can assemble a cited research.md report.
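The memory-folder pattern described above can be sketched in a few lines. This is a minimal illustration, not the workshop's actual code: the FastMCP tool wiring is omitted, and the file names and artifact layout are assumptions. Each research tool persists its findings as a markdown artifact, and a compile step stitches every artifact into one research.md report.

```python
# Illustrative sketch of the memory-folder pattern: tools write artifacts,
# a compiler assembles them into research.md. Names/paths are hypothetical.
from pathlib import Path

MEMORY = Path("memory")

def save_artifact(name: str, content: str, source_url: str) -> Path:
    """Persist one tool's output so later steps (and the compiler) can reuse it."""
    MEMORY.mkdir(exist_ok=True)
    path = MEMORY / f"{name}.md"
    path.write_text(f"{content}\n\nSource: {source_url}\n", encoding="utf-8")
    return path

def compile_research(title: str) -> str:
    """Assemble every artifact (except the report itself) into one cited report."""
    sections = [p.read_text(encoding="utf-8")
                for p in sorted(MEMORY.glob("*.md")) if p.name != "research.md"]
    report = f"# {title}\n\n" + "\n\n---\n\n".join(sections)
    (MEMORY / "research.md").write_text(report, encoding="utf-8")
    return report

save_artifact("web_search", "Grounded notes on topic X.", "https://example.com")
save_artifact("youtube", "Key claims from the talk.",
              "https://www.youtube.com/watch?v=example")
report = compile_research("Deep Research Report")
print(report.splitlines()[0])  # → "# Deep Research Report"
```

The useful property is that the agent harness never has to hold all research in context at once: each tool call leaves a durable artifact on disk, and only the compiler reads them all.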
Why Gemini became the backbone
A lot of the stack choices come down to convenience that matters in production: Firecrawl for scraping, gitingest for GitHub repos, uv for Python project setup, and Gemini for both grounding and multimodal handling. Samridhi highlights the neat part live: Gemini can take a YouTube URL directly, actually watch the video, and produce a detailed transcript and summary, which lets the agent treat YouTube as a first-class research source.
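At the API level, "give Gemini a YouTube URL directly" means a `generateContent` request whose parts include a `file_data` entry pointing at the video URL, so no transcript ever has to be extracted by hand. The sketch below is a hedged approximation, not the workshop's code: the model name and prompt are illustrative, and actually sending the request requires a real API key.

```python
# Sketch of calling the Gemini API with a YouTube URL as a file_data part.
# Model name and prompt are illustrative; a real key is needed to send it.
import json
import urllib.request

API = "https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent"

def build_youtube_request(video_url: str, prompt: str) -> dict:
    """Build the JSON body: the video rides along as a URL, not a transcript."""
    return {
        "contents": [{
            "parts": [
                {"file_data": {"file_uri": video_url}},
                {"text": prompt},
            ]
        }]
    }

def analyze_youtube(video_url: str, prompt: str, api_key: str,
                    model: str = "gemini-2.0-flash") -> str:
    body = json.dumps(build_youtube_request(video_url, prompt)).encode()
    req = urllib.request.Request(
        API.format(model=model) + f"?key={api_key}",
        data=body, headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["candidates"][0]["content"]["parts"][0]["text"]
```

This is what removes the manual transcript-extraction step the team had with earlier setups: the model consumes the video itself.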
Paul’s writing workflow: profiles, few-shot examples, and a reviewer loop
Paul Iusztin takes over for the writing half and makes the case for not using an agent at all. Instead, he builds a system prompt from a structured guideline file, static writing profiles for structure/terminology/character, and three handpicked LinkedIn examples, then runs a reviewer-editor loop 3-4 times so the model can tighten wording, fix violations like banned slop words, and better match a human-seeming LinkedIn voice.
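The control flow of that reviewer-editor loop can be made concrete with a runnable sketch. In the real system both roles are LLM calls against the guideline file; here the reviewer is reduced to a deterministic banned-word check and `edit_draft` is a stand-in, so only the loop structure is faithful. The banned-word list is illustrative, not the workshop's actual guidelines.

```python
# Sketch of the reviewer-editor loop. Real reviewer/editor are LLM calls;
# these stand-ins only demonstrate the fixed-iteration control flow.
BANNED = {"delve", "intricacies", "rapidly evolving", "game-changer"}

def review(draft: str) -> list[str]:
    """Return guideline violations; an empty list means the draft passes."""
    low = draft.lower()
    return [w for w in BANNED if w in low]

def edit_draft(draft: str, violations: list[str]) -> str:
    """Stand-in editor: in the real loop an LLM rewrites around each violation."""
    for w in violations:
        draft = draft.replace(w, "").replace(w.title(), "")
    return " ".join(draft.split())

def reviewer_editor_loop(draft: str, max_rounds: int = 4) -> str:
    for _ in range(max_rounds):  # the workshop runs this 3-4 times
        violations = review(draft)
        if not violations:
            break
        draft = edit_draft(draft, violations)
    return draft

post = reviewer_editor_loop("Let's delve into the intricacies of agents.")
print(review(post))  # → [] (no remaining violations)
```

The design point is that the loop is a workflow with a fixed iteration budget, not an agent deciding when it is done, which keeps cost and behavior predictable.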
Observability and evals: treat it like ML, not prompt vibes
The final section is about operational discipline: they use Opik to inspect traces, token usage, latency, and tool calls, because debugging agents from terminal logs alone is miserable. Paul then builds an LLM judge from 20 real LinkedIn posts, labels pass/fail with critiques, splits the dataset into train/dev/test, and measures F1 score — basically arguing that if you’re serious about AI systems, you need evaluation loops, not just “this looked good once.”
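Scoring the judge with F1 works exactly like scoring any binary classifier: compare the judge's pass/fail verdicts against human labels on a held-out split. The sketch below shows the metric itself; the labels are toy data, not the 20-post dataset from the workshop.

```python
# F1 for an LLM-as-judge, computed like any binary classifier.
# Toy labels; the real dataset is 20 human-labeled LinkedIn posts.
def f1_score(labels: list[bool], preds: list[bool]) -> float:
    tp = sum(l and p for l, p in zip(labels, preds))        # judge and human agree: pass
    fp = sum((not l) and p for l, p in zip(labels, preds))  # judge passes a bad post
    fn = sum(l and (not p) for l, p in zip(labels, preds))  # judge fails a good post
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Human labels on a dev split vs. the judge's verdicts.
human = [True, True, False, True, False]
judge = [True, False, False, True, True]
print(round(f1_score(human, judge), 3))  # → 0.667
```

Calibrating the judge's prompt on dev and only then measuring on test is what catches the judge overfitting to the examples it was tuned against, which is the train/dev/test discipline the section argues for.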