AI Engineer · 18m

Your Coding Agent Should Do AI System Engineering — Ben Burtenshaw, Hugging Face

TL;DR

  • Ben Burtenshaw’s core claim is that coding agents are ready for “hard mode” AI engineering — not just app code, but CUDA kernels, LLM fine-tuning, and even multi-agent research workflows that touch real compute, benchmarking, and training jobs.

  • Custom CUDA kernels are no longer off-limits for agents — Ben points to GPU Mode hackathons, AMD events, and KernelBench as evidence, then shows Hugging Face Kernels as the missing distribution layer so generated kernels can actually be packaged, benchmarked, and used.

  • The bottleneck in deep learning is often memory, not math. His H100 example makes it concrete: roughly a petaFLOP/s of compute against about 3 TB/s of memory bandwidth. That gap is why techniques like FlashAttention matter: they increase arithmetic intensity and “keep the GPUs warm.”

  • Skills are the practical trick that turns zero-shot engineering into few-shot engineering — Hugging Face bakes file-based, versioned skills into projects so agents can open benchmark scripts, examples, and usage patterns on demand, and Ben says this helped generate a Qwen 3 8B H100 kernel with a 94% speedup.

  • Hugging Face is positioning the Hub as infrastructure for agentic systems engineering — kernels, HF CLI skills, Jobs, Papers, Trackio, storage, and compute are all presented as open primitives that agents can orchestrate rather than black-box APIs they can’t see behind.

  • The most ambitious demo is an “automated AI lab” split into researcher, planner, worker, and reporter agents — inspired by Andrej Karpathy’s auto-research work, Ben’s AutoLab fans out literature search, hypothesis generation, code changes, training runs, and dashboard reporting into a parallel loop that can run for hours.

The Breakdown

Agents Have Crossed the Acceptance Threshold

Ben opens by saying the argument is basically over: coding agents have been “accepted,” and the real question now is how engineers stay current. His answer is to move closer to the silicon and point agents at tougher problems such as AI systems engineering and ML engineering. He frames the talk as three escalating “video game bosses.”

Boss One: Let Agents Write CUDA Kernels

He starts with the once-heretical idea that agents can write optimized CUDA kernels, something many people considered too hardware-specific and too messy to benchmark. Ben says that assumption has largely broken, citing GPU Mode hackathons, the AMD hackathon, and KernelBench as proof that agents can produce valid, optimized kernels.

Why Kernels Matter: GPUs Wait on Memory

Ben pauses to explain the mechanics: kernels are the units of actual GPU work when running AI models, and optimizing them is about compute, memory, and overhead. The memorable punchline is that most people guess compute is the bottleneck, but on a modern GPU like the H100, memory often wins. The chip can perform about a petaFLOP per second (roughly 10^15 floating-point operations), yet it moves data to and from memory at only about 3 TB/s, so the goal is to “keep the GPUs warm” by doing more math per byte read, as FlashAttention does.
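The memory argument can be made concrete with a back-of-envelope roofline calculation. The sketch below uses only the rounded figures quoted in the talk (about 10^15 FLOP/s and about 3 TB/s); the function names and the example intensity of 0.25 FLOPs per byte are illustrative assumptions, not measurements of any real kernel.

```python
# Rounded H100 figures as quoted in the talk (not datasheet-exact).
PEAK_FLOPS = 1e15        # ~1 petaFLOP/s of compute
PEAK_BANDWIDTH = 3e12    # ~3 TB/s of memory bandwidth, in bytes/s


def ridge_point(peak_flops, peak_bandwidth):
    """FLOPs a kernel must perform per byte moved before compute,
    rather than memory, becomes the limiting factor."""
    return peak_flops / peak_bandwidth


def attainable_flops(intensity, peak_flops=PEAK_FLOPS,
                     peak_bandwidth=PEAK_BANDWIDTH):
    """Simple roofline model: throughput is capped by the lower of the
    compute ceiling and the memory ceiling (intensity * bandwidth)."""
    return min(peak_flops, intensity * peak_bandwidth)


ridge = ridge_point(PEAK_FLOPS, PEAK_BANDWIDTH)
print(f"ridge point: {ridge:.0f} FLOPs per byte")

# Hypothetical elementwise kernel: about 1 FLOP for every 4 bytes moved.
low_intensity = attainable_flops(0.25)
print(f"intensity 0.25: {low_intensity / PEAK_FLOPS:.4%} of peak compute")
```

With these rounded numbers a kernel needs on the order of a few hundred FLOPs per byte to saturate compute, which is why fusing operations to reuse data already on chip tends to pay off more than shaving arithmetic.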

Hugging Face Kernels and Skills as Agent Infrastructure

The practical problem isn’t just generating kernels — it’s distributing them, describing compatibility, and plugging them into inference. Ben presents Hugging Face Kernels as a Hub-native repo format with metadata about hardware and CUDA versions, then explains “skills” as file-based context that lets agents pull examples, benchmarking scripts, and usage docs when needed, turning a zero-shot task into a few-shot one.
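To illustrate why compatibility metadata matters for distribution, here is a minimal sketch of choosing a kernel build for a given machine. Everything in it is hypothetical: the `KernelBuild` record, its field names, the build names, and the matching rule are assumptions made for illustration and do not describe the actual Hugging Face Kernels repo format. The rule that a build works with any installed CUDA version at least as new as the one it was built for is also a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelBuild:
    """Hypothetical metadata for one compiled kernel variant."""
    name: str
    compute_capability: str  # e.g. "9.0" for H100-class GPUs
    cuda_version: str        # CUDA version the build targets, e.g. "12.4"


def version_tuple(version):
    """Turn '12.4' into (12, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def pick_build(builds, compute_capability, cuda_version):
    """Return the newest build that matches the GPU architecture and does
    not require a newer CUDA than the one installed, or None."""
    installed = version_tuple(cuda_version)
    candidates = [
        b for b in builds
        if b.compute_capability == compute_capability
        and version_tuple(b.cuda_version) <= installed
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda b: version_tuple(b.cuda_version))


# Hypothetical catalogue of builds for one kernel.
builds = [
    KernelBuild("attn-sm90-cu121", "9.0", "12.1"),
    KernelBuild("attn-sm90-cu126", "9.0", "12.6"),
    KernelBuild("attn-sm80-cu121", "8.0", "12.1"),
]

chosen = pick_build(builds, compute_capability="9.0", cuda_version="12.4")
print(chosen.name if chosen else "no compatible build")
```

The design point is that an agent (or a loader) can only make this choice automatically if the metadata is published alongside the binary, which is the gap the talk says a Hub-native format fills.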

Measuring Whether the Skills Actually Work

He shows this isn’t just theory by describing a benchmark in which a generated kernel for Qwen 3 8B on an H100 achieved a 94% speedup, though he’s careful to say this is not a state-of-the-art or universal result. His point is more tactical: there’s low-hanging fruit in hardware/model mismatches. The open-source Upskill tool then helps compare which models use a skill best, and he calls out GPT-OSS, Kimi, and Haiku as examples of the trade-off between accuracy and token cost.
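The summary does not say how the 94% figure was computed. One common convention is speedup = (baseline time / candidate time − 1) × 100, and the sketch below assumes that definition. The two functions being timed are plain-Python stand-ins rather than GPU kernels, and the helper names are illustrative; the point is the shape of the harness: check correctness first, warm up, then compare medians.

```python
import statistics
import time


def median_runtime(fn, warmup=3, repeats=10):
    """Median wall-clock time of fn() over several runs, after warm-up."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup_percent(baseline_seconds, candidate_seconds):
    """Assumed convention: 94.0 means the candidate is 1.94x as fast."""
    return (baseline_seconds / candidate_seconds - 1.0) * 100.0


N = 50_000


def baseline():
    # Stand-in "reference kernel": sum of squares computed with a loop.
    return sum(i * i for i in range(N))


def candidate():
    # Stand-in "optimized kernel": closed form for the same sum.
    return (N - 1) * N * (2 * N - 1) // 6


# A faster kernel that returns different results is not a speedup.
assert baseline() == candidate()

t_base = median_runtime(baseline)
t_cand = median_runtime(candidate)
print(f"speedup: {speedup_percent(t_base, t_cand):.0f}%")
```

On a real GPU the harness would also need to synchronize the device before reading the clock, because kernel launches are asynchronous and an unsynchronized timer measures only launch overhead.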

Boss Two: Zero-Shot Fine-Tuning on the Hub

The second boss is much simpler: tell an agent to fine-tune a model like Qwen 3 0.6B on a chain-of-thought dataset, and let the Hugging Face stack handle the rest. Ben moves quickly here, pointing people to his colleague Merve’s deeper talk and to blog posts showing Claude- and Unsloth-based workflows. He emphasizes that this is already integrated with Hub compute and often comes with free credits to try.

Boss Three: A Multi-Agent “AutoLab” for Research

The big finale is AutoLab, inspired by Andrej Karpathy’s recent auto-research project that had Claude iteratively improve nanoGPT training runs. Ben’s twist is to split the work across specialized agents: a researcher scouts papers via HF Papers, a planner turns ideas into a queue, workers implement training-script changes and launch HF Jobs, and a reporter monitors everything in Trackio.
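The division of labour described above can be sketched as a small pipeline. This is an assumption-heavy illustration, not AutoLab’s code: each "agent" is a plain function, the ideas are made-up strings, and the worker only records that a job would be submitted instead of calling HF Papers, HF Jobs, or Trackio.

```python
from concurrent.futures import ThreadPoolExecutor


def researcher():
    """Stand-in for the paper-scouting agent: returns candidate ideas."""
    return [
        "raise learning rate",
        "add warmup",
        "Raise learning rate ",   # duplicate with different formatting
        "swap optimizer",
    ]


def planner(ideas):
    """Normalize ideas and drop duplicates, preserving first-seen order."""
    seen = set()
    queue = []
    for idea in ideas:
        key = idea.strip().lower()
        if key not in seen:
            seen.add(key)
            queue.append(key)
    return queue


def worker(idea):
    """Stand-in for an agent that would edit the training script and
    launch a job; here it only records the submission."""
    return {"idea": idea, "status": "submitted"}


def reporter(results):
    """Stand-in for the dashboard agent: summarizes what was launched."""
    ideas = ", ".join(r["idea"] for r in results)
    return f"{len(results)} experiments: {ideas}"


def run_round(max_workers=2):
    """One loop iteration: research, plan, fan out to workers, report."""
    queue = planner(researcher())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, queue))  # map keeps queue order
    return reporter(results)


print(run_round())
```

Keeping the planner as the single place where duplicates are filtered means parallel workers never need to coordinate with each other, which is one plausible reason to separate the roles.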

Open Primitives Beat Black Boxes

Ben walks through how this looks in practice inside OpenCode: agents use templates, branch off experiments, review stale or duplicate ideas, and run for hours while Trackio collects metrics, events, and warnings. His closing takeaway is blunt: agents work best with open primitives like Trackio’s Parquet-backed data layer and Hub-native storage and compute. Opaque abstractions create ceilings, while well-exposed systems let agents actually engineer.
