Joe Reis · May 7, 2026 · 55m

AI Agents Can't Fix Data - Josh Wills on Where AI Breaks in Data Engineering

TL;DR

The hard part of foundation models isn’t training, it’s data. Josh Wills says you can build a state-of-the-art model with GPUs, Meta’s TorchTitan, and open datasets like Hugging Face FineWeb, but the real secret sauce is curating the right data.
Spark is still weirdly irreplaceable for giant shuffles. Wills left “raw” Spark in 2017 and returned in 2024 to find almost nothing had changed, because Spark still beats newer systems on huge, ugly join/shuffle workloads across many nodes.
Enterprises may be better off training their own models than renting intelligence forever. For firms with proprietary domain data, Wills argues a one-time $20 million model build can be smarter and cheaper than paying OpenAI, Anthropic, or Google tens of millions per month in perpetuity.
AI agents are useful, but they consistently fail at one critical engineering move: admitting they don’t understand the problem. Wills describes benchmarking bottlenecks where agents confidently diagnose the wrong cause, fix the wrong thing, and produce no improvement because they jump to conclusions too early.
Data engineering’s core problem hasn’t changed in 25 years. Whether it’s Kimball-style warehousing or GPU data loaders, the enduring tension is still how much work to do up front versus online at runtime, which Wills calls a fundamentally “wicked problem.”
The near-term risk isn’t superintelligence, it’s people vibe-coding expensive garbage. Wills shares an anecdote of a giant Spark pipeline generated by a model that technically worked but cost “a couple hundred thousand” to run, which he attributes to how little training data exists for well-engineered large-scale data systems.

Summary

From Kimball to multimodal petabytes

Josh Wills opens by sketching the arc of a 25-year career: Cloudera in the big data era, building Slack’s early data infra and search indexing pipeline, writing the DuckDB adapter for dbt, and now working at Datology on data curation for foundation models. The scale jump is real: Datology, a tiny company, holds roughly 6–7 petabytes of data. To him, though, it feels less like a new field than “big data all over again,” just finally with a killer use case in multimodal model training.

Why he came back: doing the work beats sounding like a VC

Wills says the move back into hands-on engineering was surprisingly easy, partly because he came up in a pre-SQL-dominant era of raw MapReduce, Spark, and odd-shaped data. What pushed him back was more personal: after “retiring,” angel investing, and hearing himself talk at a Slack reunion, he realized he sounded like a venture capitalist — and hated it. He loves the pain and intimacy of real engineering work, and says that’s what made him a good speaker in the first place.

Spark never died — and that’s both impressive and depressing

His funniest observation from returning in 2024: nothing had really changed since 2017. Everyone had spent years living in SQL land, but for massive, weirdly shaped data, Spark still wins because “it fundamentally shuffles better than anything else.” He’s excited about Ray for map-heavy CPU/GPU workflows and intrigued by projects like Apache Celeborn out of ByteDance, but he’s also openly worried about open-source Spark’s future and says hiring people who can still drive it without abstractions is now genuinely hard.

Building a foundation model is simpler than people think

Asked what it takes to start from scratch, Wills says the recipe is surprisingly straightforward: get GPUs, use TorchTitan, pull an open dataset like FineWeb, and go. He compares TorchTitan to someone publishing “the recipe for Coca-Cola” — the whole soup-to-nuts process is just sitting there. The scary part isn’t hidden training magic; it’s lighting $20 million on fire and discovering your pile of weights isn’t useful, even though the key evals are mostly open source.

Why enterprises should think harder about owning their own models

Wills is blunt here: if your company has unique domain data and your stock can swing because Anthropic ships a feature, outsourcing your future is reckless. His pitch is that many enterprises can blend their proprietary data with internet/code/math corpora and build a better, cheaper, fully controlled model for their domain — one capital expense instead of endless token spend. He does acknowledge the real blockers: finding deep learning talent and getting GPU capacity, both of which are painful right now.
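
To make the capital-versus-rental argument concrete, here is a rough break-even sketch in Python. The $20 million figure comes from the episode; the monthly API bills are hypothetical placeholders for “tens of millions per month,” and the function name is made up for illustration. It deliberately ignores serving costs, retraining, and staff, so treat it as back-of-envelope arithmetic rather than a budget.

```python
def break_even_months(one_time_cost: float, monthly_api_spend: float) -> float:
    """Months of API spend that add up to the one-time training cost."""
    if monthly_api_spend <= 0:
        raise ValueError("monthly_api_spend must be positive")
    return one_time_cost / monthly_api_spend


TRAINING_COST = 20_000_000  # one-time build figure quoted in the episode

# Hypothetical monthly API bills; the episode only says "tens of millions".
for monthly in (10_000_000, 20_000_000, 40_000_000):
    months = break_even_months(TRAINING_COST, monthly)
    print(f"${monthly:,}/month -> break-even after {months:.1f} months")
```

At these illustrative rates the one-time build is matched by two months of API spend or less, which is the shape of the argument Wills is making. The conclusion is only as good as the assumed monthly bill.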

The exact place agents break: premature certainty

The most concrete AI critique in the conversation comes from Wills’s own benchmarking workflow for data loaders feeding GPUs. He has agents run the tedious benchmarks, but when throughput hits a ceiling, they repeatedly latch onto one plausible bottleneck, explain it beautifully, fix it, and change nothing. His point is sharp: the real issue is often that nobody yet understands the problem, and agents are bad at saying “we need more context” instead of rushing into confident nonsense.
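
One way to resist premature certainty is to measure every stage before naming a bottleneck. The Python sketch below is not Wills’s harness; the stage names (`read`, `decode`, `augment`) and the sleep durations are hypothetical stand-ins for real loader work. It times each stage serially, which a real prefetching, multi-worker loader would not do, so it shows only the habit of collecting evidence first.

```python
import time
from collections import defaultdict
from typing import Callable, Iterable


def profile_stages(stages: list[tuple[str, Callable]], items: Iterable) -> dict[str, float]:
    """Run each item through every stage in order and total wall time per stage."""
    totals: dict[str, float] = defaultdict(float)
    for item in items:
        for name, fn in stages:
            start = time.perf_counter()
            item = fn(item)
            totals[name] += time.perf_counter() - start
    return dict(totals)


# Hypothetical stages; sleeps stand in for I/O, decoding, and augmentation.
def read(x):
    time.sleep(0.001)
    return x


def decode(x):
    time.sleep(0.02)
    return x


def augment(x):
    time.sleep(0.002)
    return x


STAGES = [("read", read), ("decode", decode), ("augment", augment)]
timings = profile_stages(STAGES, range(5))
total = sum(timings.values())

# Print each stage's share of total time, largest first.
for name, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} {seconds / total:6.1%}")

slowest = max(timings, key=timings.get)
```

If one stage dominates, that is a hypothesis worth testing. If the profile is flat, or fixing the top stage changes nothing, the honest conclusion is the one agents tend to skip: more context is needed before anyone claims a cause.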

Data engineering is a wicked problem, not a solvable one

Wills ties everything back to what he sees as the fundamental theorem of data engineering: what work do you do up front, and what work do you do online? That tension spans dimensional models, training pipelines, and GPU data loaders alike, and he argues it will never be “solved,” only repeatedly re-resolved as technology, users, and constraints change. He connects that to Horst Rittel and Melvin Webber’s 1973 paper on “wicked problems,” such as education, poverty, or tax systems, where even understanding the problem changes the problem.
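
The up-front versus online tension can be shown with a toy example. In this Python sketch, which assumes a made-up `CountingTransform` standing in for any expensive computation, the same reads are served two ways: by transforming everything before the first read, or by transforming on each read. Neither strategy wins in general; the access pattern decides.

```python
class CountingTransform:
    """Stand-in for an expensive computation; counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return x * x


def serve_upfront(data, reads):
    """Transform every item once, then answer reads by lookup."""
    transform = CountingTransform()
    table = [transform(x) for x in data]
    return [table[i] for i in reads], transform.calls


def serve_online(data, reads):
    """Transform at read time, with no caching."""
    transform = CountingTransform()
    return [transform(data[i]) for i in reads], transform.calls


data = list(range(1000))
sparse_reads = [3, 7, 3]              # few reads: most up-front work is wasted
heavy_reads = list(range(1000)) * 5   # every item is read five times

for label, reads in (("sparse", sparse_reads), ("heavy", heavy_reads)):
    up_results, up_calls = serve_upfront(data, reads)
    on_results, on_calls = serve_online(data, reads)
    assert up_results == on_results   # same answers either way
    print(f"{label}: upfront={up_calls} calls, online={on_calls} calls")
# sparse: upfront=1000 calls, online=3 calls
# heavy: upfront=1000 calls, online=5000 calls
```

The answers are identical and only the cost moves. Since the read pattern shifts as users and constraints change, the choice has to be revisited, which fits Wills’s description of a problem that gets re-resolved instead of solved.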

The messy future: everyone becomes a manager of unreliable agents

Near the end, the conversation gets more philosophical. Wills says he barely writes code directly anymore; instead, he manages coding agents, and he thinks that’s the new universal skill because “we are all managers now.” His caution is that the really important knowledge in management, politics, and enterprise systems often isn’t written down — which means there’s no training data — so AI will stay weakest exactly where tacit human context matters most.
