
The hard part of foundation models isn’t training, it’s data. Josh Wills says you can build a state-of-the-art model with GPUs, Meta’s TorchTitan, and open datasets like Hugging Face FineWeb, but the real secret sauce is curating the right data.
Spark is still weirdly irreplaceable for giant shuffles. Wills left “raw” Spark in 2017 and returned in 2024 to find almost nothing had changed, because Spark still beats newer systems on huge, ugly join/shuffle workloads across many nodes.
Enterprise companies may be better off training their own models than renting intelligence forever. For firms with proprietary domain data, Wills argues a one-time $20 million model build can be smarter and cheaper than paying OpenAI, Anthropic, or Google tens of millions per month in perpetuity.
AI agents are useful, but they consistently fail at one critical engineering move: admitting they don’t understand the problem. Wills describes benchmarking bottlenecks where agents confidently diagnose the wrong cause, fix the wrong thing, and produce no improvement because they jump to conclusions too early.
Data engineering’s core problem hasn’t changed in 25 years. Whether it’s Kimball-style warehousing or GPU data loaders, the enduring tension is still how much work to do up front versus online at runtime, which Wills calls a fundamentally “wicked problem.”
The near-term risk isn’t superintelligence, it’s people vibe-coding expensive garbage. Wills shares an anecdote of a giant model-generated Spark pipeline that technically worked but cost “a couple hundred thousand” to run, a failure he attributes to the near absence of training data for well-engineered large-scale data systems.
Josh Wills opens by sketching the arc of a 25-year career: Cloudera in the big data era, building Slack’s early data infra and search indexing pipeline, writing the DuckDB adapter for dbt, and now working at Datology on data curation for foundation models. The scale jump is real — Datology has roughly 6–7 petabytes of data at a tiny company — but to him it feels less like a new field than “big data all over again,” just finally with a killer use case in multimodal model training.
Wills says the move back into hands-on engineering was surprisingly easy, partly because he came up in a pre-SQL-dominant era of raw MapReduce, Spark, and odd-shaped data. What pushed him back was more personal: after “retiring,” angel investing, and hearing himself talk at a Slack reunion, he realized he sounded like a venture capitalist — and hated it. He loves the pain and intimacy of real engineering work, and says that’s what made him a good speaker in the first place.
His funniest observation from returning in 2024: nothing had really changed since 2017. Everyone had spent years living in SQL land, but for massive, weirdly shaped data, Spark still wins because “it fundamentally shuffles better than anything else.” He’s excited about Ray for map-heavy CPU/GPU workflows and intrigued by projects like Apache Celeborn out of ByteDance, but he’s also openly worried about open-source Spark’s future and says hiring people who can still drive it without abstractions is now genuinely hard.
Asked what it takes to train a foundation model from scratch, Wills says the recipe is surprisingly straightforward: get GPUs, use TorchTitan, pull an open dataset like FineWeb, and go. He compares TorchTitan to someone publishing “the recipe for Coca-Cola”: the whole soup-to-nuts process is just sitting there. The scary part isn’t hidden training magic; it’s lighting $20 million on fire and discovering your pile of weights isn’t useful, even though the key evals are mostly open source.
Wills is blunt here: if your company has unique domain data and your stock can swing because Anthropic ships a feature, outsourcing your future is reckless. His pitch is that many enterprises can blend their proprietary data with internet/code/math corpora and build a better, cheaper, fully controlled model for their domain — one capital expense instead of endless token spend. He does acknowledge the real blockers: finding deep learning talent and getting GPU capacity, both of which are painful right now.
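The build-versus-rent argument comes down to break-even arithmetic. Here is a minimal sketch, assuming the one-time $20 million build cost from the conversation and a hypothetical $10 million monthly API bill (the conversation only says "tens of millions per month"). It deliberately ignores serving, retraining, and staffing costs, which a real comparison would need to include.

```python
def breakeven_months(build_cost: float, monthly_api_spend: float) -> float:
    """Months of API spend that add up to the one-time build cost."""
    if monthly_api_spend <= 0:
        raise ValueError("monthly_api_spend must be positive")
    return build_cost / monthly_api_spend


# Figure quoted in the conversation: a one-time model build.
BUILD_COST = 20_000_000
# Assumption for illustration only: the low end of "tens of millions per month".
MONTHLY_API_SPEND = 10_000_000

months = breakeven_months(BUILD_COST, MONTHLY_API_SPEND)
print(f"Break-even after {months:.1f} months of API spend")
```

Under these assumed numbers the build pays for itself within a few months; a smaller monthly bill pushes the break-even point out proportionally.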
The most concrete AI critique in the conversation comes from Wills’s own benchmarking workflow for data loaders feeding GPUs. He has agents run the tedious benchmarks, but when throughput hits a ceiling, they repeatedly latch onto one plausible bottleneck, explain it beautifully, fix it, and change nothing. His point is sharp: the real issue is often that nobody yet understands the problem, and agents are bad at saying “we need more context” instead of rushing into confident nonsense.
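One way to resist the jump to a plausible diagnosis is to measure where the time goes before changing anything. The sketch below is not Wills's workflow. It is a hypothetical illustration in which `read`, `decode`, and `collate` stand in for a data loader's stages, and a sleep simulates the slow one.

```python
import time
from typing import Callable, Dict, Iterable


def time_stages(stages: Dict[str, Callable], items: Iterable) -> Dict[str, float]:
    """Run every item through each stage in order and total the seconds per stage."""
    totals = {name: 0.0 for name in stages}
    for item in items:
        for name, fn in stages.items():
            start = time.perf_counter()
            item = fn(item)
            totals[name] += time.perf_counter() - start
    return totals


# Hypothetical stand-ins for real loader stages.
def read(x):
    return x


def decode(x):
    time.sleep(0.002)  # simulated work: pretend decoding is the slow step
    return x


def collate(x):
    return [x]


totals = time_stages({"read": read, "decode": decode, "collate": collate}, range(20))
slowest = max(totals, key=totals.get)
for name, seconds in totals.items():
    print(f"{name:8s} {seconds * 1000:7.2f} ms")
print("slowest stage:", slowest)
```

A per-stage breakdown like this only narrows the search. If the totals do not account for the observed ceiling, that is the signal that more context is needed, not another fix.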
Wills ties everything back to what he sees as the fundamental theorem of data engineering: what work do you do up front, and what work do you do online? That tension spans dimensional models, training pipelines, and GPU data loaders alike, and he argues it will never be “solved,” only repeatedly re-resolved as technology, users, and constraints change. He connects that to Horst Rittel and Melvin Webber’s 1973 paper on “wicked problems” (education, poverty, and tax systems are examples), where even understanding the problem changes the problem.
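The up-front versus online question can be made concrete with a toy example. This is a sketch with made-up data and a deliberately crude cost estimate, not anything described in the conversation: the same per-key total is computed either by scanning at query time or by materializing a table once.

```python
from collections import defaultdict

# Toy event log of (key, value) pairs; the data is made up for illustration.
events = [("a", 3), ("b", 5), ("a", 7), ("c", 1), ("b", 2)]


def total_online(events, key):
    """Online: no preparation, but every query scans all events."""
    return sum(value for k, value in events if k == key)


def build_totals(events):
    """Up front: one full pass that materializes per-key totals."""
    totals = defaultdict(int)
    for k, value in events:
        totals[k] += value
    return dict(totals)


def rows_touched(n_events, n_queries, precompute):
    """Rough work estimate: rows scanned plus lookups served."""
    if precompute:
        return n_events + n_queries   # one build pass, then one lookup per query
    return n_events * n_queries       # a full scan per query


precomputed = build_totals(events)
assert precomputed["a"] == total_online(events, "a")  # same answer either way

for n_queries in (1, 100):
    online = rows_touched(len(events), n_queries, precompute=False)
    upfront = rows_touched(len(events), n_queries, precompute=True)
    print(f"{n_queries:3d} queries: online={online}, up front={upfront}")
```

With a single query the online scan does less work; with a hundred queries the precomputed table wins. The balance shifts again if the events change often enough to force frequent rebuilds, which is the sense in which the question keeps getting re-resolved.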
Near the end, the conversation gets more philosophical. Wills says he barely writes code directly anymore; instead, he manages coding agents, and he thinks that’s the new universal skill because “we are all managers now.” His caution is that the really important knowledge in management, politics, and enterprise systems often isn’t written down — which means there’s no training data — so AI will stay weakest exactly where tacit human context matters most.