Nobody gets this right
TL;DR
"Language models" is already a stale label: Shapiro says models have not been text-only for more than a year, citing GPT-4o's "omni" positioning, and argues that training on audio, video, images, and text makes the old framing misleading.
World models are a matter of degree, not a binary: He argues LLMs already approximate parts of a world model, and that stronger geometric, physical, and mathematical intuition would simply deepen that capability rather than create a totally separate category.
Embodiment alone is not general intelligence: Cats, dogs, birds, bats, monkeys, and baboons all have strong sensorimotor loops and proprioception, he says, but that does not make them generally intelligent in the way humans care about.
The critique he targets collapses under specifics: Shapiro calls claims like "you can't predict every pixel" and "that's not generation, that's understanding" either false or semantic quibbling, because video, action, and other modalities are still tokenized and predicted in compressed representations.
Many supposed world-model use cases are category errors: He says medical devices, industrial control systems, and wearables are heavily tested bespoke systems, not examples where a general-purpose generative model would be casually dropped in to improvise high-stakes behavior.
This debate ignores older cognitive architecture work: Shapiro points back to NASA-era autonomous systems from the 1970s, saying the core challenge of integrating many sensor streams into one decision process has been around for decades and is exactly what cognitive architectures were built for.
The Breakdown
David Shapiro argues that the whole "the world is not made of words" debate is mostly a category error: modern AI is already multimodal, prediction is still prediction whether it is text, pixels, or actions, and embodied world models matter more for robotics than for general intelligence. He says people who frame world models as a clean break from language models are behind the times, and he points to vision-language-action (VLA) systems, Nvidia's progress, and decades-old cognitive architecture work as evidence.