David Shapiro (15m)

Nobody gets this right

TL;DR

  • "Language models" is already a stale label: Shapiro says models have not been text-only for more than a year, citing GPT-4o's "omni" positioning and arguing that training on audio, video, images, and text makes the old framing misleading.

  • World models are a matter of degree, not a binary: He argues LLMs already approximate parts of a world model, and that stronger geometric, physical, and mathematical intuition would simply deepen that capability rather than create a totally separate category.

  • Embodiment alone is not general intelligence: Cats, dogs, birds, bats, monkeys, and baboons all have strong sensorimotor loops and proprioception, he says, but that does not make them generally intelligent in the way humans care about.

  • The critique he targets collapses under specifics: Shapiro calls claims like "you can't predict every pixel" and "that's not generation, that's understanding" either false or semantic quibbling, because video, action, and other modalities are still tokenized and predicted in compressed representations.

  • Many supposed world-model use cases are category errors: He says medical devices, industrial control systems, and wearables are heavily tested bespoke systems, not examples where a general-purpose generative model would be casually dropped in to improvise high-stakes behavior.

  • This debate ignores older cognitive architecture work: Shapiro points back to NASA-era autonomous systems from the 1970s, saying the core challenge of integrating many sensor streams into one decision process has been around for decades and is exactly what cognitive architectures were built for.

The Breakdown

David Shapiro argues that the whole "the world is not made of words" debate is mostly a category error: modern AI is already multimodal, prediction is still prediction whether it is text, pixels, or actions, and embodied world models matter more for robotics than for general intelligence. He says people who frame world models as a clean break from language models are behind the times, pointing to vision-language-action (VLA) systems, Nvidia's progress, and decades-old cognitive architecture work as evidence.
