
Playbook
Tasteful Skills
“Tasteful Skills” argues that the best agent skills are not documentation or best-practice lists.
Karpathy says December was a real inflection point for coding agents — he went from fixing AI-generated code to “I can’t remember the last time I corrected it,” which Matthew Berman frames as the moment frontier users felt models plus tool harnesses become end-to-end useful.
The core shift is from specifying steps to verifying outcomes — Karpathy’s line is that traditional software automates what you can specify in code, while LLMs automate what you can verify, which explains why code and math have improved so quickly.
“Software 3.0” means prompting an LLM like it’s the computer itself — building moves from writing explicit rules or training task-specific models to steering a general model with context, with the LLM acting as CPU and the context window as RAM.
AI’s brilliance and stupidity come from jagged intelligence, not general intelligence — the same model that can refactor a 100,000-line codebase or find zero-days can still fail common-sense questions like whether to walk 50 meters to a car wash, because labs heavily reward verifiable domains.
Karpathy’s founder advice is blunt: verifiable domains are tractable, but labs will likely absorb the obvious ones — if a problem can be turned into strong RL environments and easy checks, startups may still fine-tune successfully, but they’re also building where foundation-model companies can move fastest.
Vibe coding raises the floor; agentic engineering raises the ceiling — anyone can now build rough software with AI, but professional teams still need taste, orchestration, and quality control, with Karpathy comparing today’s agents to brilliant but unreliable interns.
The conversation opens with Karpathy admitting he’s “never felt more behind as a programmer,” which lands because it’s Andrej Karpathy saying it, not some random hype account. He describes a clear break around December: coding agents stopped being useful only for snippets and started producing larger chunks that “just came out fine,” until he trusted them enough to start vibe coding in earnest.
Karpathy revisits his framework: software 1.0 is handwritten code, software 2.0 is programming via datasets and learned weights, and software 3.0 is prompting a general-purpose model through context. Berman reinforces the mental model with Karpathy’s old diagram: the LLM is basically the CPU, the context window is RAM, and the surrounding tools are just peripherals around this new neural computer.
One of Karpathy’s examples is OpenClaw installation: instead of a giant cross-platform bash script, the “installer” is just text you paste to your agent. That’s the paradigm shift in one tiny example — don’t over-specify steps, state the outcome and let the model inspect the environment, debug, and act. Berman ties this to products like Here and his own Journey Kits, where setup instructions have shrunk into a few lines of agent-facing text.
Karpathy tells a story about building a menu app the old way — OCR, image generation, rendering, hosting — only to see a software 3.0 version that simply gives the photo to Gemini and asks it to overlay the menu items directly into the image. His reaction is basically: that whole app “shouldn’t exist.” Berman connects this to Tesla and the “bitter lesson”: once end-to-end neural nets get good enough, hand-authored heuristics start looking like technical debt.
This is the heart of the talk. Karpathy says LLMs excel where outputs can be verified, because training now looks like giant RL environments with clear rewards, which creates “jagged” capability spikes in domains like coding and math. That’s why a model can crush refactors and security work yet still tell you to walk 50 meters to a car wash — wildly competent in high-reward, checkable spaces, weirdly dumb at simple everyday judgment.
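To make "automate what you can verify" concrete, here is a minimal Python sketch of a verifiable reward. It is an illustration only, not how any lab actually trains: the task (sorting a list), the `CHECKS` set, and the `sample_*` functions standing in for model outputs are all hypothetical. The point is that nobody specifies how to sort; the checker only scores whether an output is right.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical check set: (input, expected output) pairs for "sort a list".
CHECKS: List[Tuple[list, list]] = [
    ([3, 1, 2], [1, 2, 3]),
    ([], []),
    ([2, 2, 1], [1, 2, 2]),
]


def reward(candidate: Callable[[list], list],
           checks: Sequence[Tuple[list, list]] = CHECKS) -> float:
    """Return the fraction of checks the candidate passes.

    A crash counts as a failed check, so the reward is always in [0, 1].
    """
    passed = 0
    for given, expected in checks:
        try:
            # Pass a copy so a candidate cannot mutate the shared check data.
            if candidate(list(given)) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(checks)


def pick_best(candidates: Sequence[Callable[[list], list]]) -> Callable[[list], list]:
    """Select the candidate with the highest verified reward."""
    return max(candidates, key=reward)


# Hypothetical stand-ins for sampled model outputs.
def sample_a(xs: list) -> list:
    return xs  # returns the input unchanged


def sample_b(xs: list) -> list:
    return sorted(xs)  # correct


def sample_c(xs: list) -> list:
    return xs[0]  # wrong type, and raises IndexError on an empty list


best = pick_best([sample_a, sample_b, sample_c])
print(best.__name__, reward(best))
```

The same shape explains the jaggedness described above: where a cheap, reliable `reward` like this exists, outputs can be scored and improved at scale; for everyday judgment calls there is no such checker to optimize against.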
Asked what startups should do if labs are already dominating coding and math, Karpathy says verifiable domains are still tractable because founders can create their own RL environments and fine-tune on them. But there’s a catch: those same properties also make them easy for the big labs to swallow eventually. His half-teasing, half-frustrating answer is that there are valuable RL environments people aren’t focusing on yet — but he won’t quite say which ones.
Karpathy draws a clean distinction: vibe coding raises the floor so anyone can build software, while agentic engineering preserves the professional quality bar while using these “spiky,” stochastic agents to go faster. Berman loves that framing and adds examples like Peter Steinberger running dozens or even 100 agents in parallel across coding, deployment, bug-finding, and PRs — not just prompting, but orchestration as a real engineering discipline.
Near the end, Karpathy says agents today are basically intern-like: powerful, but still needing human oversight, judgment, aesthetics, and direction. He predicts a world of agent-first infrastructure and agent-to-agent interaction, complaining that docs are still written for humans when what he wants is simply “the thing I should copy paste to my agent.” He closes with the line he can’t stop thinking about: “You can outsource your thinking, but you can’t outsource your understanding,” which becomes the video’s real warning label.

