
Playbook
Tasteful Skills
“Tasteful Skills” argues that the best agent skills are not documentation or best-practice lists.
Fast models change the bottleneck from waiting to supervision — Sarah Chieng says Codex Spark runs at 1,200 tokens/sec versus roughly 40-60 for Sonnet or Opus, which means sloppy prompting now produces bad code 20x faster unless developers slow down and actively steer.
The speed jump is coming from the whole inference stack, not one trick: she points to hardware changes like Cerebras's on-chip SRAM, disaggregated prefill/decode, MoE architectures, pruning methods like REAP, and inference-layer KV cache reuse from providers like Together, Modal, Fireworks, and Baseten.
'Agent swarm' setups look impressive but often just manufacture technical debt — her critique of six terminals, 500-agent swarms, and eight-agent five-screen rigs is that nobody is verifying the output, and faster inference makes that problem much worse.
Use smart models for planning and fast models for execution — her suggested workflow is a larger model like GPT-5.3/5.4 for long-horizon planning, then a fast executor like Codex Spark for subagents, repeated tasks, and replaying successful 'skills' from prior sessions.
Validation becomes basically free at 1,200 tokens/sec — instead of saving tests, linting, pre-commit hooks, diff reviews, browser QA, and cleanup for the end, she argues these checks should run at every step because they no longer meaningfully slow you down.
Context management gets more urgent as models speed up — if a context window used to fill in 10 minutes, a 20x faster model can hit compaction in 30 seconds, so she recommends breaking work into bounded tasks and externalizing state into files like agents.md, plan.md, progress.md, and verify.md.
Sarah Chieng opens with a blunt premise: developers picked up bad habits from slow code generation — giant one-shot prompts, massive commits, and too many agents running at once. With Codex Spark generating code at 1,200 tokens per second versus roughly 40-60 for models like Sonnet or Opus, those habits no longer waste minutes; they generate technical debt at industrial speed.
She zooms out to explain that this is a stack-wide shift, not a single model trick. On hardware, she frames the "memory wall" as the villain, noting that 50-80% of inference latency comes from memory movement; that is why NVIDIA GPUs, which keep model weights in off-chip HBM, look so different from Cerebras's wafer-scale design, which distributes SRAM across the chip so each core can directly access what it needs.
One of her biggest technical points is that prefill and decode are finally being split across different hardware. Prefill is parallel and compute-bound, decode is sequential and memory-bound, so running both on the same machine is increasingly wasteful; she ties that trend to examples like NVIDIA buying Groq for $20 billion and Cerebras partnering with AWS to combine its wafer with Trainium.
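The compute-bound versus memory-bound split can be illustrated with a rough roofline-style estimate. This is a minimal sketch, not something from the talk: it counts only weight reads for a dense layer (ignoring attention and KV cache traffic), and the accelerator numbers are hypothetical placeholders.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: int = 2) -> float:
    """Approximate FLOPs per byte of weights read in one forward pass.

    A dense layer with P parameters costs about 2 * P FLOPs per token
    (one multiply and one add per weight) and reads P * bytes_per_param
    bytes of weights once per pass, so P cancels out of the ratio.
    """
    return (2 * tokens_per_pass) / bytes_per_param


def bound(tokens_per_pass: int, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a pass as compute-bound or memory-bound."""
    ridge = peak_flops / mem_bandwidth  # FLOPs the chip can do per byte moved
    if arithmetic_intensity(tokens_per_pass) >= ridge:
        return "compute-bound"
    return "memory-bound"


# Hypothetical accelerator: 1e15 FLOP/s peak, 3e12 bytes/s memory bandwidth.
PEAK_FLOPS = 1e15
MEM_BANDWIDTH = 3e12

# Prefill pushes the whole prompt through in one pass; decode emits one token per pass.
prefill = bound(tokens_per_pass=4096, peak_flops=PEAK_FLOPS, mem_bandwidth=MEM_BANDWIDTH)
decode = bound(tokens_per_pass=1, peak_flops=PEAK_FLOPS, mem_bandwidth=MEM_BANDWIDTH)
print(prefill, decode)
```

Under these assumptions a 4,096-token prefill sits well above the ridge point while single-token decode sits far below it, which is the mismatch that motivates putting the two phases on different hardware.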
At the model layer, she calls out mixture-of-experts as the canonical example: activate only part of the model and you get the intelligence of a much larger system at the compute cost of a smaller one. She also mentions pruning techniques like REAP and software-layer optimizations like KV cache reuse from infrastructure players such as Together, Baseten, Modal, and Fireworks.
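The "activate only part of the model" idea can be sketched as top-k expert routing. This is a toy illustration under stated assumptions: the experts are scalar functions, and the router scores are hard-coded hypothetical values rather than the output of a learned router.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are renormalized so the selected experts sum to 1.
    """
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(x, experts, router_scores, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    chosen = route_top_k(router_scores, k)
    output = sum(w * experts[i](x) for i, w in chosen)
    return output, [i for i, _ in chosen]


# Eight toy experts; expert n simply multiplies its input by n + 1.
experts = [lambda x, scale=s: scale * x for s in range(1, 9)]
# Hypothetical router scores for one token.
router_scores = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

output, active = moe_layer(3.0, experts, router_scores, k=2)
print(active, len(active) / len(experts))
```

Only two of the eight experts run for this token, so the per-token compute is a quarter of what running every expert would cost, while the full set of parameters stays available to the router.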
Sarah has fun with the internet's current flex culture: six Claude Code terminals, 500-agent swarms, eight agents across five screens. Her point is not that parallelism is fake, but that social-media setups often hide the real issue — nobody is checking the code, and with faster models that becomes downright dangerous.
Her first workflow recommendation is orchestration by strength. Use a more capable model like GPT-5.3 or 5.4 for planning and long-horizon reasoning, then hand the actual checklist to fast executors like Codex Spark; if a session goes especially well, capture it as a reusable skill so a small, fast model can replay a verified trajectory over and over.
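A minimal sketch of that planner/executor split might look like the following. The `slow_planner`, `fast_executor`, and `verify` functions are hypothetical stubs standing in for real model calls and checks; they are not part of any SDK, and the `Skill` type is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """A verified sequence of steps that a fast model can replay later."""
    name: str
    steps: list = field(default_factory=list)


def orchestrate(goal, planner, executor, verify):
    """Plan once with the capable model, then run each step with the fast one."""
    completed = []
    for step in planner(goal):
        result = executor(step)
        if not verify(step, result):
            raise RuntimeError(f"step failed verification: {step}")
        completed.append(step)
    # Only a fully verified trajectory is captured as a reusable skill.
    return Skill(name=goal, steps=completed)


# Hypothetical stand-ins for real model calls.
def slow_planner(goal):
    return [f"{goal}: write failing test", f"{goal}: implement", f"{goal}: refactor"]


def fast_executor(step):
    return f"done -> {step}"


def verify(step, result):
    return result.endswith(step)


skill = orchestrate("add navbar", slow_planner, fast_executor, verify)
# Replaying the skill skips planning and goes straight to the fast executor.
replayed = [fast_executor(step) for step in skill.steps]
print(skill.name, len(replayed))
```

The design choice worth noting is that the expensive planner runs once per goal, while the captured steps can be replayed by the cheap executor as many times as needed.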
This is where speed becomes liberating instead of reckless. At 1,200 tokens per second, she says validation is "basically free," so tests, linting, pre-commit hooks, diff reviews, browser QA, and automatic refactors should run continuously, not just at the end; she also loves using fast models to generate 15 navbar variations — or 75 via subagents — so the human can cherry-pick the one with the best taste.
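Running checks after every step, rather than once at the end, can be sketched as a small loop. This is an illustration under assumptions: the check commands below are trivial stand-ins built from the current Python interpreter, where a real project would substitute its own test, lint, and hook commands (for example `pytest -q`).

```python
import subprocess
import sys


def run_checks(checks):
    """Run each (name, command) check; return the names of the ones that failed."""
    failed = []
    for name, cmd in checks:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            failed.append(name)
    return failed


def apply_steps(steps, checks):
    """Apply each step, validate immediately, and stop at the first failing step.

    Returns (failing_step_number, failed_check_names), or (None, []) if all pass.
    """
    for number, step in enumerate(steps, start=1):
        step()
        failed = run_checks(checks)
        if failed:
            return number, failed
    return None, []


# Stand-in checks: one always passes, one always fails.
passing = ("tests", [sys.executable, "-c", "raise SystemExit(0)"])
failing = ("lint", [sys.executable, "-c", "raise SystemExit(1)"])

log = []
steps = [lambda: log.append("edit 1"), lambda: log.append("edit 2")]

print(apply_steps(steps, [passing]))
print(apply_steps(steps, [passing, failing]))
```

With only the passing check, both steps are applied; once the failing check is added, the loop halts after the first step, so a problem surfaces one small edit after it was introduced.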
Her strongest behavioral advice is to stop treating AI coding like "spawn a session, get a hamburger, scroll Twitter, come back." Instead, she wants developers acting like real-time pair programmers: constrain the model, ban file deletion, cap diff size, say things like "only change this" or "don't touch types yet," and stay in the driver's seat because "the AI should always be helping you make decisions, not the other way around."
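Constraints like "ban file deletion", "cap diff size", and "only change this" can also be enforced mechanically on the model's output. The sketch below checks a git-style unified diff against those three rules; the function name, thresholds, and sample diff are hypothetical, and the line counting is deliberately simplified (a content line that itself begins with `+++` or `---` would be skipped).

```python
def check_diff(diff_text, max_changed_lines=200, allowed_paths=None):
    """Return a list of guardrail violations found in a git-style unified diff."""
    violations = []
    changed = 0
    for line in diff_text.splitlines():
        if line.startswith("deleted file mode"):
            violations.append("file deletion is banned")
        elif line.startswith("+++ b/"):
            path = line[len("+++ b/"):]
            if allowed_paths is not None and path not in allowed_paths:
                violations.append(f"touched disallowed file: {path}")
        elif line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
            changed += 1  # one added or removed content line
    if changed > max_changed_lines:
        violations.append(f"diff too large: {changed} changed lines (cap {max_changed_lines})")
    return violations


# Hypothetical diff: one small edit plus a deleted file.
SAMPLE_DIFF = """\
diff --git a/src/nav.py b/src/nav.py
--- a/src/nav.py
+++ b/src/nav.py
@@ -1,2 +1,2 @@
-old = 1
+new = 2
 keep = 3
diff --git a/src/types.py b/src/types.py
deleted file mode 100644
--- a/src/types.py
+++ /dev/null
@@ -1 +0,0 @@
-Color = str
"""

violations = check_diff(SAMPLE_DIFF, max_changed_lines=2, allowed_paths={"src/nav.py"})
for violation in violations:
    print(violation)
```

A check like this could run before any change is accepted, so the human stays the one deciding whether a large diff or a deletion is actually wanted.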
She closes on context management with a neat bit of math: if a model used to take 10 minutes to hit compaction, a 20x faster one gets there in 30 seconds. Her answer is external memory and bounded tasks — keep agents.md for roles, plan.md for the checklist, progress.md for state, and verify.md for quality gates — so each new session can pick up cleanly without dragging a bloated context window behind it.
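The compaction arithmetic and the external-memory layout can be sketched together. The four file names and their purposes come from the talk; the helper functions and the file contents they write are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def seconds_to_compaction(baseline_minutes: float, speedup: float) -> float:
    """Time until the context window fills, given a baseline and a speedup factor."""
    return baseline_minutes * 60 / speedup


# File names and purposes as described in the talk.
STATE_FILES = {
    "agents.md": "roles",
    "plan.md": "checklist",
    "progress.md": "state",
    "verify.md": "quality gates",
}


def init_state(root: Path) -> None:
    """Create any missing state file with a short placeholder header."""
    for name, purpose in STATE_FILES.items():
        path = root / name
        if not path.exists():
            path.write_text(f"# {name}\n\nPurpose: {purpose}\n")


def record_progress(root: Path, task: str) -> None:
    """Append a finished task so a fresh session can resume from disk."""
    with (root / "progress.md").open("a") as handle:
        handle.write(f"- [x] {task}\n")


print(seconds_to_compaction(baseline_minutes=10, speedup=20))

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    init_state(root)
    record_progress(root, "bounded task 1: add navbar test")
    created = sorted(p.name for p in root.iterdir())
    progress = (root / "progress.md").read_text()
    print(created)
```

Because the state lives in files rather than in the context window, each bounded task can start in a fresh session and read only what it needs.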