
Playbook
Tasteful Skills
“Tasteful Skills” argues that the best agent skills are not documentation or best-practice lists.
Play Magnus built a production chess coach by splitting calculation from explanation. The pipeline first runs Stockfish, custom tactical and positional detectors, and the Maia human-move model, then uses an LLM mainly to turn that grounded context into natural-language feedback.
The core problem is that LLMs are bad at chess, not at talking about chess. Asbjørn says models can handle openings for a while but then hallucinate moves, which is why Magnus Carlsen could roast Grok’s “Poisoned Pawn” line from their Oslo office during a Kaggle LLM chess tournament.
Their system explains the why behind a move, not just whether it was good or bad. In one example, the app marks a knight capture on e5 as “brilliant” and explains the tactical threats and which detectors fired; in another, it turns a vague “bad move” label on ...f5 into a concrete lesson about a trapped queen and why a central pawn capture saves it.
They closed the feedback loop with an autonomous agent that can ship fixes. When a user downvotes commentary in the app, the report is posted to Slack and injected into a running Claude Code channel, where an agent can triage the issue, edit prompts or detectors, regenerate commentary, ask the team questions, and even prepare a PR that a human can merge from a phone.
Latency won over raw model sophistication because this is a consumer product. They targeted sub-3-second feedback after a game, landed on Gemini 3 Flash with roughly 1 second to first token and about 3 seconds end-to-end, and rejected slower reasoning-heavy models for the main review flow.
Their eval story is practical, not academic. They test 16 real-game scenarios covering blunders, tactical patterns, and hallucinations, combine LLM-as-judge scoring with their own chess expertise as subject-matter-expert review, and report Gemini Flash at about 75%, Claude with more thinking just under 60%, and GPT-5 mini lower on accuracy.
Anant opens with the hook: yes, that Magnus Carlsen — widely considered the best player in the world — founded Play Magnus, where both speakers work. The app lets people play, post games, and now get an AI-powered game review that explains moments like a knight capture on e5 leading to mate, complete with “brilliant” labels and commentary about threats, tactics, and plans.
Asbjørn runs through the lineage from Claude Shannon’s 1949 paper to Deep Blue beating Kasparov in 1997, then to DeepMind’s AlphaGo and AlphaZero. He frames it as the old split between brute-force “type A” engines and more intuitive “type B” systems — and notes that modern LLMs reintroduced the dream of explanation, even if they still can’t actually play solid chess.
The funniest proof is Magnus commenting on an LLM chess tournament from their Oslo office, dunking on Grok for drifting into a Poisoned Pawn line and then losing badly. Asbjørn’s point is simple: LLMs are trained on language, not board search, so they hallucinate moves. Transformer architectures themselves aren’t the issue, since DeepMind has shown that a transformer trained on position evaluations can reach grandmaster strength.
Their solution is to keep the model tightly grounded. First they analyze the whole game with Stockfish. Then they extract context using detectors for forks, pins, skewers, doubled pawns, and other tactical or structural themes. Finally, they use Maia, the human-move model from the University of Toronto, to estimate which moves humans at a given level (say, a 1500 rating) would actually find.
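A minimal sketch of that separation, under stated assumptions: the talk does not show Play Magnus code, so every name here (MoveContext, classify, build_prompt), the centipawn thresholds, and the example values are hypothetical. The idea it illustrates is only that all chess facts are computed first and the LLM receives them as read-only context.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class MoveContext:
    """Grounded facts about one move, computed before any LLM call (hypothetical schema)."""
    san: str                 # move played, in algebraic notation
    eval_before_cp: int      # engine eval before the move, centipawns, mover's view
    eval_after_cp: int       # engine eval after the move, centipawns, mover's view
    best_san: str            # engine's preferred move
    detectors: List[str] = field(default_factory=list)  # themes that fired, e.g. "fork"
    human_find_rate: float = 0.0  # assumed: predicted share of peers who find best_san


def classify(ctx: MoveContext) -> str:
    """Label a move by centipawn loss; thresholds are illustrative, not from the talk."""
    loss = ctx.eval_before_cp - ctx.eval_after_cp
    if loss >= 300:
        return "blunder"
    if loss >= 100:
        return "mistake"
    if loss >= 50:
        return "inaccuracy"
    return "good"


def build_prompt(ctx: MoveContext) -> str:
    """Serialize the facts so the LLM only has to phrase them, not calculate them."""
    facts = asdict(ctx)
    facts["label"] = classify(ctx)
    return (
        "Explain this move to the player. Use ONLY the facts in the JSON below "
        "and do not mention moves that are not listed.\n"
        + json.dumps(facts, indent=2)
    )


# Illustrative values only; they are not taken from a real game.
example = MoveContext(
    san="f5", eval_before_cp=30, eval_after_cp=-320, best_san="d5",
    detectors=["queen_trap_threat"], human_find_rate=0.35,
)
prompt = build_prompt(example)
```

The resulting prompt is the only thing the language model sees, which is what keeps its commentary tied to engine and detector output rather than to its own guesses about the board.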
That extra context lets them explain not just that a move was bad, but why. In Asbjørn’s example, an opponent’s ...f5 is more than a “bad move” indicator: the system can say it threatens to trap the queen with Bg5, but also explain that capturing in the center defends the escape square and gets the queen out.
If a user reports commentary as bad in the app, the report is posted to Slack and injected into a running Claude Code channel, which kicks off a commentary-triage skill. The agent can inspect the position, rerun generation, modify prompts or detectors, verify the fix, and ping a human in Slack, so Asbjørn can literally approve and merge from a bus ride if it looks right.
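The control flow of such a triage loop might look roughly like the sketch below. It is an assumption-heavy illustration, not the team's skill: DownvoteReport, TriageResult, triage, and the regenerate and verify callbacks are all hypothetical names, and the Slack and agent integrations are reduced to plain function arguments.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DownvoteReport:
    """A user's 'bad commentary' report (hypothetical shape)."""
    game_id: str
    ply: int
    commentary: str


@dataclass
class TriageResult:
    report: DownvoteReport
    new_commentary: str
    status: str  # "ready_for_review" or "needs_human_help"


def triage(
    report: DownvoteReport,
    regenerate: Callable[[DownvoteReport], str],
    verify: Callable[[str], bool],
    max_attempts: int = 2,
) -> TriageResult:
    """Regenerate commentary until it verifies, then hand off to a human either way."""
    candidate = report.commentary
    for _ in range(max_attempts):
        candidate = regenerate(report)
        if verify(candidate):
            # A person still approves and merges; the agent only prepares the fix.
            return TriageResult(report, candidate, "ready_for_review")
    return TriageResult(report, candidate, "needs_human_help")


# Stub callbacks standing in for the real agent steps.
report = DownvoteReport(game_id="g1", ply=12, commentary="Great move!")
result = triage(
    report,
    regenerate=lambda r: "This move loses a pawn to a fork.",
    verify=lambda text: "fork" in text,
)
```

Both outcomes end with a human in the loop, which mirrors the approve-and-merge step described above.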
Because users want review immediately after a game, they optimized for speed over elaborate reasoning. Their target was sub-3 seconds, and Gemini 3 Flash hit roughly 1 second time-to-first-token and about 3 seconds end-to-end, while slower reasoning models may be fine later for a more patient “chat with your coach” experience.
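To make the two latency numbers concrete, here is a small sketch of how time-to-first-token and end-to-end time can be measured over any streaming response. The measure_stream helper and the fake_stream generator are hypothetical stand-ins; a real model client would supply the chunks.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[str]) -> Tuple[str, float, float]:
    """Consume a token stream and return (text, time_to_first_token, total_seconds)."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for chunk in chunks:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    # An empty stream has no first token; report the total in that case.
    ttft = first_token_at if first_token_at is not None else total
    return "".join(parts), ttft, total


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming model response (assumption for the example)."""
    for word in ["Nice ", "move", "!"]:
        time.sleep(0.01)
        yield word


text, ttft, total = measure_stream(fake_stream())
within_budget = total <= 3.0  # the sub-3-second target mentioned above
```

Tracking both numbers separately matters because a stream that starts quickly can feel responsive even when the full review takes longer to finish.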
They evaluate with 16 chess scenarios pulled from real games, covering tactical patterns, blunders, and hallucination resistance, and use OpenRouter to swap models quickly. The bigger takeaway is broader than chess: separate the data pipeline from language generation, invest in context extraction even if it starts as a painful giant JSON blob, and use autonomous agents plus domain experts to tighten the loop.
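One way a scenario-based check like this could be structured is sketched below. It is a simplified assumption, not the team's harness: Scenario, mentioned_moves, passes, and pass_rate are hypothetical names, the move-matching regex is approximate, and the LLM-as-judge and expert-review steps are left out. It only shows a mechanical hallucination check, where commentary fails if it mentions a move outside the grounded set.

```python
import re
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Set

# Rough pattern for moves in algebraic notation; good enough for a sketch.
SAN_MOVE = re.compile(
    r"\b(?:O-O(?:-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?[+#]?)"
)


@dataclass(frozen=True)
class Scenario:
    name: str
    allowed_moves: FrozenSet[str]   # moves present in the grounded context
    required_terms: FrozenSet[str]  # themes the commentary must mention


def mentioned_moves(text: str) -> Set[str]:
    """Extract move-like tokens, ignoring check and mate suffixes."""
    return {m.rstrip("+#") for m in SAN_MOVE.findall(text)}


def passes(scenario: Scenario, commentary: str) -> bool:
    no_hallucination = mentioned_moves(commentary) <= scenario.allowed_moves
    lowered = commentary.lower()
    covers_theme = all(term.lower() in lowered for term in scenario.required_terms)
    return no_hallucination and covers_theme


def pass_rate(scenarios: List[Scenario], generate: Callable[[Scenario], str]) -> float:
    results = [passes(s, generate(s)) for s in scenarios]
    return sum(results) / len(results)


# Two toy scenarios with canned outputs standing in for a model call.
scenarios = [
    Scenario("fork", frozenset({"Nxe5", "Qh5"}), frozenset({"fork"})),
    Scenario("pin", frozenset({"Bb5"}), frozenset({"pin"})),
]
canned = {
    "fork": "Nxe5 sets up a fork and prepares Qh5.",
    "pin": "The pin with Bb5 is strong, and Bg5 follows.",  # Bg5 is not grounded
}
rate = pass_rate(scenarios, lambda s: canned[s.name])
```

Because generate is just a callable, swapping models means swapping one function, which fits the quick model comparisons described above.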