AI Engineer · May 13, 2026 · 18m

Building a Chess Coach — Anant Dole and Asbjorn Steinskog, Take Take Take

TL;DR

They built a production chess coach by splitting calculation from explanation: Play Magnus runs Stockfish, custom tactical/positional detectors, and the Maia human-move model first, then uses an LLM mainly to turn that grounded context into natural-language feedback.
The core problem is that LLMs are bad at chess, not at talking about chess: Asbjørn says models can handle openings for a bit but then hallucinate moves, which is why Magnus Carlsen could roast Grok’s “Poisoned Pawn” line from their Oslo office during a Kaggle LLM chess tournament.
Their system explains the why behind a move, not just whether it was good or bad: in one example, the app marks a knight capture on e5 as “brilliant” and explains the tactical threats and detectors that fired, while another example turns a vague “bad move” like ...f5 into a concrete lesson about trapping a queen and why a central pawn capture saves it.
They closed the feedback loop with an autonomous agent that can ship fixes: when users downvote commentary in the app, the report is posted to Slack and injected into a running Claude Code channel, where an agent can triage the issue, edit prompts or detectors, regenerate commentary, ask the team questions, and even prepare a PR that can be merged from a phone.
Latency won over raw model sophistication because this is a consumer product: they targeted sub-3-second feedback after a game, landed on Gemini 3 Flash with roughly 1 second to first token and about 3 seconds end-to-end, and rejected slower reasoning-heavy models for the main review flow.
Their eval story is practical, not academic: they test 16 real-game scenarios around blunders, tactical patterns, and hallucinations, use LLM-as-judge plus their own chess strength as SME review, and report Gemini Flash at about 75%, Claude with more thinking just under 60%, and GPT-5 mini lower on accuracy.

Summary

From Magnus Carlsen to a chess coach in your pocket

Anant opens with the hook: yes, that Magnus Carlsen — widely considered the best player in the world — founded Play Magnus, where both speakers work. The app lets people play, post games, and now get an AI-powered game review that explains moments like a knight capture on e5 leading to mate, complete with “brilliant” labels and commentary about threats, tactics, and plans.

A quick history lesson: chess and AI have been tangled together for decades

Asbjørn runs through the lineage from Claude Shannon’s 1949 paper to Deep Blue beating Kasparov in 1997, then to DeepMind’s AlphaGo and AlphaZero. He frames it as the old split between brute-force “type A” engines and more intuitive “type B” systems — and notes that modern LLMs reintroduced the dream of explanation, even if they still can’t actually play solid chess.

Why LLMs fall apart over the board

The funniest proof is Magnus commenting on an LLM chess tournament from their Oslo office, dunking on Grok for drifting into a Poisoned Pawn line and then losing badly. Asbjørn’s point is simple: LLMs are trained on language, not board search, so they hallucinate moves. Transformer architectures themselves aren’t the issue, though, since DeepMind has shown that a transformer trained on position evaluations can reach grandmaster strength.

The real pipeline: Stockfish thinks, detectors inspect, the LLM translates

Their solution is to keep the model tightly grounded. First they analyze the whole game with Stockfish, then extract context using detectors for forks, pins, skewers, doubled pawns, and other tactical or structural themes. They also use Maia, the human-move model from the University of Toronto, to estimate which moves a human at, say, a 1500 rating would actually find.
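To make the "detectors inspect" step concrete, here is a minimal sketch of one detector (a knight fork) feeding a structured context record. It is an illustration under assumptions, not the team's code: `detect_knight_fork`, `MoveContext`, the piece-map format, and the numeric values are all hypothetical, and a real pipeline would take evaluations from Stockfish and find rates from Maia instead of the placeholders used here.

```python
from dataclasses import dataclass

KNIGHT_JUMPS = [(1, 2), (2, 1), (2, -1), (1, -2), (-1, -2), (-2, -1), (-2, 1), (-1, 2)]
VALUABLE = {"king", "queen", "rook"}


def knight_targets(square: str) -> set[str]:
    """Squares attacked by a knight standing on `square` (algebraic, e.g. 'c7')."""
    x, y = ord(square[0]) - ord("a"), int(square[1]) - 1
    return {
        "abcdefgh"[x + dx] + str(y + dy + 1)
        for dx, dy in KNIGHT_JUMPS
        if 0 <= x + dx < 8 and 0 <= y + dy < 8
    }


def detect_knight_fork(knight_square: str, enemy_pieces: dict[str, str]) -> list[str]:
    """Return the attacked squares if the knight hits two or more valuable pieces."""
    hits = sorted(
        sq for sq in knight_targets(knight_square) if enemy_pieces.get(sq) in VALUABLE
    )
    return hits if len(hits) >= 2 else []


@dataclass
class MoveContext:
    """Hypothetical record handed to the LLM: every field is computed, not generated."""
    move: str
    eval_swing_cp: int          # engine evaluation change in centipawns
    detectors_fired: list[str]  # names of the detectors that matched
    human_find_rate: float      # share of ~1500-rated players expected to find the move


# Toy position: a knight lands on c7 with the enemy king on e8 and rook on a8.
enemy = {"e8": "king", "a8": "rook", "d8": "queen"}
fork = detect_knight_fork("c7", enemy)
context = MoveContext(
    move="Nc7+",
    eval_swing_cp=450,       # placeholder, not real engine output
    detectors_fired=["knight_fork"] if fork else [],
    human_find_rate=0.35,    # placeholder, not real Maia output
)
```

The point of the design is that the language model never has to discover the fork itself; it only receives the fact that the detector fired.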

Making commentary feel useful instead of generic

That extra context lets them explain not just that a move was bad, but why. In Asbjørn’s example, an opponent’s ...f5 is more than a “bad move” indicator: the system can say it threatens to trap the queen with Bg5, but also explain that capturing in the center defends the escape square and gets the queen out.
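One way to picture how that context reaches the model is a prompt that lists the precomputed facts and forbids anything else. The sketch below is an assumption about the general shape, not the prompt the team ships: `build_grounded_prompt` is a hypothetical helper, and the two facts simply paraphrase the ...f5 example above.

```python
def build_grounded_prompt(move: str, label: str, facts: list[str]) -> str:
    """Assemble a prompt that restricts the model to precomputed facts."""
    fact_lines = "\n".join(f"- {fact}" for fact in facts)
    return (
        "You are a chess coach. Explain the move to the player in two sentences.\n"
        "Use ONLY the facts listed below. Do not mention any move or square "
        "that is not in the facts.\n\n"
        f"Move played: {move}\n"
        f"Engine label: {label}\n"
        f"Facts:\n{fact_lines}\n"
    )


# Facts paraphrased from the talk's example; in practice they would come
# from the engine and the detectors, never from the language model.
facts = [
    "threat: the queen can be trapped by Bg5",
    "resource: capturing in the center defends the queen's escape square",
]
prompt = build_grounded_prompt("...f5", "bad move", facts)
print(prompt)
```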

The downvote-to-PR loop is the most modern part of the whole thing

If a user reports commentary as bad in the app, the report is posted to Slack and injected into a running Claude Code channel, which kicks off a commentary-triage skill. The agent can inspect the position, rerun generation, modify prompts or detectors, verify the fix, and ping a human in Slack, so Asbjørn can literally approve and merge from a bus ride if it looks right.
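The loop described here can be sketched as plain data flowing from a downvote to a triage task. Everything in this block is hypothetical (the `Downvote` and `TriageTask` shapes, the step names, the message format), and it deliberately makes no Slack or agent API calls; it only shows the hand-off and the fact that a human still approves the merge.

```python
from dataclasses import dataclass, field

# Step names mirror the actions listed above; the identifiers are made up.
TRIAGE_STEPS = [
    "inspect_position",
    "regenerate_commentary",
    "edit_prompt_or_detector",
    "verify_fix",
    "ask_human_in_slack",
]


@dataclass
class Downvote:
    game_id: str
    ply: int            # half-move index of the commentary that was reported
    commentary: str
    user_note: str = ""


@dataclass
class TriageTask:
    summary: str
    steps: list[str] = field(default_factory=lambda: list(TRIAGE_STEPS))
    requires_human_approval: bool = True  # a person still reviews and merges


def slack_message(vote: Downvote) -> str:
    """Format the text that would be posted to the team channel."""
    note = f" Note: {vote.user_note}" if vote.user_note else ""
    return f"Downvote on game {vote.game_id}, ply {vote.ply}: {vote.commentary!r}.{note}"


def to_triage_task(vote: Downvote) -> TriageTask:
    """Turn a reported comment into a task for the triage agent."""
    return TriageTask(summary=slack_message(vote))


vote = Downvote("g123", 27, "This knight move wins a rook.", "There is no rook there")
task = to_triage_task(vote)
print(task.summary)
```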

Fast enough to feel instant, even if smarter models exist

Because users want review immediately after a game, they optimized for speed over elaborate reasoning. Their target was sub-3 seconds, and Gemini 3 Flash hit roughly 1 second time-to-first-token and about 3 seconds end-to-end, while slower reasoning models may be fine later for a more patient “chat with your coach” experience.
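The two latency numbers quoted above, time to first token and end-to-end time, can be measured around any streaming response. The sketch below uses a fake token stream so it runs without a model; `fake_stream`, `measure_stream`, and the 3-second default budget (taken from the target mentioned above) are illustrative assumptions.

```python
import time
from collections.abc import Iterable, Iterator


def fake_stream(tokens: list[str], delay_s: float) -> Iterator[str]:
    """Stand-in for a streaming model response: yields tokens with a delay."""
    for token in tokens:
        time.sleep(delay_s)
        yield token


def measure_stream(stream: Iterable[str]) -> dict:
    """Consume a token stream and record time-to-first-token and total time."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for token in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        parts.append(token)
    end = time.perf_counter()
    return {
        "ttft_s": None if first_token_at is None else first_token_at - start,
        "total_s": end - start,
        "text": "".join(parts),
    }


def within_budget(result: dict, budget_s: float = 3.0) -> bool:
    """Check the end-to-end time against the latency budget."""
    return result["total_s"] <= budget_s


result = measure_stream(fake_stream(["Nice ", "fork ", "on c7."], delay_s=0.01))
print(result["text"], within_budget(result))
```

Tracking both numbers matters because a streamed review can feel responsive once the first token arrives, even when the full text takes longer.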

Their closing lesson: build the plumbing, then let the model speak

They evaluate with 16 chess scenarios pulled from real games, covering tactical patterns, blunders, and hallucination resistance, and use OpenRouter to swap models quickly. The bigger takeaway is broader than chess: separate the data pipeline from language generation, invest in context extraction even if it starts as a painful giant JSON blob, and use autonomous agents plus domain experts to tighten the loop.
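A scenario-based eval of this kind can be sketched as a small harness that takes any model behind a `generate` callable, which is also what makes swapping models cheap. Note the substitution: the talk describes LLM-as-judge plus expert review, whereas this sketch uses a crude deterministic check (a regex for move-like tokens compared against an allowed set) as a stand-in for hallucination resistance. The scenario format, `run_eval`, and the stub model are hypothetical.

```python
import re
from typing import Callable

# Crude pattern for move-like tokens such as Bg5, exd5 or e8.
# It ignores castling and promotions; good enough for a sketch.
MOVE_TOKEN = re.compile(
    r"\b(?:[KQRBN][a-h1-8]?x?[a-h][1-8]|[a-h]x[a-h][1-8]|[a-h][1-8])\b"
)


def ungrounded_tokens(commentary: str, allowed: set[str]) -> set[str]:
    """Move-like tokens in the commentary that the context never supplied."""
    return set(MOVE_TOKEN.findall(commentary)) - allowed


def run_eval(scenarios: list[dict], generate: Callable[[dict], str]) -> float:
    """Return the fraction of scenarios that mention the theme without inventing moves."""
    passed = 0
    for scenario in scenarios:
        text = generate(scenario)
        mentions_theme = scenario["must_mention"].lower() in text.lower()
        if mentions_theme and not ungrounded_tokens(text, scenario["allowed"]):
            passed += 1
    return passed / len(scenarios)


# Two toy scenarios; the real suite described above has 16 from actual games.
scenarios = [
    {"name": "knight fork", "allowed": {"Nc7", "e8", "a8"}, "must_mention": "fork"},
    {"name": "trapped queen", "allowed": {"Bg5", "f5"}, "must_mention": "trap"},
]


def stub_model(scenario: dict) -> str:
    """Stand-in for a model call; the second answer invents a move on purpose."""
    if scenario["name"] == "knight fork":
        return "Nc7 is a fork: the knight hits e8 and a8 at once."
    return "Qh7 traps nothing here."


pass_rate = run_eval(scenarios, stub_model)
print(f"pass rate: {pass_rate:.0%}")
```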
