How LLMs Actually Work - By 0xkato
TL;DR
LLMs never read words directly: They read token IDs from a fixed vocabulary, which is why classic failures like miscounting the R's in "strawberry" trace back to tokenization rather than a simple inability to count letters.
Embeddings turn IDs into meaning, but position has to be added separately: A token ID like 1024 is only a row index until the embedding matrix maps it to a vector, and modern models usually encode position with RoPE instead of the older sinusoidal or learned absolute position embeddings.
Attention is learned matching between tokens: Query, key, and value vectors let a token like "was" look back at "cat" instead of "yesterday," and causal masking ensures that each token in a GPT-style model attends only to itself and the tokens before it.
Multi-head attention gives the model many parallel views of the same sequence: Heads are learned projections, not literal slices of the input vector, and practical systems now often use grouped-query attention (GQA), as in Llama 2 70B with 64 query heads but only 8 key-value heads.
A lot of the model's stored knowledge seems to live in the feed-forward network (FFN): The FFN holds most of the parameters in dense transformers, shows concept-specific activations, and can even be edited with methods like ROME to change a stored fact such as "Eiffel Tower is in Paris."
Most frontier LLMs share one family skeleton: GPT, Claude, Gemini, Llama, Mistral, Gemma, and Qwen mostly differ in weights, data, scale, and post-training, while the common 2023 to 2025 stack converged on pre-norm, RMSNorm, RoPE, SwiGLU, GQA, and sometimes mixture-of-experts.
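The attention points above can be made concrete with a small sketch. This is a minimal illustration rather than any model's actual implementation: it skips the learned query, key, and value projections and takes the vectors as given, the function names are hypothetical, and the consecutive grouping of query heads onto key-value heads is one common convention assumed here. The default head counts (64 and 8) are the Llama 2 70B figures quoted above.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Scaled dot-product attention with a causal mask (hypothetical helper).

    Each argument is a list with one vector per position. Position i scores
    only positions 0..i, so later tokens never influence earlier ones.
    """
    d = len(queries[0])
    n = len(queries)
    outputs, weights = [], []
    for i, q in enumerate(queries):
        # Match this query against the keys at positions 0..i only.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        w = softmax(scores)
        # The output is the weighted average of the visible value vectors.
        outputs.append([
            sum(w[j] * values[j][c] for j in range(i + 1))
            for c in range(len(values[0]))
        ])
        # Pad with zeros to show that future positions are masked out.
        weights.append(w + [0.0] * (n - i - 1))
    return outputs, weights

def kv_head_for(query_head, n_query_heads=64, n_kv_heads=8):
    """Grouped-query attention: which key-value head a query head reads.

    Assumes consecutive query heads share one key-value head.
    """
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

# Toy input: three positions, two dimensions, reused as queries, keys, and values.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(toy, toy, toy)
```

Under these assumptions the first position can only attend to itself, and each group of eight query heads shares a single key-value head, which is why GQA shrinks the key-value cache without reducing the number of query heads.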
The Breakdown
A modern LLM is mostly the same transformer stack repeated over and over: tokenize text into IDs, turn them into vectors, mix information with attention and feed-forward layers, then predict one next token at a time. The big differences between GPT, Claude, Gemini, Llama, and others are less about a mysterious new architecture and more about trained weights, scale, configuration choices like RoPE or GQA, and post-training.
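The loop described above can be sketched end to end in a few lines. Everything here is a toy stand-in and an assumption for illustration: the vocabulary is hypothetical, whitespace splitting replaces a real subword tokenizer, the weights are random and untrained, the "stack" is a simple causal average rather than real attention and feed-forward blocks, and reusing the embedding matrix to produce logits (weight tying) is one possible design choice. The generated text is therefore meaningless; only the shape of the pipeline matters.

```python
import random

VOCAB = ["<unk>", "the", "cat", "sat", "on", "mat"]  # hypothetical toy vocabulary
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
D_MODEL = 4

rng = random.Random(0)  # fixed seed so runs are repeatable
# Embedding matrix: one row of D_MODEL numbers per vocabulary entry (untrained).
EMBED = [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in VOCAB]

def tokenize(text):
    """Map text to token IDs; unknown words fall back to ID 0 (<unk>)."""
    return [TOKEN_TO_ID.get(word, 0) for word in text.lower().split()]

def embed(ids):
    """Row lookup: each token ID selects one row of the embedding matrix."""
    return [EMBED[i] for i in ids]

def transformer_stack(vectors):
    """Placeholder for the repeated attention + feed-forward blocks.

    Each position becomes the mean of itself and all earlier positions,
    which keeps the left-to-right (causal) information flow.
    """
    mixed = []
    for i in range(len(vectors)):
        prefix = vectors[: i + 1]
        mixed.append([sum(column) / len(prefix) for column in zip(*prefix)])
    return mixed

def next_token_id(ids):
    """Score every vocabulary entry from the last position and pick the best."""
    hidden = transformer_stack(embed(ids))[-1]
    logits = [sum(h * e for h, e in zip(hidden, row)) for row in EMBED]
    return max(range(len(VOCAB)), key=lambda i: logits[i])  # greedy decoding

def generate(prompt, n_new=3):
    """Append one predicted token at a time, feeding the output back in."""
    ids = tokenize(prompt)
    for _ in range(n_new):
        ids.append(next_token_id(ids))
    return [VOCAB[i] for i in ids]

tokens = generate("the cat", n_new=3)
```

A real model swaps each stand-in for a trained component, but the control flow stays the same: the whole sequence goes in, one token comes out, and that token is appended before the next step.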