Open Models at Google DeepMind — Cassidy Hardin, Google DeepMind
TL;DR
Gemma 4 is Google DeepMind’s biggest open-model reset yet — Cassidy Hardin says the new family spans four models, from on-device 2B/4B-class systems to 26B MoE and 31B dense models, with the 31B ranking #3 on the global arena and both larger models landing in the top six open-source models on LM Arena.
Google switched Gemma to Apache 2.0 on purpose — Hardin frames the licensing change as a developer-first move so teams can test, integrate, and deploy Gemma more freely across the full product lifecycle.
The 26B model hides a much smaller inference footprint: it’s Gemma’s first mixture-of-experts model, with 128 experts but only 8 activated per forward pass, so each token activates only about 3.8–3.9B parameters while still delivering near-flagship performance.
Gemma 4’s efficiency gains come from very specific architecture tweaks — DeepMind interleaves local and global attention at a 5:1 ratio, uses 512-token sliding windows on smaller models and 1,024 on larger ones, and applies grouped-query attention with 8 queries per KV head in global layers to cut memory cost.
The small on-device models use per-layer embeddings to escape VRAM limits: the “effective 2B” model runs on 2.3B active parameters while carrying 5.1B representational parameters, with 256-dimensional per-layer embedding tables stored in flash memory instead of VRAM so phones and laptops can run stronger models locally.
Multimodality got much more practical, especially for images — instead of Gemma 3’s “pan and scan” workaround, Gemma 4 supports variable aspect ratios, variable resolutions, and configurable image token budgets, while E2B and E4B also add audio via a 35M-parameter conformer for speech recognition and translation.
The Breakdown
Cassidy’s opening pitch: tiny open models, surprisingly big leap
Cassidy Hardin opens with real excitement: Gemma 4 is supposed to set “a new precedent” for small open-source models. She lays out the family clearly — two smaller on-device models for phones, iPads, and laptops, plus a 26B MoE and a 31B dense model — and emphasizes that the jump from Gemma 3 is unusually large.
The headline numbers: 31B dense and 26B MoE punch above their size
The larger models are the flex point here. Hardin says the 31B multimodal reasoning model ranks #3 on the global arena leaderboard, outperforming models more than 20x its size, while the 31B and 26B both sit in the top six open-source models on LM Arena. She also highlights a 256k context window, native thinking, function calling, and structured JSON output — all framed as built for autonomous workflows.
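The structured-output piece is easy to picture in code. Here is a minimal sketch with an invented tool schema and a hand-written reply (none of this is Gemma’s actual function-calling format): the model is prompted to answer in JSON matching a contract, and the caller validates before acting on it.

```python
import json

# Invented schema for illustration: a tool call must name a tool and
# pass a dict of arguments. The reply string stands in for model output.
SCHEMA = {"tool": str, "arguments": dict}

def parse_tool_call(reply: str) -> dict:
    call = json.loads(reply)  # raises if the model drifted from JSON
    for key, typ in SCHEMA.items():
        if not isinstance(call.get(key), typ):
            raise ValueError(f"bad or missing field: {key}")
    return call

reply = '{"tool": "search_flights", "arguments": {"from": "SFO", "to": "NRT"}}'
print(parse_tool_call(reply))
```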
Apache 2.0 and the “everyday developer” message
One of the biggest non-architecture announcements is the license change: Gemma is now under Apache 2.0. Hardin presents that as a deliberate choice to make the models easier to use from first experiments through deployment, not just as a research release you admire from a distance.
What changed under the hood: local/global attention and grouped-query tricks
On the dense architecture side, DeepMind changed attention in a pretty surgical way. Gemma 4 interleaves local and global layers at a 5:1 ratio (4:1 for the smallest model), with local layers using sliding windows of 512 tokens on smaller models and 1,024 on larger ones, while the final layer is always global so it can attend to all previous tokens. Because global attention is expensive, they added grouped-query attention, with 2 queries per KV head in local layers and 8 in global layers, then compensated by doubling the KV head dimension in global layers to 512.
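To make the interleave concrete, here is a minimal sketch of the layer schedule as described, assuming a hypothetical 12-layer depth (the talk doesn’t give layer counts); the 5:1 ratio, window sizes, and queries-per-KV-head figures are the ones Hardin quotes.

```python
# Minimal sketch of the reported attention layout. Depth is hypothetical;
# the ratio, windows, and GQA grouping follow the talk's numbers.

def attention_schedule(num_layers: int, local_per_global: int = 5,
                       window: int = 1024) -> list[dict]:
    """Interleave local sliding-window layers with global layers at a
    local_per_global:1 ratio; the final layer is always global."""
    schedule = []
    for i in range(num_layers):
        is_global = i % (local_per_global + 1) == local_per_global
        if i == num_layers - 1:
            is_global = True  # last layer attends to all previous tokens
        schedule.append({
            "layer": i,
            "kind": "global" if is_global else "local",
            # sliding window applies only to local layers
            # (512 on the smaller models, 1,024 on the larger ones)
            "window": None if is_global else window,
            # grouped-query attention: query heads sharing one KV head
            "queries_per_kv_head": 8 if is_global else 2,
        })
    return schedule

for layer in attention_schedule(num_layers=12):
    print(layer)
```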
The first Gemma MoE: 128 experts, only 8 active
Hardin then introduces Gemma’s first MoE model, the 26B. The setup pairs one always-on shared expert, three times the size of a normal expert, with 128 routed experts from which a router picks 8 per forward pass. The point of the design is straightforward: keep performance high without paying dense-model inference costs every time.
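Here is a toy version of that routing in numpy, using the talk’s numbers (128 routed experts, top-8, one always-on shared expert); the model width and the random linear “experts” are placeholders, not Gemma’s actual blocks.

```python
import numpy as np

# Toy sparse-MoE forward pass for a single token: score 128 experts,
# keep the top 8, mix their outputs by softmax gate, and always add
# the shared expert. Widths and weights are illustrative only.
d_model, n_experts, top_k = 64, 128, 8
rng = np.random.default_rng(0)

router_w = rng.standard_normal((d_model, n_experts)) * 0.02
experts = rng.standard_normal((n_experts, d_model, d_model)) * 0.02
shared = rng.standard_normal((d_model, d_model)) * 0.02  # 3x wider in the real model

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ router_w                # router scores all 128 experts
    top = np.argsort(logits)[-top_k:]    # indices of the 8 best
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                 # softmax over the chosen 8 only
    routed = sum(g * (x @ experts[i]) for g, i in zip(gates, top))
    return routed + x @ shared           # shared expert always fires

print(moe_layer(rng.standard_normal(d_model)).shape)  # (64,)
```

The arithmetic behind the footprint claim falls out of this structure: per-token compute scales with the 8 selected experts plus the shared one rather than all 128, which is how a 26B checkpoint ends up activating only about 3.8–3.9B parameters per token.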
Why the “effective” 2B and 4B models matter for real devices
The small models get a surprisingly clever explanation. Hardin distinguishes active parameters from “representational depth,” saying the effective 2B uses 2.3B parameters operationally while carrying 5.1B representational parameters. The key trick is per-layer embeddings: each layer gets its own 256-dimensional embedding table stored in flash memory instead of VRAM, which she frames as a direct answer to the memory bottlenecks on phones and laptops.
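A sketch of that trick under stated assumptions: the vocab size and layer count below are made up, and a memory-mapped file stands in for flash, but the 256-dimensional width is the figure from the talk, and the point (only one tiny row per layer ever touches fast memory) carries over.

```python
import numpy as np

# Per-layer embeddings, sketched: one small (vocab x 256) table per layer
# lives off-accelerator (a memory-mapped file standing in for flash), and
# each layer fetches just the 256-d row for the current token.
vocab, n_layers, ple_dim = 1000, 30, 256  # vocab/layers are made up

tables = np.memmap("ple_tables.bin", dtype=np.float16, mode="w+",
                   shape=(n_layers, vocab, ple_dim))
tables[:] = np.random.default_rng(0).standard_normal(tables.shape).astype(np.float16)
tables.flush()

def per_layer_embedding(token_id: int, layer: int) -> np.ndarray:
    # A single 512-byte read (256 fp16 values); VRAM never holds the table.
    return np.asarray(tables[layer, token_id])

print(per_layer_embedding(token_id=42, layer=7).shape)  # (256,)
```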
Multimodal went from patched-on to native
Hardin says Gemma 4 was built multimodal from the start, not retrofitted. The 31B and 26B use a 550M-parameter vision encoder, while the smaller models use a 150M encoder, and developers can now choose image resolutions and soft token budgets across five settings. She spends time on why this matters: in Gemma 3, “pan and scan” could turn one oddly shaped image into multiple square crops processed sequentially, but Gemma 4 now handles variable aspect ratios and resolutions directly.
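The configurable token budget reduces to simple arithmetic. Here is a sketch assuming a hypothetical 16-pixel patch size and invented budget values (the talk only says budgets are configurable across five settings): resize a variable-aspect-ratio image so its patch count fits the budget, no square crops required.

```python
import math

PATCH = 16  # hypothetical vision-encoder patch edge, in pixels

def fit_to_budget(width: int, height: int, token_budget: int) -> tuple[int, int]:
    """Scale (width, height), preserving aspect ratio, so that
    (w // PATCH) * (h // PATCH) stays within token_budget."""
    scale = min(1.0, math.sqrt(token_budget * PATCH**2 / (width * height)))
    w = max(PATCH, int(width * scale) // PATCH * PATCH)
    h = max(PATCH, int(height * scale) // PATCH * PATCH)
    return w, h

# A tall phone screenshot under a few hypothetical budget settings:
for budget in (256, 1024, 4096):
    w, h = fit_to_budget(860, 2400, budget)
    print(budget, (w, h), (w // PATCH) * (h // PATCH), "tokens")
```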
Audio, image token budgets, and how to actually try it
On the audio side, E2B and E4B add speech recognition and translation using an audio tokenizer plus a 35M-parameter conformer. Hardin closes by bringing it back to use: small models for on-device text/vision/audio, large models for coding, reasoning, and agentic workflows. You can self-host with weights from Hugging Face, Kaggle, or Ollama, or use the 26B and 31B in AI Studio and Vertex AI for faster prototyping.
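For the self-hosting route, a minimal sketch using Hugging Face transformers: the checkpoint id below is a released Gemma 3 model standing in for whichever Gemma 4 weights you actually pull down, since the talk doesn’t give exact hub ids.

```python
from transformers import pipeline

# Stand-in checkpoint: swap in the Gemma weights you downloaded.
# device_map="auto" requires the accelerate package.
pipe = pipeline("text-generation",
                model="google/gemma-3-1b-it",
                device_map="auto")

out = pipe("Explain mixture-of-experts routing in one paragraph.",
           max_new_tokens=120)
print(out[0]["generated_text"])
```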