Gemma, DeepMind's Family of Open Models — Omar Sanseviero, Google DeepMind
TL;DR
Gemma 4 is Google DeepMind’s most capable open model family yet, spanning 2B to 32B parameters — Omar Sanseviero frames the big win as “developer-friendly sizes,” with even the 32B model able to run on a consumer GPU and the smallest models fitting on phones and Raspberry Pi.
On-device AI is the point, not a side effect — he shows Gemma running fully offline in airplane mode on a phone for coding and agentic tasks, plus 10 parallel Gemma instances generating SVGs on a laptop via llama.cpp at roughly 100 tokens per second.
Google changed course on licensing with Gemma 4 and moved to Apache 2.0 — Sanseviero says the earlier Gemma licenses were a real community complaint, and the new release now gives developers the flexibility of a standard open-source license.
The weird “E” models are built for mobile efficiency through per-layer embeddings — Gemma E2B runs with roughly 2B active parameters even though the full model holds more than 4B, because part of the architecture acts like a lookup table that can sit on CPU or disk instead of the GPU.
Gemma 4 leans hard into multilingual and multimodal use cases — it was trained on 140+ languages, inherits Gemini’s tokenizer research, and supports images, video, and audio, including examples like speech-to-translated-text and image understanding with Japanese text.
The bigger story is the ecosystem scale: 10 million Gemma 4 downloads in a week and 500 million across the family — Sanseviero highlights 1,000+ Gemma 4 derivatives already, support from Hugging Face, llama.cpp, vLLM, MLX, and Unsloth, plus variants like ShieldGemma and MedGemma.
The Breakdown
A week after launch, Gemma 4 gets its first big stage
Omar Sanseviero opens with the timing: Gemma 4 shipped just seven days earlier, and this is his first conference talk about it. He positions Gemma as Google DeepMind’s family of open models you can download, run on your own hardware, and fine-tune yourself — not just a hosted API experience.
The pitch: tiny models that still punch hard
He contrasts Gemma 4 with the previous generation, saying Gemma 3 was already small enough to fit on a single consumer GPU while still scoring well in LM Arena. The new family runs from 2B to 32B, with a clear split: tiny multimodal models for phones and Raspberry Pi, a fast MoE model for low latency, and a 32B model for “raw intelligence” that still fits on consumer hardware.
Live demos of offline, on-device agents
This is the part he’s visibly excited about: Gemma playing piano via selectable “skills” on an Android phone, coding on-device in airplane mode with no API calls, and 10 parallel Gemma instances generating SVGs on a laptop. His point isn’t abstract benchmarking — it’s that these models can already do agentic, coding-heavy workflows locally with llama.cpp at about 100 tokens per second.
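The parallel-SVG demo boils down to fanning out many concurrent requests against a locally served model. A minimal sketch of that pattern, with a stub standing in for the actual llama.cpp completion call (the `generate_svg` function and its token counts are illustrative, not Gemma's real output):

```python
import time
from concurrent.futures import ThreadPoolExecutor

N_TOKENS = 50  # fabricated per-request token budget for the sketch

def generate_svg(prompt: str) -> str:
    """Stand-in for a local llama.cpp completion call; here we just
    fabricate N_TOKENS pseudo-tokens instead of running a model."""
    tokens = [f"<rect id='{i}'/>" for i in range(N_TOKENS)]
    return "<svg>" + "".join(tokens) + "</svg>"

prompts = [f"Draw icon #{i} as an SVG" for i in range(10)]

start = time.time()
# Ten in-flight generations at once, mirroring the ten-instance demo.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(generate_svg, prompts))
elapsed = time.time() - start

total_tokens = N_TOKENS * len(results)
print(f"{len(results)} SVGs generated in {elapsed:.2f}s "
      f"({total_tokens} tokens total)")
```

With a real backend, the worker would issue an HTTP request to the local server instead of building the string in-process; the fan-out logic stays the same.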
Why small-and-good matters more than big-and-flashy
Sanseviero admits LM Arena isn’t perfect, but uses it as a proxy for whether people actually like using the models. What excites him most is the trendline: over two years, Gemma keeps getting better without getting bigger, which feeds his bigger thesis that highly capable models will increasingly live “in our own devices, in our own pockets.”
The two practical upgrades: Apache 2.0 and the new “E” architecture
He addresses a sore spot directly: the old Gemma license “was not great,” so Gemma 4 now uses Apache 2.0. Then he explains the odd “E2B” label — these models use per-layer embeddings, a lookup-table-like design where only about 2B parameters need to be active on GPU while the rest can sit on CPU or disk, making them unusually well suited for mobile.
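The memory win of per-layer embeddings comes from the access pattern: a lookup table only needs the rows for the tokens currently in the batch, so the full tables can live off-accelerator. A toy sketch of that accounting (dimensions and counts are made up for illustration, not Gemma's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, n_layers, d_ple = 1000, 256, 8, 64  # toy sizes

# Core transformer weights: must stay resident on the GPU every step.
core_params = n_layers * (4 * d_model * d_model)  # rough attention/MLP count

# Per-layer embedding tables: one small lookup table per layer. These can
# sit on CPU or disk, because a forward pass only gathers a few rows.
ple_tables = [rng.standard_normal((vocab, d_ple)) for _ in range(n_layers)]
ple_params = n_layers * vocab * d_ple

token_ids = np.array([3, 17, 42])  # a tiny batch of token ids
# Only these rows ever need to move to the accelerator for this step.
fetched = [table[token_ids] for table in ple_tables]
rows_moved = sum(f.size for f in fetched)

print(f"PLE params stored off-GPU: {ple_params}, "
      f"values fetched this step: {rows_moved}")
```

The point of the sketch: the table parameters count toward total model size, but the per-step transfer is proportional to batch size, which is why only the "active" parameters need accelerator memory.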
Multimodal, multilingual, and tuned for real-world languages
The smallest Gemma 4 models can handle images, video, and audio, including speech recognition and translating spoken Spanish into written French. He also stresses the tokenizer story: Gemma was trained on 140+ languages with Gemini-derived multilingual infrastructure, which makes it unexpectedly strong for low-resource fine-tuning use cases like Quechua or official Indian languages.
The ecosystem is moving fast around Gemma
One week in, Gemma 4 had already hit 10 million base-model downloads, with 1,000+ community variants and 500 million downloads across the whole family. Sanseviero keeps returning to the same idea: success isn’t just the model itself, it’s making sure people can use it where they already are — Hugging Face, llama.cpp, vLLM, MLX, Unsloth, C Lang — without being forced into some Google-only stack.
From Android Studio to cancer research, the ambition is wider than chat
He spotlights Android Studio’s offline agent mode as a concrete product integration, with Gemma specifically trained on Android-related data to help with app development. Then he zooms out to the “Gemmaverse”: ShieldGemma for moderation, MedGemma for medical imaging tasks, multilingual work from AI Singapore and Sarvam, and even a DeepMind paper where Gemma-based research proposed cancer therapy pathways that were later validated in a real lab.