AI Engineer · 10m

Running LLMs on your iPhone: 40 tok/s Gemma 4 with MLX — Adrien Grondin, Locally AI

TL;DR

  • Gemma 4 can hit ~40 tok/s on an iPhone with MLX — Adrien Grondin demos Google’s Gemma 4 running fully offline in his Locally AI app, calling that speed “more than acceptable” for many real-world use cases.

  • Apple’s MLX stack makes on-device LLM apps surprisingly easy to ship — Grondin says with the MLX Swift LM repo, you can download a model from Hugging Face and have an iOS app running locally in “less than 10 minutes.”

  • The real trick is model selection and quantization, not just framework setup — He recommends sticking roughly to 4-bit through 8-bit quantization on iPhone, warning that going below 4-bit usually starts to noticeably hurt output quality.

  • The MLX ecosystem is now much bigger than text chat — He points to MLX Swift LM for Apple apps, plus MLX VLM, MLX Audio, and MLX Video as signs that Apple Silicon local AI now spans text, vision, audio, and generation workflows.

  • Hugging Face’s MLX community is becoming the default distribution layer for Apple-local models — Grondin says the community hosts roughly 4,000–5,000 models and often gets freshly released models quantized within about 30 minutes.

  • Locally AI is now part of LM Studio, extending the local-model story beyond mobile — He closes by noting Locally was acquired by LM Studio, which lets users run models through engines like llama.cpp and MLX and expose them through local OpenAI- or Anthropic-style APIs.

The Breakdown

Adrien’s pitch: local AI on iPhone is real now

Adrien Grondin introduces himself as the developer behind Locally AI, a native chatbot app for iPhone, iPad, and Mac that runs models on-device with Apple’s MLX. His core point lands fast: Google’s Gemma 4 isn’t just technically compatible with iPhone — the smaller variants are actually good and fast enough to matter.

MLX as the Apple-native engine underneath it all

He frames MLX as Apple’s framework for Apple Silicon, tuned for iPhone chips as well as Macs and iPads. The vibe here is practical, not theoretical: if you want to run a language model on Apple hardware, this is the stack he’s betting on, and it already supports not just Gemma but also Qwen and small Hugging Face models.

The “less than 10 minutes” path to your own app

For developers, the main pointer is the GitHub repo MLX Swift LM. Grondin doesn’t belabor implementation details — he jokes that “your agent” can do that — but says the API is straightforward enough that you can get a local-model iOS app up in under 10 minutes.
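Grondin's path is the Swift repo, and he deliberately skips implementation detail. For a flavor of what that load-and-generate shape looks like, here is a sketch using MLX's Python-side mlx-lm package instead (this Python analogue is illustrative, not from the talk; the model id is a hypothetical mlx-community example, and mlx-lm itself only runs on Apple Silicon):

```python
def run_local_chat(model_id: str, prompt: str, max_tokens: int = 256) -> str:
    # Imported lazily because mlx-lm requires Apple Silicon.
    from mlx_lm import load, generate

    # load() pulls the weights from Hugging Face on first use,
    # then everything runs fully on-device.
    model, tokenizer = load(model_id)
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)

if __name__ == "__main__":
    # Hypothetical mlx-community repo id, for illustration only.
    print(run_local_chat("mlx-community/gemma-2-2b-it-4bit", "Hello!"))
```

The Swift API in MLX Swift LM follows the same two steps — fetch a model by Hugging Face id, then stream tokens — which is what makes the "under 10 minutes" claim plausible.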

The MLX ecosystem is getting broad, fast

He zooms out to say MLX is no longer just about text. He name-checks Prince Canuma’s work on MLX VLM, MLX Audio, and MLX Video, arguing the ecosystem now covers vision-language models, audio, image generation, video generation, and even “omni models” and speech-to-speech workflows.

Where the models come from: Hugging Face’s MLX community

The practical sourcing advice is to browse the MLX community on Hugging Face, where he says there are roughly 4,000 to 5,000 uploaded models. His notable claim: when a lab releases a new model, quantized MLX versions in 4-bit, 6-bit, and other variants often appear within about 30 minutes, making Apple deployment almost immediate.

Quantization is the real deployment lever

Grondin explains that on iPhone, full-precision weights are usually too large, so quantization is mandatory. His rule of thumb is 4-bit to 8-bit: below 4-bit, he says, output quality starts dropping too much, while 8-bit makes sense mostly for smaller models. He also mentions tiny Liquid models around 300–350 million parameters that are fast enough to use in Shortcuts automations for lightweight text processing.
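The arithmetic behind that rule of thumb is simple enough to sketch. The estimate below counts weights only — it ignores quantization scale tables, any layers kept at higher precision, and runtime KV-cache memory — and the 4B parameter count is an illustrative size, not a claim about any specific Gemma variant:

```python
def approx_weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Rough weight-only size: params × bits, converted to gigabytes.

    Ignores quantization scales, mixed-precision layers, and KV cache.
    """
    return num_params * bits_per_weight / 8 / 1e9

# A hypothetical ~4B-parameter model at common quantization levels:
for bits in (16, 8, 4):
    print(f"{bits}-bit: {approx_weight_size_gb(4e9, bits):.1f} GB")
```

At 16-bit such a model is around 8 GB — hopeless on a phone — while 4-bit lands near 2 GB, which lines up with the 1–3 GB download sizes Grondin mentions later.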

The live demo: 40 tok/s offline doesn’t feel like a compromise

He briefly brings the demo back on screen to show what 40 tokens per second actually looks like on-device, and the point is mostly visceral: the stream is just fast. Older iPhones, he says, may only hit 20 tok/s, but that’s still plenty useful; the bigger current friction is model download size, around 1 GB to 3 GB depending on the model.
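To put those throughput numbers in wait-time terms — assuming a roughly 500-token answer, a length I picked for illustration, and counting decode time only (prompt-processing latency is ignored):

```python
def seconds_to_stream(num_tokens: int, tok_per_s: float) -> float:
    # Decode time only; ignores prefill (prompt-processing) latency.
    return num_tokens / tok_per_s

# The two speeds from the talk, for an assumed ~500-token answer:
print(seconds_to_stream(500, 40))  # 12.5 s on a recent iPhone
print(seconds_to_stream(500, 20))  # 25.0 s on an older one
```

Since the tokens stream as they are generated rather than arriving all at once, even the 20 tok/s case reads comfortably — which is the "more than acceptable" point Grondin is making.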

Locally’s next chapter and the Q&A on tool calling

Near the end, he drops the news that Locally has been acquired by LM Studio, which he describes as a local-model hub that can download models from Hugging Face, run them with engines like llama.cpp or MLX, and expose local OpenAI- or Anthropic-style APIs. In Q&A, he confirms MLX Swift LM already supports tool calling, though not custom structured generation yet, and clarifies the split between the developer package on GitHub and the ready-to-use consumer app on the App Store.
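An "OpenAI-style local API" means a server on your machine that speaks the familiar chat-completions wire format. The sketch below builds — but does not send — such a request using only the standard library; the localhost port and model name are illustrative defaults (LM Studio commonly serves on port 1234, but check your own setup), not details from the talk:

```python
import json
import urllib.request

def build_local_chat_request(prompt: str,
                             base_url: str = "http://localhost:1234/v1",
                             model: str = "local-model") -> urllib.request.Request:
    """Builds (does not send) an OpenAI-style chat-completions request
    aimed at a local server. Port and model name are assumed defaults."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_local_chat_request("Summarize this talk in one line.")
print(req.full_url)  # http://localhost:1234/v1/chat/completions
```

Because the wire format matches the hosted APIs, existing OpenAI client code can usually be pointed at a local server just by swapping the base URL — which is what makes this distribution model attractive for developers.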