AI Engineer · 21 min

From 46% to 90%: Fine-Tuning Tiny LLMs for On-Device Agents — Cormac Brick, Google

TL;DR

  • Google is pushing a two-track on-device strategy: system GenAI if it’s already on the phone, app-shipped tiny models if you need control — Cormac Brick frames AI Core/Gemini Nano as the easiest path and LiteRT/LiteRT-LM as the customizable fallback when you need boutique behavior or wider platform reach.

  • Gemma 4’s new skill flow lets tiny on-device agents feel practical, not just toy demos — in the open-source Google AI Edge Gallery app, a restaurant “roulette” and map skill are driven by prompt-based skill descriptions, on-demand skill loading, and custom JavaScript UI rendered inside the app.

  • The big lesson on tiny LLMs is specialization: once you drop to 200M–500M parameters, narrow tasks beat generality — Brick says models under 1B parameters can work well on-device, but only if they’re fixed-function or fine-tuned hard for specific jobs like function calling, transcription, or VLM tasks.

  • Fine-tuning turned Function Gemma from a mediocre app-intents model into a robust one — out-of-the-box accuracy was about 46%, but synthetic-data fine-tuning pushed it above 90% for 8 of 10 functions, which is the headline behind the talk’s “46% to 90%” claim.

  • Google’s own Eloquent transcription app is the ‘proof of life’ that tiny models can ship real product features — it chains a Gemma-3-based ASR model with a few-hundred-million-parameter text-polishing model to do offline transcription, personal dictionaries, and cleanup of “ums” and “ahs.”

The Breakdown

Start with the device, then decide how much AI you need

Brick opens with the practical case for on-device AI: latency, privacy, offline reliability, and cost. He positions Google’s AI Edge stack — MediaPipe, LiteRT-LM, and LiteRT — as the plumbing behind this, noting the runtime already reaches 2.7 billion+ devices and runs across CPU, GPU, and NPU.

System GenAI vs. shipping your own model

The first real framework in the talk is a choice: use system-level GenAI like Gemini Nano via AI Core if it covers your use case, or ship your own app-level GenAI if you need something more custom. His pitch is refreshingly blunt — AI Core is “great” because your app doesn’t get bigger, but app-shipped models give you full customization at the cost of more work.

The AI Edge Gallery app as Google’s live sandbox

Brick uses the Google AI Edge Gallery app — available on Android and iOS, with Android code open source — as the showcase for what local LLMs can actually do today. It supports chat, image Q&A, audio transcription, and third-party models like Qwen and Phi, but the star of this talk is the new skills system layered on top.

The restaurant roulette demo and how skills actually work

The memorable demo is silly in a good way: a restaurant skill builds a roulette wheel and picks a winner. Under the hood, it’s mostly prompting — the app injects skill descriptions into the prompt, the model decides it needs a skill, calls a built-in “load skill” tool, and then executes custom JavaScript to render the result in-app, whether that’s a roulette wheel or a Google Maps view.
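The flow described above (inject skill descriptions, let the model ask for a skill, load it on demand, hand its JavaScript to the app) can be sketched roughly as follows. This is an illustration under stated assumptions, not the Gallery app's actual code: the `Skill` fields, the `load_skill` signature, the skill names, and the keyword-matching `stub_model` that stands in for the on-device LLM are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str   # short text injected into every prompt
    instructions: str  # full body, loaded only when the model asks for it
    script: str        # JavaScript the app would render in its own UI


# Hypothetical registry; names and contents are made up for illustration.
SKILLS = {
    "restaurant_roulette": Skill(
        name="restaurant_roulette",
        description="Pick a random restaurant and show a roulette wheel.",
        instructions="Collect candidate restaurants, then call spin(names).",
        script="function spin(names) { /* draw wheel, pick a winner */ }",
    ),
    "map_view": Skill(
        name="map_view",
        description="Show a place on a map.",
        instructions="Resolve the place name, then call showMap(place).",
        script="function showMap(place) { /* render a map view */ }",
    ),
}


def build_system_prompt(skills):
    """Inject only the short descriptions, not the full skill bodies."""
    lines = ["You can load a skill by calling load_skill(name).", "Available skills:"]
    lines += [f"- {s.name}: {s.description}" for s in skills.values()]
    return "\n".join(lines)


def load_skill(name, skills):
    """Built-in tool: return the full skill body on demand."""
    skill = skills[name]
    return {"instructions": skill.instructions, "script": skill.script}


def stub_model(system_prompt, user_message):
    """Stand-in for the on-device LLM: keyword match instead of inference."""
    if "restaurant" in user_message.lower():
        return {"tool": "load_skill", "args": {"name": "restaurant_roulette"}}
    return {"text": "No skill needed."}


def run_turn(user_message, skills=SKILLS):
    prompt = build_system_prompt(skills)
    reply = stub_model(prompt, user_message)
    if reply.get("tool") == "load_skill":
        name = reply["args"]["name"]
        loaded = load_skill(name, skills)
        # A real app would pass loaded["script"] to its embedded web view here.
        return {"skill": name, **loaded}
    return reply
```

Keeping only the one-line descriptions in the prompt and loading the full body on request is what keeps the context small, which matters more for a tiny model than for a large one.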

Vibe-coding your own skills — and the community already is

Brick says the team has already made around 80 skills, often using Gemini CLI or Code Assist-style prompting to generate them. The whole thing is designed to be hackable: publish a skill to GitHub, load it in the app from a URL, and share it back with the community — which he notes started building examples almost immediately after launch.

Tiny models are getting real, but only if you respect their limits

On the tooling side, LiteRT-LM packages a model into a single file format and lets developers export from Transformers with LiteRT-Torch, test on desktop, then deploy to mobile. His examples make the point concrete: Apple’s FastVLM at 500M parameters can run fast on a Qualcomm NPU, while Function Gemma at 270M parameters is tuned for robust function calling rather than broad general chat.

The 46% to 90% jump came from synthetic-data fine-tuning

This is the punchline: for app intents, Function Gemma started at around 46% success out of the box. After the team generated synthetic training data with Gemini Flash and fine-tuned on the app’s specific functions, the model cleared 90% on 8 of 10 functions. Brick presents this as the practical path to making tiny models reliable enough to ship.
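A per-function breakdown like "above 90% on 8 of 10 functions" could be computed along these lines. This is a minimal sketch, assuming success means the predicted function name exactly matches the expected one. The talk summary does not say how success was scored, and a real evaluation would likely check arguments too. The function names and toy results are hypothetical.

```python
from collections import defaultdict


def per_function_accuracy(results):
    """results: iterable of (expected_function, predicted_function) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for expected, predicted in results:
        totals[expected] += 1
        if predicted == expected:
            correct[expected] += 1
    return {name: correct[name] / totals[name] for name in totals}


def functions_above(accuracy, threshold=0.9):
    """Names of functions whose accuracy is strictly above the threshold."""
    return sorted(name for name, acc in accuracy.items() if acc > threshold)


# Toy, made-up evaluation results: (expected, predicted).
toy_results = (
    [("set_alarm", "set_alarm")] * 10
    + [("send_message", "send_message")] * 2
    + [("send_message", "set_alarm")] * 2
)

accuracy = per_function_accuracy(toy_results)
passing = functions_above(accuracy)
```

Reporting accuracy per function rather than as one overall number shows which intents still need more synthetic examples.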

Eloquent is the production proof that chained tiny models can work

He closes with Eloquent, an offline transcription app on iOS, as a real-world example of tiny LLMs in production. Instead of one giant model, it chains a Gemma-3-based ASR model with a separate text-polishing model, each only a few hundred million parameters. Together they support personal dictionaries and strip spoken filler such as “ums” and “ahs,” which is exactly the kind of annoyance users actually notice.
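The two-stage shape of that pipeline can be illustrated with a toy chain. In this sketch both stages are stand-ins: `stub_asr` returns a fixed raw transcript instead of running a speech model, and `polish` uses regular expressions where Eloquent, per the talk, uses a second small LLM. The function names, the dictionary format, and the sample transcript are all assumptions for illustration.

```python
import re

# Spoken fillers to strip; a rule-based stand-in for the polishing model.
FILLERS = re.compile(r"\b(?:um+|uh+|ah+)\b\s*", flags=re.IGNORECASE)


def stub_asr(audio: bytes) -> str:
    """Stand-in for the ASR stage: returns a fixed, unpunctuated transcript."""
    return "um so I met cormack at the uh light r t booth"


def polish(raw: str, personal_dictionary: dict[str, str]) -> str:
    """Remove fillers, apply the user's preferred spellings, tidy spacing."""
    text = FILLERS.sub("", raw)
    for heard, preferred in personal_dictionary.items():
        pattern = rf"\b{re.escape(heard)}\b"
        # A callable replacement keeps the preferred spelling literal.
        text = re.sub(pattern, lambda _m, p=preferred: p, text, flags=re.IGNORECASE)
    text = re.sub(r"\s+", " ", text).strip()
    return text[:1].upper() + text[1:]


def transcribe(audio: bytes, personal_dictionary: dict[str, str]) -> str:
    """Chain the two stages: raw ASR output feeds the polishing step."""
    return polish(stub_asr(audio), personal_dictionary)


polished = transcribe(b"", {"cormack": "Cormac", "light r t": "LiteRT"})
```

Splitting the work this way lets each stage stay small and narrowly specialized, which matches the talk's broader point about tiny models.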

What still isn’t solved: multi-skill orchestration in one shot

In Q&A, Brick is candid about the limits. Skill selection across the turns of a conversation works “reasonably well,” for example asking for a Wikipedia fact and then showing it on Google Maps. The harder, unsolved problem is getting the model to call multiple skills reliably within a single response, which he says still works only sometimes while the team figures out the limits of the harness.
