AI Engineer | June 29, 2026 | 30m

Frontier results, on device - RL Nabors, Arize

TL;DR

Small language models (SLMs) consume about 25% of the energy that LLMs require for equivalent tasks, and task-specific models use about half of that, making local inference far more efficient.
The "prototype big, deploy small" framework means proving feasibility with the most capable model first, then systematically testing smaller models until you find one that meets your criteria.
Llama 3.2 3B beat Gemma 4 and Qwen in the speaker's evaluation for social thread summarization, reaching 90% accuracy measured against the Claude Sonnet baseline.
Few-shot prompting narrowed the accuracy gap between Llama and Claude from 10% to near parity, while explicit negative constraints actually made performance worse.
Post-processing can fix structural issues like JSON validity and reference accuracy, meaning you don't need perfect model outputs to ship production-ready features.
Regression evals prevent CTO-induced disasters, catching when prompt or model changes break your agentic workflows before users notice.
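
The post-processing point above can be illustrated with a minimal sketch. It assumes the model is asked for a JSON object with a `references` list of post IDs. That schema, the `repair_summary` helper, and the `valid_post_ids` parameter are hypothetical and are not taken from the talk.

```python
import json
import re
from typing import Optional, Set


def repair_summary(raw_output: str, valid_post_ids: Set[str]) -> Optional[dict]:
    """Pull a JSON object out of raw model text and drop unknown references.

    Returns None when no parseable JSON object is found, so the caller
    can retry or fall back to another model.
    """
    # Small models often wrap JSON in prose or code fences, so take the
    # span from the first "{" to the last "}".
    match = re.search(r"\{.*\}", raw_output, re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    # Assumed schema: "references" is a list of post IDs. Keep only IDs
    # that exist in the source thread, discarding hallucinated ones.
    refs = data.get("references", [])
    if not isinstance(refs, list):
        refs = []
    data["references"] = [
        r for r in refs if isinstance(r, str) and r in valid_post_ids
    ]
    return data
```

A caller would pass the raw completion plus the set of IDs from the thread being summarized, and treat a `None` result as a failed generation.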

The Breakdown

RL Nabors from Arize demonstrates how to replace expensive frontier-model API calls with local small language models through a systematic evaluation process. The case study shows that Llama 3.2 3B can match Claude Sonnet's performance on summarization tasks while eliminating API costs entirely. The talk introduces the SAGE model approach, selecting the smallest model that delivers acceptable results, and walks through a real case study using Phoenix to evaluate and compare models.
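
The idea of selecting the smallest model that delivers acceptable results can be sketched as a simple search over candidates. This is an illustration under stated assumptions, not the speaker's implementation: `Candidate`, `pick_smallest_passing`, `score_fn`, and the 0.9 default ratio are hypothetical, and `score_fn` stands in for whatever evaluation harness produces an accuracy score for a model.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Candidate:
    name: str
    params_billions: float  # used only to order candidates by size


def pick_smallest_passing(
    candidates: Sequence[Candidate],
    score_fn: Callable[[str], float],
    baseline_score: float,
    min_ratio: float = 0.9,
) -> Optional[Candidate]:
    """Return the smallest candidate scoring at least min_ratio * baseline_score.

    baseline_score is the score of the large model used to prove
    feasibility. Returns None if no candidate clears the bar.
    """
    threshold = min_ratio * baseline_score
    # Walk from smallest to largest so the first pass is the smallest pass.
    for cand in sorted(candidates, key=lambda c: c.params_billions):
        if score_fn(cand.name) >= threshold:
            return cand
    return None
```

In practice `score_fn` would run the same evaluation set against each model, so every candidate is judged on identical inputs and against the same baseline.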