AI Engineer | June 10, 2026 | 20m

Stop Making Models Bigger, Make Them Behave — Kobie Crawford, Snorkel

TL;DR

A tiny model beat a giant one on the target task: Snorkel and UC Berkeley's RLLM team trained a 4B model to outperform a 235B Qwen3 model on financial analysis tool use, with pass@1 roughly doubling after RL.
The failure mode was behavior, not reasoning: The 235B model guessed SQL against a nonexistent table, failed twice, then hallucinated an answer. The fine-tuned 4B model first called get_table_names, inspected the schema, and corrected its own query error.
The training run was cheap enough to matter: The RL job used GRPO, ran in about 21 hours, and cost under $500 per run, which Crawford framed as proof that behavior tuning can be practical for teams trying to ship smaller on-prem models.
Single-table training generalized better than expected: Training only on single-table questions produced the best uplift, yet still improved the harder multi-table FinQA Reasoning benchmark from 13.9 to 26.6.
High-quality expert data was the core bet: Snorkel built the dataset with domain experts in the loop, verified tasks and answers, and argued that carefully chosen data is what lets RL target the exact behavior a production system is missing.
Rubrics help find the real bug before RL starts: Crawford said richer eval rubrics can break a model's failures into specific behaviors, so teams can identify whether they need more knowledge, better tool use, or another targeted fix instead of just swapping in a bigger model.
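As a rough illustration of the GRPO step mentioned above: the core idea of GRPO is to sample several responses to the same prompt and score each one relative to its own group, rather than against a learned value model. The sketch below shows only that group-relative advantage calculation. The talk does not describe the reward function or hyperparameters, so the binary rewards, the `eps` value, and the use of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples for the same prompt.

    rewards: one scalar reward per sampled response in the group.
    eps: small constant (assumed value) that avoids division by zero
         when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts for one question:
# 1.0 = final answer verified correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Correct rollouts get a positive advantage, incorrect ones a negative one.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

When every rollout in a group earns the same reward, all advantages come out as zero and that prompt contributes no learning signal, which is one reason task difficulty and answer verification matter for this kind of training.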

The Breakdown

A 4 billion parameter model beat a 235 billion parameter model on financial tool use after a 21-hour RL run that cost under $500, because the real problem was not reasoning depth. It was tool discipline: learning to inspect tables, read schemas, recover from errors, and stop hallucinating.
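The difference between guessing and tool discipline can be sketched with a toy database. This is a minimal illustration, not the tool set used in the talk: only the name get_table_names comes from the source, while `get_table_schema`, `run_query`, the `income_statement` table, and its contents are hypothetical.

```python
import sqlite3


def get_table_names(conn):
    """List the tables that actually exist in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def get_table_schema(conn, table):
    """Return (column_name, column_type) pairs for a known table."""
    if table not in get_table_names(conn):
        raise ValueError(f"unknown table: {table}")
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    return [(row[1], row[2]) for row in rows]


def run_query(conn, sql):
    """Run SQL and hand any error back to the caller instead of raising."""
    try:
        return {"ok": True, "rows": conn.execute(sql).fetchall()}
    except sqlite3.Error as exc:
        return {"ok": False, "error": str(exc)}


# Hypothetical toy data standing in for a financial database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE income_statement (year INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO income_statement VALUES (?, ?)",
    [(2023, 100.0), (2024, 125.0)],
)

# Guessing: query a table name that was never checked.
guess = run_query(conn, "SELECT revenue FROM financials WHERE year = 2024")

# Discipline: list tables, read the schema, then write the query.
tables = get_table_names(conn)
schema = get_table_schema(conn, tables[0])
result = run_query(conn, f"SELECT revenue FROM {tables[0]} WHERE year = 2024")

print(guess)
print(tables, schema)
print(result)
```

Returning the error text from `run_query` instead of raising is a deliberate choice in this sketch: it gives an agent something concrete to react to, which is the recovery behavior the summary credits to the fine-tuned model.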