AI Engineer | June 10, 2026 | 20m

Stop Making Models Bigger, Make Them Behave - Kobie Crawford, Snorkel

TL;DR

A 4B model outperformed a 235B model on a finance tool-use task: Snorkel and UC Berkeley's RLLM team used RL to make a much smaller model behave better in a constrained FinQA environment.
The failure mode was tool discipline, not intelligence: The 235B model guessed SQL against a non-existent table and then hallucinated, while the fine-tuned 4B model first called get_table_names, checked schema, and corrected its own query after an error.
The training was cheap and fast: Crawford says each GRPO run took about 21 hours and cost less than $500, making this a practical path for teams that want smaller, self-hosted models.
Single-table training worked better than more complex curricula: Surprisingly, training only on single-table questions produced the strongest uplift, yet the gains generalized to harder multi-table FinQA Reasoning tasks too.
Pass@1 roughly doubled after training: On the harder benchmark, pass@1 rose from 13.9 to 26.6 (about 1.9x), which Crawford frames as evidence that fixing the right behavior can matter more than scaling model size.
Rubrics help find what behavior to train: Snorkel's broader point is to use detailed eval rubrics to locate the exact failure mode, then build targeted data and RL loops around that instead of defaulting to a bigger model.
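The tool-discipline point above can be made concrete with a small sketch. This is not the FinQA environment from the talk: it assumes a toy in-memory SQLite database, and only get_table_names is named in the source. The get_table_schema and run_sql helpers, the table, the columns, and the values are all hypothetical, chosen to show the difference between guessing a query and inspecting first.

```python
import sqlite3

def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database. Table, columns, and values are made up."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE income_statement (year INTEGER, revenue REAL)")
    conn.executemany(
        "INSERT INTO income_statement VALUES (?, ?)",
        [(2022, 100.0), (2023, 125.0)],
    )
    return conn

def get_table_names(conn: sqlite3.Connection) -> list[str]:
    """Tool 1: list the tables that actually exist."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]

def get_table_schema(conn: sqlite3.Connection, table: str) -> list[str]:
    """Tool 2 (hypothetical name): list the column names of a known table."""
    if table not in get_table_names(conn):
        raise ValueError(f"unknown table: {table}")
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]

def run_sql(conn: sqlite3.Connection, query: str):
    """Tool 3 (hypothetical name): return ('ok', rows) or ('error', message)."""
    try:
        return "ok", conn.execute(query).fetchall()
    except sqlite3.Error as exc:
        return "error", str(exc)

conn = make_demo_db()

# Undisciplined behavior: guess a table name and query it directly.
guess_status, guess_result = run_sql(
    conn, "SELECT revenue FROM financials WHERE year = 2023"
)

# Disciplined behavior: list tables, try a query, and on error read the
# schema and retry instead of inventing an answer.
tables = get_table_names(conn)
table = tables[0]
status, rows = run_sql(conn, f"SELECT sales FROM {table} WHERE year = 2023")
if status == "error":
    columns = get_table_schema(conn, table)  # recover: check real column names
    value_column = columns[1]                # 'revenue' in this toy schema
    status, rows = run_sql(
        conn, f"SELECT {value_column} FROM {table} WHERE year = 2023"
    )

print(guess_status, guess_result)
print(status, rows)
```

Returning the error as data rather than raising it is the design choice that matters here: the caller sees the failure message and can use it to pick the next tool call, which is the recovery behavior the summary describes.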

The Breakdown

A 4B model beat a 235B model on financial tool use after a 21-hour RL run that cost under $500, because the real problem was not raw reasoning power but sloppy tool behavior. Kobie Crawford shows how Snorkel and UC Berkeley improved performance by training the model to inspect tables, read schemas, and recover from errors instead of hallucinating answers.
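For readers unfamiliar with GRPO, the core idea is that several responses to the same prompt are sampled as a group, and each response is scored relative to the group's mean reward. The sketch below shows only that normalization step. The reward values are hypothetical (1.0 for a correct final answer, 0.0 otherwise), the talk summary does not state the actual reward function, and implementations differ on details such as which standard deviation they use.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own sampling group.

    Responses that beat the group mean get a positive advantage and are
    reinforced; responses below the mean get a negative advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four sampled responses to one question:
# two reached the correct answer, two did not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because the baseline is the group mean, no separate value model is needed, which is one reason a run of this kind can stay relatively cheap.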