0xSero · 39m

What exactly is REAP | LMStudio guide

TL;DR

  • REAP is targeted compression for mixture-of-experts models: Instead of shrinking every weight equally, Router-weighted Expert Activation Pruning (REAP) watches which experts light up on calibration prompts and removes the experts you do not need.

  • The size savings are huge: Sharif says a 60 GB model can drop to roughly 12.5% of its original size with recommended pruning plus quantization, and a 1.5 TB GLM 5.1 setup was brought down to about 370 GB.

  • Big models are worth cutting down because their remaining weights are often better than a native small model: His argument is that frontier models like GLM 5.1 benefit from better data, better engineers, and more training tokens, so preserving 30% of a huge model can beat starting with a much smaller one.

  • Quantization and pruning are different tools: Quantization changes 16-bit weights into smaller representations like 4-bit, while pruning literally removes parts of the model, and both can be combined depending on your hardware budget.

  • The economics of hosted AI look shaky: Sharif claims the same task can cost 5 to 11 times more than it did three years ago because of thinking tokens, tool calls, and larger prompts, and says heavy users on $20 to $200 subscriptions are often being subsidized.

  • You do not necessarily need to retrain the router after pruning experts: In the Q&A, he says the router can repoint to the remaining salient experts out of the box, though retraining can improve performance.
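
The scoring and re-routing ideas in the bullets above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the speaker's code or any LM Studio API: it assumes an expert's saliency is the mean of (router gate weight × expert output norm) over the calibration tokens routed to it, that gates are a softmax over the top-k logits, and that after pruning the router simply picks its top-k among the surviving experts. All function names and toy numbers are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k, allowed=None):
    """Return {expert_index: gate_weight} for the k best allowed experts."""
    candidates = range(len(logits)) if allowed is None else allowed
    chosen = sorted(candidates, key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in chosen])
    return dict(zip(chosen, gates))

def expert_saliency(router_logits, output_norms, k):
    """Mean of gate * output norm over the tokens routed to each expert."""
    n_experts = len(router_logits[0])
    totals = [0.0] * n_experts
    counts = [0] * n_experts
    for logits, norms in zip(router_logits, output_norms):
        for idx, gate in top_k_route(logits, k).items():
            totals[idx] += gate * norms[idx]
            counts[idx] += 1
    # Experts that never fire on the calibration set get a saliency of 0.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]

def experts_to_keep(saliency, keep_fraction):
    """Indices of the most salient experts, in ascending index order."""
    n_keep = max(1, round(len(saliency) * keep_fraction))
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:n_keep])

# Toy calibration data: 3 tokens, 4 experts (made-up numbers).
router_logits = [
    [2.0, 1.0, 0.1, -1.0],
    [1.5, 0.2, 1.4, -2.0],
    [0.3, 2.2, 0.1, -0.5],
]
output_norms = [
    [2.0, 1.0, 0.5, 3.0],
    [2.0, 1.0, 0.5, 3.0],
    [2.0, 1.0, 0.5, 3.0],
]

saliency = expert_saliency(router_logits, output_norms, k=2)
keep = experts_to_keep(saliency, keep_fraction=0.5)

# The second token originally used a pruned expert; restricting the router
# to the survivors sends it to the remaining experts without retraining.
before = top_k_route(router_logits[1], k=2)
rerouted = top_k_route(router_logits[1], k=2, allowed=keep)

print("saliency:", [round(s, 3) for s in saliency])
print("kept experts:", keep)
print("token 2 before:", sorted(before), "after:", sorted(rerouted))
```

Note that expert 3 has the largest output norm in the toy data but is never selected by the router, so it scores zero. That is the point of weighting by router activity on the prompts you care about rather than by weight magnitude alone.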

The Breakdown

A frontier MoE model can be cut to a fraction of its original size and still perform almost the same on the tasks you care about: Sharif cites roughly 12.5% for a 60 GB model with pruning plus quantization, and about 370 GB (roughly a quarter) for a 1.5 TB GLM 5.1 setup. He breaks down REAP, why giant models are getting economically absurd to run, and how pruning plus quantization can turn datacenter-only systems into something usable at home or inside a company.
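
The size figures quoted in this digest can be sanity-checked with simple arithmetic. The sketch below is a rough estimate under loud assumptions: the digest does not say how the 12.5% splits between pruning and quantization, so the example assumes half of the weights are kept and 16-bit weights become 4-bit. It also ignores layers that are not pruned and any quantization metadata, and treats 1.5 TB as 1500 GB. The function name is hypothetical.

```python
def compressed_size_gb(original_gb, keep_fraction, original_bits=16, target_bits=4):
    """Rough size after pruning, then quantizing what remains.

    Assumes every weight is equally prunable and ignores quantization
    scales and other metadata, so treat the result as a ballpark only.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    return original_gb * keep_fraction * target_bits / original_bits

# Assumed split: keep 50% of the weights, then go from 16-bit to 4-bit.
small = compressed_size_gb(60, keep_fraction=0.5)
small_ratio = small / 60

# Ratio implied by the quoted 1.5 TB -> 370 GB figure.
glm_ratio = 370 / 1500

print(f"60 GB model: {small} GB ({small_ratio:.1%} of original)")
print(f"1.5 TB -> 370 GB: {glm_ratio:.1%} of original")
```

Under these assumptions the 60 GB case lands at 7.5 GB, which matches the quoted 12.5%, while 370 GB out of 1.5 TB works out to roughly a quarter of the original size.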
