https://arxiv.org/abs/2311.10770
“UltraFastBERT”, apparently a variant of BERT that uses only 0.3% of its neurons during inference, performs on par with similar BERT models.
I hope that’s going to be available for all kinds of models in the near future!
I wonder if you could run a large dataset of prompts for one relatively narrow task through a model and record which neurons get activated, then use statistical measures to add a few surrounding neurons just in case.
Bet you could get away with near-zero loss in quality and massive parameter compression.
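Here is a rough sketch of what I mean, with plenty of assumptions. This is plain static pruning from activation statistics, not the method from the paper. A toy NumPy ReLU MLP stands in for one feed-forward block of a real model. The names `select_neurons`, `prune_mlp`, and `min_rate` are made up for illustration. The "statistical measure" is my own pick: instead of keeping only neurons whose observed firing rate clears a threshold, keep any neuron whose Wilson upper confidence bound clears it, so rarely firing neurons get the benefit of the doubt.

```python
import numpy as np

def wilson_upper(k, n, z=1.96):
    """Upper end of the Wilson score interval for a firing rate of k out of n."""
    p = k / n
    centre = p + z**2 / (2 * n)
    spread = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre + spread) / (1 + z**2 / n)

def select_neurons(hidden, min_rate=0.01, z=1.96):
    """Pick neurons to keep from recorded activations.

    hidden: (n_prompts, n_neurons) post-ReLU activations collected while
    running the task prompts through the layer (assumed to be given).
    Returns a boolean mask over neurons.
    """
    n = hidden.shape[0]
    fired = (hidden > 0).sum(axis=0)          # how often each neuron was active
    # Keep a neuron if its firing rate could plausibly be >= min_rate.
    return wilson_upper(fired, n, z) >= min_rate

def prune_mlp(W1, b1, W2, keep):
    """Drop the unselected hidden neurons from a two-layer MLP."""
    return W1[:, keep], b1[keep], W2[keep, :]

def mlp(x, W1, b1, W2):
    """Toy feed-forward block: ReLU(x @ W1 + b1) @ W2."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2
```

To use it you would record `hidden` over the task dataset, call `select_neurons(hidden)`, and pass the mask to `prune_mlp`. Neurons that never fire on the task data contribute nothing after the ReLU, so dropping them leaves the outputs on that data unchanged. Whether quality holds up on unseen prompts is exactly the open question.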