https://arxiv.org/abs/2311.10770
“UltraFastBERT”, apparently a variant of BERT that uses only 0.3% of its neurons during inference, performs on par with similar BERT models.
I hope that’s going to be available for all kinds of models in the near future!
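For intuition, here’s a rough, hypothetical sketch (PyTorch, inference only) of the conditional “fast feedforward” idea as I understand it from the paper: the neurons of a feedforward layer are arranged in a binary tree, each input walks a single root-to-leaf path, and only the visited neurons are evaluated, e.g. 12 out of 4095 (~0.3%). The sizes, node rule, and initialization below are my assumptions, not the paper’s actual implementation.

```python
import torch
import torch.nn as nn

class TreeFF(nn.Module):
    """Hypothetical sketch of a conditionally executed feedforward layer:
    neurons form a binary tree and each input evaluates only the neurons
    on one root-to-leaf path (12 of 4095 here, roughly 0.3%)."""

    def __init__(self, d_model=768, n_levels=12):
        super().__init__()
        n_nodes = 2 ** n_levels - 1                 # 4095 neurons in total
        self.n_levels = n_levels
        self.w_in = nn.Parameter(torch.randn(n_nodes, d_model) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_nodes, d_model) * 0.02)

    def forward(self, x):                           # x: (batch, d_model)
        out = torch.zeros_like(x)
        node = torch.zeros(x.shape[0], dtype=torch.long, device=x.device)  # every row starts at the root
        for _ in range(self.n_levels):
            act = (x * self.w_in[node]).sum(-1)     # one neuron per input, not the whole layer
            out = out + torch.relu(act).unsqueeze(-1) * self.w_out[node]
            node = 2 * node + 1 + (act > 0).long()  # the sign of the activation picks the child
        return out
```

The appeal is that the parameter count stays at 4095 neurons per layer while the per-token compute scales only with tree depth; the catch, discussed further down, is that this kind of fine-grained branching maps poorly onto GPU matrix-multiply hardware.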
Here are my notes:
Overview:
Benchmark averages:
Benchmarks that don’t degrade at all as more neurons are ignored:
Benchmarks that degrade:
Benchmarks that degrade substantially:
CoLA, which is addressed in the paper:
Corpus of Linguistic Acceptability (CoLA): sentences annotated as grammatically acceptable or not by experts.
Applicability to causal LMs such as Llama 2
With substantially more FF layers in Llama 2, this is concerning. Additionally, it’s not obvious to me that this works for a 7B to 70B parameter causal language model just because it works for a ~100M parameter bidirectional encoder. It would be great to see it tested, though!
Other
To add to that: GPUs do support “conditional” matrix multiplication; they just don’t benefit from that type of optimization. Essentially, it takes as much time to skip a computation as it does to perform it. In practice it can even take longer, since the extra logic required to keep track of which computations to skip adds overhead.
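A minimal timing sketch (assuming PyTorch; the sizes are arbitrary) of what I mean: zeroing out 99.7% of a weight matrix doesn’t make the matmul any faster, because the same dense kernel still streams through every weight.

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(64, 4096, device=device)
w_dense = torch.randn(4096, 4096, device=device)

# Keep only ~0.3% of the weights; the rest become zeros, not "skipped" work.
mask = (torch.rand_like(w_dense) < 0.003).float()
w_masked = w_dense * mask

def bench(w, iters=200):
    for _ in range(10):                  # warm-up
        x @ w
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        x @ w                            # identical dense kernel either way
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

print(f"dense : {bench(w_dense):.6f} s/iter")
print(f"masked: {bench(w_masked):.6f} s/iter")   # expect essentially the same number
```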
For this to make sense on a GPU, you need a way of completely sidestepping portions of the model, like the ability to skip whole layers that are not relevant (a bit like how MoE already works). If you have to load a weight from memory, or some sort of metadata to figure out what each individual weight is connected to, you’ve already spent as many resources on that weight as you would by simply using it in a streamlined matrix multiplication.
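By contrast, here’s a hypothetical MoE-flavoured sketch of the coarse-grained skipping that does pay off: a router picks one of two whole feedforward blocks per input, so the unchosen block’s weights are never loaded or multiplied at all. (The module name and sizes are made up for illustration; this is not what UltraFastBERT does.)

```python
import torch
import torch.nn as nn

class TwoExpertFF(nn.Module):
    """Hypothetical coarse-grained conditional compute: each input is routed to
    one of two complete feedforward blocks, so whole matmuls are skipped rather
    than individual weights."""

    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.router = nn.Linear(d_model, 2)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(2)
        )

    def forward(self, x):                            # x: (batch, d_model)
        choice = self.router(x).argmax(dim=-1)       # hard routing (inference-style)
        out = torch.empty_like(x)
        for i, expert in enumerate(self.experts):
            sel = choice == i
            if sel.any():                            # the other expert's weights are never touched
                out[sel] = expert(x[sel])
        return out
```

Training a hard router like this typically needs extra machinery (soft routing, load balancing), but at inference time the skipped expert genuinely costs nothing.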
The same holds, to a lesser extent, for efficient CPU implementations, which likewise rely on SIMD computation, regular memory layouts, and predictable control flow.