https://arxiv.org/abs/2311.10770
“UltraFastBERT”, apparently a BERT variant that uses only 0.3% of its neurons during inference, performs on par with comparable BERT models.
I hope that’s going to be available for all kinds of models in the near future!
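For anyone wondering what “0.3% of its neurons” means mechanically: as I understand it, the feedforward layers are organised as a binary tree and each token only walks one root-to-leaf path, so only a handful of neurons ever fire per token. Here is a toy NumPy sketch of that idea (my own illustration with made-up shapes, not the paper’s actual fast-feedforward code):

```python
import numpy as np

# Toy sketch of tree-routed conditional execution, in the spirit of the
# paper's "fast feedforward" layers but NOT their actual code or shapes.
# Each token walks one root-to-leaf path of a binary tree, so per token
# only `depth` routing neurons plus 1 leaf neuron fire out of thousands.

rng = np.random.default_rng(0)
d_model, depth = 64, 11
n_nodes = 2 ** depth - 1                         # 2047 routing neurons

node_w     = rng.standard_normal((n_nodes, d_model)) / np.sqrt(d_model)
leaf_w_in  = rng.standard_normal((2 ** depth, d_model)) / np.sqrt(d_model)
leaf_w_out = rng.standard_normal((2 ** depth, d_model)) / np.sqrt(d_model)

def fff_forward(x):
    """Evaluate `depth` routing neurons, then a single leaf neuron."""
    node = 0
    for _ in range(depth):
        go_right = node_w[node] @ x > 0            # one routing neuron fires
        node = 2 * node + (2 if go_right else 1)   # heap-style child index
    leaf = node - n_nodes                          # 0 .. 2**depth - 1
    act = max(leaf_w_in[leaf] @ x, 0.0)            # ReLU of the chosen neuron
    return act * leaf_w_out[leaf]                  # its contribution to the output

x = rng.standard_normal(d_model)
print(fff_forward(x).shape)   # (64,), touching 12 of ~4k neurons per call
```

With depth 11 that is 12 neurons touched out of 4095 per layer, which is where a figure like 0.3% can come from.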
Would be interesting to see if this can help speed up CPU inference with regular RAM; after all, 128 GB of DDR5 only costs around $300, which is peanuts compared to getting anywhere close to that much VRAM.
If it scales linearly, one could run a 100B model at the speed of a 3B one right now.
I am just gonna do some bad maths.
For the price of a single 4090 you can get:
CPU + mainboard combo with 16 RAM slots: $1,320
16 x 32 GB DDR4 RAM: $888
Mistral 7B runs at around 7 tokens per second on a regular CPU, which is roughly 5 words per second.
With the above setup’s 512 GB of RAM we can fit a 512B-parameter model (at roughly one byte per parameter). On the current architecture that would run at about 5 × 7 / 512 ≈ 0.068 words per second; if this new architecture actually works and gives a 78x speedup, that becomes about 5.3 words per second. The average person reads at around 4 words per second and speaks at around 2 words per second.
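Same napkin math in one place, so people can swap in their own numbers (the one-byte-per-parameter and linear-scaling assumptions are from this thread, not measured):

```python
# Back-of-envelope only: assumes ~1 byte per parameter (e.g. 8-bit quant)
# and that words/s scales inversely with parameter count.
ram_gb          = 16 * 32            # the $888 of DDR4 above
params_b        = ram_gb             # => a ~512B-parameter model fits
wps_7b          = 5                  # Mistral 7B: ~7 tok/s, ~5 words/s on CPU

dense_wps       = wps_7b * 7 / params_b      # ~0.068 words/s for 512B dense
claimed_speedup = 78                         # the paper's headline CPU figure
fff_wps         = dense_wps * claimed_speedup

print(f"512B dense: {dense_wps:.3f} words/s; with 78x: {fff_wps:.1f} words/s")
print("reading ~4 words/s, speaking ~2 words/s")
```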
Fingers crossed this can put a small dent in Nvidia’s stock price.
I doubt it; most of their leverage is in being the only supplier of the hardware required for pretraining foundation models. This doesn’t really change that.
If it works that way, it will only be short term. The only reason it doesn’t run on a GPU is the conditional matrix ops, so the GPU makers will just add them, and then they’ll be back on top with the same margins again.
Also, they say the speedup decreases with more layers, so the bigger the model, the smaller the benefit. A 512B model is much bigger than a 7B model, so the speedup would be much smaller, possibly nonexistent.
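Even setting the layer-count effect aside, there’s a simpler cap: only the feedforward share of inference time gets the big per-layer speedup, so the end-to-end gain is Amdahl-limited. A quick illustration with made-up fractions (my framing, not the paper’s):

```python
# Amdahl-style cap (illustrative fractions): only the share f of inference
# time spent in feedforward layers gets the per-layer speedup s.
def overall_speedup(f, s=78):
    return 1.0 / ((1.0 - f) + f / s)

for f in (0.5, 0.7, 0.9):
    print(f"feedforward share {f:.0%}: {overall_speedup(f):.1f}x end to end")
# 50% -> ~2.0x, 70% -> ~3.2x, 90% -> ~9.0x of a nominal 78x layer speedup
```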