https://arxiv.org/abs/2311.10770
“UltraFastBERT”, apparently a BERT variant that uses only 0.3% of its neurons during inference, performs on par with comparable BERT models.
I hope that’s going to be available for all kinds of models in the near future!
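For anyone wondering what “0.3% of its neurons” means mechanically: as I understand it, the feedforward layers are organised as a binary tree and each token only walks one root-to-leaf path, so only a handful of neurons ever fire per token. Here is a toy NumPy sketch of that idea (my own illustration with made-up shapes, not the paper’s actual fast-feedforward code):

```python
import numpy as np

# Toy sketch of tree-routed conditional execution, in the spirit of the
# paper's "fast feedforward" layers but NOT their actual code or shapes.
# Each token walks one root-to-leaf path of a binary tree, so per token
# only `depth` routing neurons plus 1 leaf neuron fire out of thousands.

rng = np.random.default_rng(0)
d_model, depth = 64, 11
n_nodes = 2 ** depth - 1                         # 2047 routing neurons

node_w     = rng.standard_normal((n_nodes, d_model)) / np.sqrt(d_model)
leaf_w_in  = rng.standard_normal((2 ** depth, d_model)) / np.sqrt(d_model)
leaf_w_out = rng.standard_normal((2 ** depth, d_model)) / np.sqrt(d_model)

def fff_forward(x):
    """Evaluate `depth` routing neurons, then a single leaf neuron."""
    node = 0
    for _ in range(depth):
        go_right = node_w[node] @ x > 0            # one routing neuron fires
        node = 2 * node + (2 if go_right else 1)   # heap-style child index
    leaf = node - n_nodes                          # 0 .. 2**depth - 1
    act = max(leaf_w_in[leaf] @ x, 0.0)            # ReLU of the chosen neuron
    return act * leaf_w_out[leaf]                  # its contribution to the output

x = rng.standard_normal(d_model)
print(fff_forward(x).shape)   # (64,), touching 12 of ~4k neurons per call
```

With depth 11 that is 12 neurons touched out of 4095 per layer, which is where a figure like 0.3% can come from.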
Would be interesting to see if this can help speed up CPU inference with regular RAM; after all, 128 GB of DDR5 only costs around $300, which is peanuts compared to getting anywhere close to that much VRAM.
If it scales linearly, one could run a 100B model at the speed of a 3B one right now.
I am just gonna do some bad maths.
For the price of a single 4090 you can get:
CPU + mainboard combo with 16 RAM slots: $1,320
16 x 32 GB DDR4 RAM: $888
Mistral 7B runs at around 7 tokens per second on a regular CPU, which is roughly 5 words per second.
With the above setup’s 512 GB of RAM we can fit a 512B-parameter model (at roughly one byte per parameter). On the current architecture that would run at about 5 × 7 / 512 ≈ 0.068 words per second; if this new architecture actually works and gives a 78x speedup, that becomes about 5.3 words per second. The average person reads at around 4 words per second and speaks at around 2 words per second.
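Same napkin math in one place, so people can swap in their own numbers (the one-byte-per-parameter and linear-scaling assumptions are from this thread, not measured):

```python
# Back-of-envelope only: assumes ~1 byte per parameter (e.g. 8-bit quant)
# and that words/s scales inversely with parameter count.
ram_gb          = 16 * 32            # the $888 of DDR4 above
params_b        = ram_gb             # => a ~512B-parameter model fits
wps_7b          = 5                  # Mistral 7B: ~7 tok/s, ~5 words/s on CPU

dense_wps       = wps_7b * 7 / params_b      # ~0.068 words/s for 512B dense
claimed_speedup = 78                         # the paper's headline CPU figure
fff_wps         = dense_wps * claimed_speedup

print(f"512B dense: {dense_wps:.3f} words/s; with 78x: {fff_wps:.1f} words/s")
print("reading ~4 words/s, speaking ~2 words/s")
```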
Fingers crossed this can put a small dent in Nvidia’s stock price.
I doubt it; most of their leverage is in being the only supplier of the hardware required for pretraining foundation models. This doesn’t really change that.
If it works that way, it will only be short term. The only reason it doesn’t run on a GPU is the conditional matrix ops, so the GPU makers will just add them, and then they’ll be back on top with the same margins again.
Also, they say the speedup decreases with more layers, so the bigger the model, the smaller the benefit. A 512B model is much bigger than a 7B model, so the speedup would be much smaller, possibly nonexistent.
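Even setting the layer-count effect aside, there’s a simpler cap: only the feedforward share of inference time gets the big per-layer speedup, so the end-to-end gain is Amdahl-limited. A quick illustration with made-up fractions (my framing, not the paper’s):

```python
# Amdahl-style cap (illustrative fractions): only the share f of inference
# time spent in feedforward layers gets the per-layer speedup s.
def overall_speedup(f, s=78):
    return 1.0 / ((1.0 - f) + f / s)

for f in (0.5, 0.7, 0.9):
    print(f"feedforward share {f:.0%}: {overall_speedup(f):.1f}x end to end")
# 50% -> ~2.0x, 70% -> ~3.2x, 90% -> ~9.0x of a nominal 78x layer speedup
```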