Fitting 70B models in a 4GB GPU. The whole model, no quants or distillation or anything!

vatsadev@alien.top · 2 years ago

Spirited_Employee_61@alien.top · 2 years ago

If we can fit 1 layer at a time, can we do 3 or 4 at a time? A bit bigger but a bit faster than 1 at a time. Or am I dreaming?

ron_krugman@alien.top · 2 years ago

That doesn’t make much of a difference. You still have to transfer the whole model to the GPU for every single inference step, so the total amount of data moved per step is the same no matter how many layers you load at once. The GPU only saves you time if you can load the model (or parts of it) once and then do lots of inference steps.
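
A rough back-of-envelope sketch of that point. All the numbers here are illustrative assumptions, not figures from the thread: fp16 weights (2 bytes per parameter), 80 equal-sized layers, and a host-to-GPU link of about 16 GB/s. It ignores embeddings, activations, and compute time, and only adds up how many bytes have to cross the link per inference step.

```python
# Assumed, illustrative numbers (not from the thread).
N_PARAMS = 70_000_000_000       # 70B parameters
BYTES_PER_PARAM = 2             # fp16, no quantization
N_LAYERS = 80                   # assumed equal-sized layers
BANDWIDTH = 16_000_000_000      # assumed host-to-GPU bytes per second


def transfer_seconds_per_step(layers_per_chunk: int) -> float:
    """Time spent just moving weights to the GPU for one inference step."""
    bytes_per_layer = N_PARAMS * BYTES_PER_PARAM // N_LAYERS
    total_bytes = 0
    # Walk through the model chunk by chunk, as layer-wise loading would.
    for start in range(0, N_LAYERS, layers_per_chunk):
        chunk_layers = min(layers_per_chunk, N_LAYERS - start)
        total_bytes += chunk_layers * bytes_per_layer
    return total_bytes / BANDWIDTH


if __name__ == "__main__":
    for k in (1, 2, 4):
        print(f"{k} layer(s) per chunk: {transfer_seconds_per_step(k)} s per step")
```

Under these assumptions the transfer time comes out the same for every chunk size, because every layer still has to be copied once per step. Bigger chunks may shave off some per-transfer overhead, but they can't reduce the total bytes moved.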