Fitting 70B models in a 4GB GPU: the whole model, no quants or distillation or anything!

vatsadev@alien.top · 2 years ago

xinranli@alien.top · 2 years ago

This seems like a brilliant and almost obvious idea. Is there a reason this method wasn’t a thing before, besides the PCIe bandwidth and storage speed requirements?

fallingdowndizzyvr@alien.top · 2 years ago

Because it wouldn’t be any faster than doing CPU inference, since both CPUs and GPUs are already waiting around for data to process. It’s the I/O that’s the limiter. This changes none of that.
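
A rough way to see the I/O argument is to estimate a lower bound on time per token, assuming every weight has to be read once per generated token and that the read is the only cost. This is a back-of-envelope sketch: the bandwidth figures are illustrative assumptions, not measurements, and `seconds_per_token` is a hypothetical helper.

```python
def seconds_per_token(n_params, bytes_per_param, bandwidth_gb_s):
    """Lower bound on seconds per token if all weights are read once per token."""
    model_bytes = n_params * bytes_per_param
    return model_bytes / (bandwidth_gb_s * 1e9)


N_PARAMS = 70e9   # 70B parameters
FP16_BYTES = 2    # unquantized half-precision weights

# Assumed, illustrative sustained bandwidths in GB/s (not measured values).
links = {
    "NVMe SSD": 5.0,
    "PCIe 4.0 x16": 25.0,
    "dual-channel DDR4": 50.0,
}

for name, bandwidth in links.items():
    t = seconds_per_token(N_PARAMS, FP16_BYTES, bandwidth)
    print(f"{name}: at least {t:.1f} s/token")
```

Under these assumptions the slowest link in the chain sets the pace, no matter how fast the processor at the end of it is.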

radianart@alien.top · 2 years ago

Is there a better way to use models bigger than can fit in RAM/VRAM? I’d like to try 70B or maybe even 120B, but I only have 32 GB of RAM and 8 GB of VRAM.

TheTerrasque@alien.top · 2 years ago

70B? Use a Q4 quant with llama.cpp and put some layers on the GPU.

You might need to run Linux to get the system RAM usage low enough.
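
To get a feel for why this is tight on 32 GB of RAM plus 8 GB of VRAM, here is a rough sizing sketch. Everything numeric in it is an assumption for illustration: about 4.5 bits per weight for a Q4 quant, 80 layers (as in Llama 2 70B), equal-sized layers, and 2 GB of VRAM held back for context and buffers. The helper names are hypothetical and are not part of llama.cpp.

```python
def quantized_size_gb(n_params, bits_per_weight):
    """Approximate size of the quantized weights in GB."""
    return n_params * bits_per_weight / 8 / 1e9


def layers_on_gpu(model_gb, n_layers, vram_gb, vram_reserve_gb):
    """How many equal-sized layers fit in VRAM after holding back a reserve."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - vram_reserve_gb, 0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Assumed values, for illustration only.
N_PARAMS = 70e9
BITS_PER_WEIGHT = 4.5   # rough figure for a Q4 quant
N_LAYERS = 80           # Llama 2 70B layer count
VRAM_GB, RAM_GB = 8, 32
VRAM_RESERVE_GB = 2     # context and scratch buffers, a guess

model_gb = quantized_size_gb(N_PARAMS, BITS_PER_WEIGHT)
gpu_layers = layers_on_gpu(model_gb, N_LAYERS, VRAM_GB, VRAM_RESERVE_GB)
ram_needed_gb = model_gb - gpu_layers * (model_gb / N_LAYERS)

print(f"model: {model_gb:.1f} GB, layers on GPU: {gpu_layers}")
print(f"left for system RAM: {ram_needed_gb:.1f} GB of {RAM_GB} GB")
```

With these assumed numbers the weights come to roughly 39 GB against 40 GB of combined memory, so how much RAM the operating system itself takes matters a lot. In llama.cpp the number of offloaded layers is set with `-ngl` (`--n-gpu-layers`).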