I want to run a 70B LLM locally at more than 1 T/s. I have a 3090 with 24GB VRAM and 64GB of system RAM.
What I've managed so far:
- Found instructions to run a 70B entirely in VRAM using a 2.5-bit quantization. It ran fast, but the perplexity was unbearable: the LLM was barely coherent.
- More or less by accident, I got a 70B running with some variation of RAM/VRAM offloading, but it ran at 0.1 T/s.
I've seen people claiming reasonable T/s speeds. Since I am a newbie, I can barely speak the domain language, and most instructions I found assume implicit knowledge I don’t have*.
I need explicit instructions on exactly which 70B model to download, which model loader to use, and how to set the parameters that matter in this context.
I actually don’t know how much overhead that’s going to be. I’d start by just kicking it off on the command line first as a proof of concept; it’s super easy.
5_K_M is just the quantization I use. It costs almost nothing in perplexity with 5_K_M, but it’s also larger than the 4-bit quants, which are what most people use.
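To put rough numbers on the size difference, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration, not a measurement: about 4.8 and 5.7 bits per weight for the 4-bit and 5-bit K-quants, 80 layers of equal size (as in Llama-2-70B), and 4GB of the 3090's VRAM held back for context and overhead. The helper names are made up for this example.

```python
def quant_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def layers_that_fit(model_gb: float, n_layers: int,
                    vram_gb: float, reserve_gb: float) -> int:
    """Number of equally sized layers that fit in VRAM after a reserve."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# All values below are assumptions, not measured figures.
N_PARAMS = 70e9      # nominal parameter count of a "70B" model
N_LAYERS = 80        # layer count assumed from Llama-2-70B
VRAM_GB = 24.0       # RTX 3090
RESERVE_GB = 4.0     # guessed headroom for context (KV cache) and overhead
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7}  # approximate averages

for name, bpw in BITS_PER_WEIGHT.items():
    size_gb = quant_size_gb(N_PARAMS, bpw)
    on_gpu = layers_that_fit(size_gb, N_LAYERS, VRAM_GB, RESERVE_GB)
    print(f"{name}: ~{size_gb:.0f} GB, about {on_gpu} of {N_LAYERS} layers on the GPU")
```

With those assumed numbers, the 4-bit file comes to roughly 42GB with about 38 layers on the GPU, and the 5-bit file to roughly 50GB with about 32 layers. Either way more than half the model sits in system RAM, so treat the layer count as a starting guess and adjust it against what actually fits.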