I want to run a 70B LLM locally with more than 1 T/s. I have a 3090 with 24GB VRAM and 64GB RAM on the system.
What I managed so far:
- Found instructions to run 70B entirely in VRAM with a ~2.5-bit quant that ran fast, but the perplexity was unbearable. The LLM was barely coherent.
- Somehow got 70B running with a mix of RAM/VRAM offloading, but it ran at 0.1 T/s.
I saw people claiming reasonable T/s speeds. Since I am a newbie, I barely speak the domain language, and most instructions I found assume implicit knowledge I don't have.
I need explicit instructions on which 70B model to download exactly, which model loader to use, and how to set the parameters that matter in this context.
Have you tried with FP4 & RAM offloading combined?
If you’re only getting 0.1 then you’ve probably overshot your layer offloading.
I can get up to 1.5 t/s with a 3090, at 5_K_M
Try running Llama.cpp from the command line with 30 layers offloaded to the GPU, and make sure your thread count is set to match your (physical) CPU core count.
The other problem you're likely running into is that 64 GB of RAM is cutting it pretty close. Make sure your base OS usage is below 8 GB if possible, and try memory-locking the model on load. With that amount of system RAM, it's possible other applications are running and causing the OS to page the model data out to disk, which kills performance.
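As a concrete starting point, a minimal llama.cpp invocation along the lines of the advice above might look like this. This is a sketch, not gospel: the model path is a placeholder, `main` is the classic llama.cpp binary name, and exact flag spellings can differ between versions.

```shell
# Placeholder model path: point at whichever 70B GGUF you downloaded.
# -ngl:    layers offloaded to the GPU (tune to fit your 24 GB card)
# -t:      physical CPU core count (not hyperthreads)
# --mlock: pin the model in RAM so the OS can't page it out to disk
./main -m ./models/70b.Q5_K_M.gguf -ngl 30 -t 8 --mlock -p "Hello"
```

If it runs out of VRAM, lower `-ngl`; if you have headroom, raise it a few layers at a time.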
> that 64gb of RAM is cutting it pretty close
Holy crap…
Yeah… I thought I'd be at least "in the room" buying my setup last year, but it turns out I'm outside in the gutter 🫣😢
Thank you. What does "at 5_K_M" mean?
Can I use the text web UI with Llama.cpp as the model loader, or is this too much overhead?
I actually don't know how much overhead that's going to be. I'd start by just kicking it off on the command line first as a proof of concept; it's super easy.
5_K_M is just the quantization I use. There's almost no loss of perplexity with 5_K_M, but it's also larger than 4-bit, which is what most people use.
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| --- | --- | --- | --- | --- | --- |
| goat-70b-storytelling.Q2_K.gguf | Q2_K | 2 | 29.28 GB | 31.78 GB | smallest, significant quality loss - not recommended for most purposes |
| goat-70b-storytelling.Q3_K_S.gguf | Q3_K_S | 3 | 29.92 GB | 32.42 GB | very small, high quality loss |
| goat-70b-storytelling.Q3_K_M.gguf | Q3_K_M | 3 | 33.19 GB | 35.69 GB | very small, high quality loss |
| goat-70b-storytelling.Q3_K_L.gguf | Q3_K_L | 3 | 36.15 GB | 38.65 GB | small, substantial quality loss |
| goat-70b-storytelling.Q4_0.gguf | Q4_0 | 4 | 38.87 GB | 41.37 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| goat-70b-storytelling.Q4_K_S.gguf | Q4_K_S | 4 | 39.07 GB | 41.57 GB | small, greater quality loss |
| goat-70b-storytelling.Q4_K_M.gguf | Q4_K_M | 4 | 41.42 GB | 43.92 GB | medium, balanced quality - recommended |
| goat-70b-storytelling.Q5_0.gguf | Q5_0 | 5 | 47.46 GB | 49.96 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| goat-70b-storytelling.Q5_K_S.gguf | Q5_K_S | 5 | 47.46 GB | 49.96 GB | large, low quality loss - recommended |
| goat-70b-storytelling.Q5_K_M.gguf | Q5_K_M | 5 | 48.75 GB | 51.25 GB | large, very low quality loss - recommended |
| goat-70b-storytelling.Q6_K.gguf | Q6_K | 6 | 56.59 GB | 59.09 GB | very large, extremely low quality loss |
| goat-70b-storytelling.Q8_0.gguf | Q8_0 | 8 | 73.29 GB | 75.79 GB | very large, extremely low quality loss - not recommended |
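As a sanity check on which quant fits, you can do the back-of-envelope math from the file sizes in the table. This is a rough sketch only: the 80-layer count for a 70B model and the 2 GB VRAM overhead for CUDA context and KV cache are assumptions, not measurements.

```python
import math

def layers_on_gpu(model_size_gb, n_layers=80, vram_gb=24.0, overhead_gb=2.0):
    """Rough estimate of how many transformer layers fit in VRAM.

    Assumes layers are roughly equal in size and that overhead_gb of
    VRAM is reserved for CUDA context + KV cache (both assumptions).
    """
    per_layer_gb = model_size_gb / n_layers
    return min(n_layers, math.floor((vram_gb - overhead_gb) / per_layer_gb))

# Q5_K_M from the table: 48.75 GB file on a 24 GB card
print(layers_on_gpu(48.75))  # → 36
```

So on a 24 GB card a Q5_K_M 70B should take somewhere in the mid-30s of layers, which lines up with the "-ngl 30"-ish numbers people quote for a 3090.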
I would suggest you use Koboldcpp and run GGUF. A 70B Q5 model, with around 40 layers loaded onto the GPU, should get more than 1 t/s. At least for me, I got 1.5 t/s with a 4090 and 64 GB RAM using Q5_K_M.
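For reference, a Koboldcpp launch matching that suggestion might look like the following. Treat it as a sketch: the model filename is a placeholder, and flag names can vary between Koboldcpp releases.

```shell
# --usecublas:   enable CUDA offload on Nvidia cards
# --gpulayers:   roughly the layer count suggested above; tune to your VRAM
# --contextsize: more context costs more RAM/VRAM
python koboldcpp.py --model ./models/70b.Q5_K_M.gguf --usecublas --gpulayers 40 --contextsize 4096
```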
I could never get it up and running on Linux with Nvidia. I used Kobold on Windows, but boy is it painful on Linux.
Well, I have never used Linux before, since the main purpose of my PC is gaming. But I heard running LLMs on Linux is overall faster.
It is… but koboldcpp doesn't have an executable for me to run :/
I don't know what you were running into, but I'm running Pop!_OS 22.04 (a modified version of Ubuntu) as my OS with a 3090, and for everything I have tried I just follow the basic install instructions on the home page and it works. Ooga booga, Automatic1111, Tortoise TTS, Whisper STT, Bark, Kobold, etc. I just follow the "run these commands" Linux instructions and everything is groovy.
I'm on Pop, lol. I could get it to compile, but I must have missed a step for Nvidia acceleration.