Hi. I’m currently running a 3060 12GB | R7 2700X | 32GB 3200 | Windows 10 with the latest NVIDIA drivers (VRAM-to-RAM overflow disabled). Loading a 20B-Q4_K_M model (50/65 layers offloaded seems to be the fastest in my tests), I currently get around 0.65 t/s with a low context size of 500 or less, and about 0.45 t/s nearing the max 4096 context.
Are these values what is expected of my setup? Or is there something I can do to improve speeds without changing the model?
It’s pretty much unusable in this state, and since it’s hard to find information about this topic I figured I would ask here.
EDIT: I’m running the model on the latest version of text-generation-webui.
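For a rough sense of why not all 65 layers fit on a 12GB card, here is a back-of-the-envelope sketch in Python. It assumes roughly 4.85 bits per weight for Q4_K_M (an approximate figure), a nominal 20e9 parameters, and equally sized layers. It ignores the KV cache, compute buffers, and whatever VRAM Windows itself holds. The function name is made up for illustration.

```python
def estimate_weights_gib(params_billion, bits_per_weight, n_layers, n_offloaded):
    """Rough size in GiB of the weights for n_offloaded out of n_layers layers."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    per_layer_bytes = total_bytes / n_layers  # assumption: all layers are the same size
    return n_offloaded * per_layer_bytes / 2**30


# Assumed values: 20e9 parameters, ~4.85 bits per weight, 65 layers.
full_model_gib = estimate_weights_gib(20, 4.85, 65, 65)
offloaded_gib = estimate_weights_gib(20, 4.85, 65, 50)

print(f"all 65 layers: {full_model_gib:.1f} GiB")
print(f"50 layers:     {offloaded_gib:.1f} GiB")
```

Under these assumptions the weights alone come to roughly 11 GiB for the full model and a bit under 9 GiB for 50 layers, which leaves little headroom for context on a 12GB card.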
Some of my results:
System:
2x Xeon E5-2680 v4, 28 cores total, 56 threads (HT), 128 GB RAM
RTX 2060 6 GB via PCIe 3.0 x16
RTX 4060 Ti 16 GB via PCIe 4.0 x8
Windows 11 Pro
OpenHermes-2.5-AshhLimaRP-Mistral-7B (llama.cpp in text generation UI):
Q4_K_M, RTX 2060 6 GB, all 35 layers offloaded, 8k context - approx 3 t/s
Q5_K_M, RTX 4060 Ti 16 GB, all 35 layers offloaded, 32k context - approx 25 t/s
Q5_K_M, CPU-only, 8 threads, 32k context - approx 2.5-3.5 t/s
Q5_K_M, CPU-only, 16 threads, 32k context - approx 3-3.5 t/s
Q5_K_M, CPU-only, 32 threads, 32k context - approx 3-3.6 t/s
euryale-1.3-l2-70b (llama.cpp in text generation UI):
Q4_K_M, RTX 2060 + RTX 4060 Ti, 35 layers offloaded, 4k context - 0.6-0.8 t/s
goliath-120 (llama.cpp in text generation UI):
Q2_K, CPU-only, 32 threads - 0.4-0.5 t/s
Q2_K, CPU-only, 8 threads - 0.25-0.3 t/s
Noromaid-20b-v0.1.1 (llama.cpp in text generation UI):
Q5_K_M, RTX 2060 + RTX 4060 Ti, 65 layers offloaded, 4k context - approx 5 t/s
Noromaid-20b-v0.1.1 (exllamav2 in text generation UI):
3bpw-h8-exl2, RTX 2060 + RTX 4060 Ti, 8-bit cache, 4k context - approx 15 t/s (looks like it fits entirely on the 4060 Ti)
6bpw-h8-exl2, RTX 2060 + RTX 4060 Ti, 8-bit cache, 4k context, no flash attention, GPU split 12,6 - approx 10 t/s
Observations:
- the number of cores in CPU-only mode matters very little
- NUMA does matter (I have 2 CPU sockets)
I would say: try to get an additional card?
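To put a number on the thread-count observation, here is a small sketch that takes the midpoints of the CPU-only ranges reported above and computes the speedup from 8 to 32 threads. Using midpoints is a simplification chosen here for illustration, and the helper names are made up.

```python
def midpoint(low, high):
    """Middle of a reported t/s range."""
    return (low + high) / 2


def scaling(threads_a, tps_a, threads_b, tps_b):
    """Return (speedup, efficiency) when going from threads_a to threads_b."""
    speedup = tps_b / tps_a
    efficiency = speedup / (threads_b / threads_a)  # 1.0 would be perfect scaling
    return speedup, efficiency


# Figures taken from the CPU-only results listed above.
mistral = scaling(8, midpoint(2.5, 3.5), 32, midpoint(3.0, 3.6))
goliath = scaling(8, midpoint(0.25, 0.3), 32, midpoint(0.4, 0.5))

print(f"Mistral-7B Q5_K_M: {mistral[0]:.2f}x speedup, {mistral[1]:.0%} efficiency")
print(f"goliath-120 Q2_K:  {goliath[0]:.2f}x speedup, {goliath[1]:.0%} efficiency")
```

With four times the threads, the 7B model gains about 10% and the 120B model well under 2x, which fits with generation being limited by something other than core count on this machine.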
I have an older 6GB 1660 and get like 0.3 t/s on a q2 quant of Goliath 120B. I guess I’m just thinking that comparatively your setup with a 20B model should be faster than that but I’m sure I’m missing something. I guess with offloading, the CPU plays a role as well. How many cores ya got?
It’s been a couple of months since I used less-than-complete GPU offloading. When I was using my Alienware laptop (i7 8th gen, 2060 6GB) to run 13B models with 13/25 layers offloaded I was getting 1-2 t/s, so yours sounds low.
I have a 3080 12GB, and can run a 20B-Q4_K_M with about 50 layers offloaded and 8k context.
It starts off at just under 4 t/s, and once the context is filled it gets as slow as just over 2 t/s.
It might be worth setting up a Linux partition to boot into for this; I was getting much slower speeds under Windows.
That might be worth a try actually, I’ll look into it, thanks.
R5 5500 (at stock 3600 MHz) | 3060 12GB | 32GB 3600, Win10 v2004.
I’m using LM Studio for heavy models: 34b (q4_k_m) and 70b (q3_k_m), both GGUF.
On 70b I’m getting around 1-1.4 t/s depending on context size (4k max).
I’m offloading 25 layers to the GPU (trying not to exceed the 11GB mark of VRAM).
On 34b I’m getting around 2-2.5 t/s depending on context size (4k max).
I’m offloading 30 layers to the GPU (trying not to exceed the 11GB mark of VRAM).
On 20b I was getting around 4-5 t/s, though I’m not a big user of 20b right now. So I can recommend LM Studio for models heavier than 13b; it works better for me.
Here is a 34b Yi Chat generation speed: [screenshot not preserved]
“By loading a 20B-Q4_K_M model (50/65 layers offloaded seems to be the fastest from my tests) I currently get around 0.65 t/s with a low context size of 500 or less, and about 0.45 t/s nearing the max 4096 context.”
Sounds suspicious. I use Yi-Chat-34b-Q4_K_M on an old 1080 Ti (11GB VRAM) with 20 layers offloaded and get around 2.5 t/s. That is on a Threadripper 2920 with 4-channel RAM (also 3200), but I don’t think it would make that much difference. Of course, with 4 channels I have twice your RAM bandwidth, but I run a 34b and load only 20 layers on the GPU…
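The bandwidth comparison can be made concrete with a small sketch. It uses theoretical peak DDR4 figures (8 bytes per transfer per channel) and assumes that every generated token requires reading each CPU-resident weight once, so real throughput will be lower. The 2.8 GB figure for the CPU-resident part is an assumed round number (15 of 65 layers of a roughly 12 GB 20B-Q4_K_M file), not a measurement, and the function names are made up.

```python
def ddr_peak_gb_per_s(mega_transfers_per_s, channels, bytes_per_transfer=8):
    """Theoretical peak bandwidth of a DDR memory setup in GB/s."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


def cpu_side_ceiling_tps(bandwidth_gb_per_s, cpu_resident_gb):
    """Upper bound on t/s if RAM reads of the CPU-resident weights were the only cost."""
    return bandwidth_gb_per_s / cpu_resident_gb


dual_channel = ddr_peak_gb_per_s(3200, channels=2)  # e.g. a desktop with DDR4-3200
quad_channel = ddr_peak_gb_per_s(3200, channels=4)  # e.g. a 4-channel Threadripper

# Assumed: about 2.8 GB of weights stay in system RAM with 50/65 layers offloaded.
ceiling = cpu_side_ceiling_tps(dual_channel, 2.8)

print(f"dual channel: {dual_channel:.1f} GB/s, quad channel: {quad_channel:.1f} GB/s")
print(f"rough ceiling for the CPU-resident part: {ceiling:.0f} t/s")
```

Even as a loose upper bound, the dual-channel ceiling lands far above the 0.65 t/s reported, which suggests RAM bandwidth alone probably does not explain the slowdown.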