I get 20 t/s with a 70B 2.5bpw model, but that is only 47% of the theoretical maximum of a 3090.
In comparison, the benchmarks on the exl2 GitHub homepage show 35 t/s, which is 76% of the theoretical maximum of a 4090.
The bandwidth difference between the two GPUs isn’t huge; the 4090 is only 7-8% higher.
Why? Is anyone else seeing a similar 20 t/s? I don’t think my CPU performance is the issue.
The benchmarks also show ~85% utilization for 34B at 4bpw (normal models).
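For anyone who wants to check the percentages, here is a rough sketch of the arithmetic. It assumes decoding is purely memory-bandwidth bound (every weight is read once per token) and uses spec-sheet bandwidths of 936 GB/s for the 3090 and 1008 GB/s for the 4090. Those bandwidth figures and the helper name `max_tokens_per_s` are my own assumptions, not something taken from the exl2 benchmarks.

```python
def max_tokens_per_s(bandwidth_gb_s, params_billion, bits_per_weight):
    """Upper bound on decode speed if every weight is read once per token."""
    weight_gb = params_billion * bits_per_weight / 8  # size of the weights in GB
    return bandwidth_gb_s / weight_gb

# Spec-sheet memory bandwidth in GB/s (assumed values, not measured)
BANDWIDTH = {"3090": 936.0, "4090": 1008.0}

# Measured speeds quoted in the thread: 20 t/s on a 3090, 35 t/s on a 4090
for gpu, measured in (("3090", 20.0), ("4090", 35.0)):
    ceiling = max_tokens_per_s(BANDWIDTH[gpu], 70, 2.5)
    print(f"{gpu}: ceiling {ceiling:.1f} t/s, measured {measured:.0f} t/s, "
          f"{measured / ceiling:.0%} of ceiling")
```

This ignores KV cache reads and kernel overhead, so it is only a ceiling, but it should land on roughly the 47% and 76% figures above.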
At 8k context with 2.4bpw I get 20 t/s, and VRAM usage reads 23.85/24.00 GB.
At 16k context with 2.4bpw I also get 20 t/s, using the FP8 cache.
About 0.5-0.6 GB is used for driving the monitor graphics on Ubuntu.
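A rough way to see why 16k with the FP8 cache fits about as well as 8k does: the KV cache grows linearly with context length and with bytes per value, so halving the precision pays for doubling the context. The sketch below assumes a Llama-2-70B style layout (80 layers, 8 KV heads with grouped-query attention, head dimension 128) and that the 8k run used an FP16 cache. Neither of those is stated in the thread, so treat the numbers as illustrative.

```python
def kv_cache_gib(context_len, bytes_per_value,
                 n_layers=80, n_kv_heads=8, head_dim=128):
    """Approximate KV cache size in GiB for one sequence.

    The defaults are an assumed Llama-2-70B style layout, not a value
    read from any particular model config.
    """
    # Keys and values (factor 2) are stored per layer, per KV head, per token.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_len * bytes_per_token / 2**30

fp16_8k = kv_cache_gib(8192, bytes_per_value=2)   # 8k context, FP16 cache
fp8_16k = kv_cache_gib(16384, bytes_per_value=1)  # 16k context, FP8 cache

print(f"8k FP16 cache: {fp16_8k:.2f} GiB")
print(f"16k FP8 cache: {fp8_16k:.2f} GiB")
```

Under these assumptions both configurations need the same amount of cache memory, with the rest of the 24 GB going to the weights and runtime overhead.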
Did you disable the NVIDIA system memory fallback that they pushed on Windows users? That’s probably what you need.
Thanks for the detailed answer! Ubuntu does seem to be much more memory-efficient than Windows. However, the problem seemingly fixed itself overnight, and I’m no longer running into out-of-memory errors. The 8-bit cache is a godsend for VRAM efficiency.