I get 20 t/s with a 70B 2.5bpw model, but that's only about 47% of the theoretical maximum for a 3090.
In comparison, the benchmarks on the exl2 GitHub page show 35 t/s, which is about 76% of the theoretical maximum for a 4090.
The bandwidth difference between the two GPUs isn't huge; the 4090's is only 7-8% higher.
Why? Is anyone else stuck around 20 t/s? I don't think my CPU performance is the issue.
The benchmarks also show ~85% utilization for 34B at 4bpw (normal models).
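Here's the back-of-the-envelope math I'm basing those percentages on: token generation is memory-bandwidth-bound, so the ceiling is roughly bandwidth divided by the bytes of quantized weights read per token. A minimal Python sketch, assuming the published bandwidth specs (~936 GB/s for the 3090, ~1008 GB/s for the 4090) and ignoring KV-cache reads, so it's an upper bound:

```python
# Rough upper bound on tokens/s for bandwidth-bound inference: every generated
# token has to read all quantized weights once from VRAM.
def max_tps(params_b, bpw, bandwidth_gbps):
    bytes_per_token = params_b * 1e9 * bpw / 8   # weight bytes read per token
    return bandwidth_gbps * 1e9 / bytes_per_token

# Assumed published specs: 3090 ~936 GB/s, 4090 ~1008 GB/s
print(max_tps(70, 2.5, 936))    # ~42.8 t/s -> my 20 t/s is ~47%
print(max_tps(70, 2.5, 1008))   # ~46.1 t/s -> the benchmark 35 t/s is ~76%
```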
Hi! What are your settings in Ooba to get this to work? On Windows 11 with a single 3090, I keep getting a CUDA out-of-memory error trying to load a 2.4bpw 70B model with just 4k context. It's annoying because this used to work, but after a recent update it just won't load anymore.
8k context with 2.4bpw at 20 t/s; VRAM usage reads 23.85/24.00 GB.
16k context with 2.4bpw at 20 t/s with the FP8 cache.
I have 0.5-0.6 GB used for driving the monitor graphics on Ubuntu.
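To sanity-check why 8k fits with the FP16 cache but 16k needs FP8, here's a rough VRAM estimate. It assumes Llama-2-70B geometry (80 layers, 8 KV heads via GQA, head dim 128) and ignores activation buffers and CUDA context overhead, so real usage lands a bit higher:

```python
# Rough single-GPU VRAM budget for a 70B exl2 model. Assumed geometry:
# 80 layers, 8 KV heads (GQA), head_dim 128. Overhead buffers not counted.
def vram_gb(params_b=70, bpw=2.4, ctx=8192, layers=80, kv_heads=8,
            head_dim=128, cache_bytes=2):          # cache_bytes: 2=FP16, 1=FP8
    weights = params_b * 1e9 * bpw / 8
    kv_cache = ctx * layers * 2 * kv_heads * head_dim * cache_bytes  # K and V
    return (weights + kv_cache) / 1e9

print(vram_gb(ctx=8192,  cache_bytes=2))   # ~23.7 GB -> matches 23.85/24.00
print(vram_gb(ctx=16384, cache_bytes=1))   # ~23.7 GB -> why FP8 allows 16k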
Did you disable the NVIDIA system memory fallback that they pushed on Windows users? That's probably what you need to do.
Thanks for the detailed answer! Ubuntu does seem to be much more memory-efficient than Windows. However, the problem just fixed itself seemingly overnight, and now I'm not running into out-of-memory errors. The 8-bit cache is a godsend for VRAM efficiency.
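In case anyone else is hunting for the 8-bit cache: I enable it through the Ooba loader options, but if you want to reproduce it with the exllamav2 Python API directly, I believe it looks roughly like the sketch below (class names taken from recent exllamav2 versions and may differ; the model path is a placeholder):

```python
# Minimal sketch: load an exl2 model with the 8-bit KV cache via exllamav2.
from exllamav2 import (ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit,
                       ExLlamaV2Tokenizer)
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/model-2.4bpw-exl2"   # placeholder path
config.prepare()
config.max_seq_len = 16384                        # 16k context

model = ExLlamaV2(config)
model.load()                                      # single 3090, no tensor split

cache = ExLlamaV2Cache_8bit(model)                # 8-bit KV cache halves cache VRAM
tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
print(generator.generate_simple("Hello,", settings, num_tokens=32))
```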