The title, pretty much.
I’m wondering whether a 70b model quantized to 4bit would perform better than a 7b/13b/34b model at fp16. Would be great to get some insights from the community.
Well… none at all if you’re happy with ~1 token per second or less using GGUF CPU inference.
I have 1 x 3090 24GB and get about 2 tokens per second with partial offload. I find that usable for most stuff, but many people find it too slow.
You’d need 2 x 3090s or an A6000 (48GB) or something similar to run it quickly; the back-of-envelope math below shows why.
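For what it's worth, here's the rough math behind those card recommendations. This only counts the weights; the KV cache and runtime overhead add a few more GB on top:

```python
# Rough VRAM needed just for the model weights, ignoring KV cache and overhead.
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

for name, params, bits in [
    ("7b fp16",   7, 16),
    ("13b fp16", 13, 16),
    ("34b fp16", 34, 16),
    ("70b 4bit", 70, 4.5),  # Q4_K_M-style quants land around ~4.5 bits/weight
]:
    print(f"{name}: ~{weight_gb(params, bits):.0f} GB")

# ~13 GB, ~24 GB, ~63 GB, ~37 GB respectively: a 70b 4bit quant overflows a
# single 24GB 3090 (hence partial offload) but fits on 2 x 3090 or a 48GB A6000.
```

And the partial offload I mentioned looks roughly like this, as a sketch assuming llama-cpp-python and a placeholder GGUF filename; `n_gpu_layers` is the knob that decides how many layers sit on the 3090 versus CPU RAM:

```python
from llama_cpp import Llama

# Placeholder model path; raise n_gpu_layers until the card is nearly full.
llm = Llama(model_path="llama-2-70b.Q4_K_M.gguf", n_gpu_layers=40, n_ctx=2048)
out = llm("Explain 4-bit quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```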