QuIP# is a novel quantization method; its 2-bit quality is better than anything previously available.
Repository: https://github.com/Cornell-RelaxML/quip-sharp
Blog post: https://cornell-relaxm...
With Llama-2-70b-chat-E8P-2Bit from their model zoo, QuIP# seems fairly promising. I’d have to try l2-70b-chat in exl2 at 2.4 bpw to compare, but so far this model does not really feel like a 2-bit model; I’m impressed.
From the issue about this in the exllamav2 repo, QuIP# was using more memory and running slower than exl2. How much context can you fit?
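For a rough sense of the headroom left for context, here is a back-of-envelope sketch (my own numbers, not from either repo): it counts weight storage only and ignores the KV cache, activations, and per-layer quantization metadata, so real VRAM use will be higher.

```python
# Back-of-envelope weight-memory estimate for a ~70B-parameter model.
# Assumption: memory ~= params * bits-per-weight / 8; ignores KV cache,
# activations, and quantization metadata (codebooks, scales, etc.).
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB at a given bits-per-weight."""
    return n_params * bits_per_weight / 8 / 1024**3

llama2_70b = 70e9  # approximate parameter count

print(f"QuIP# 2-bit  : ~{weight_gib(llama2_70b, 2.0):.1f} GiB")  # ~16.3 GiB
print(f"exl2 2.4 bpw : ~{weight_gib(llama2_70b, 2.4):.1f} GiB")  # ~19.6 GiB
```

On paper the 2-bit weights leave a few extra GiB for context versus 2.4 bpw on the same card, but the actual difference depends on each backend's runtime overhead, which is what the exllamav2 issue was pointing at.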