So I’m considering building a proper LLM rig, and the M2 Ultra seems like a good option for large memory, with much lower power usage and heat than two to eight 3090s or 4090s, albeit at lower speeds.

I want to know if anyone is using one and what it’s like. I’ve read that it’s less well supported by software, which could be an issue. Also, is it any good for Stable Diffusion?

Another question is about memory and context length. Does a large memory pool let you increase the context length with smaller models whose weights don’t fill the memory? I feel a big context would be useful for writing books and other long-form work.
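Roughly, yes: the weights take a fixed amount of memory, and the KV cache backing the context grows linearly with context length, so whatever memory the weights don’t use can go toward a longer context. A back-of-the-envelope sketch, assuming a hypothetical Llama-2-7B-like layout (32 layers, 32 KV heads, head dim 128, fp16 cache); models with grouped-query attention or quantized caches need much less:

```python
# Back-of-the-envelope KV-cache sizing: the cache grows linearly with context.
# Assumed layout (Llama-2-7B-like, fp16 cache): 32 layers, 32 KV heads, head dim 128.
# These numbers are illustrative, not measurements.

def kv_cache_gib(context_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V per layer
    return context_len * per_token / (1024 ** 3)

for ctx in (4_096, 16_384, 65_536):
    print(f"{ctx:6d} tokens -> ~{kv_cache_gib(ctx):5.1f} GiB of KV cache")
# 4096 -> ~2 GiB, 16384 -> ~8 GiB, 65536 -> ~32 GiB, on top of the weights
```

Note that the usable context window is ultimately a property of the model (and any RoPE scaling applied), so spare memory only helps if the model can actually make use of the longer context.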

Is there anything else to consider? Thanks.

  • aikitoria@alien.topB

Is it not possible to port ExLlamaV2 to Metal? At least on a 4090, it’s much (much) faster at processing the input than llama.cpp.
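    For the llama.cpp side, a crude way to see how much wall-clock time goes to prompt processing (prefill) versus token generation is to time a long prompt with a single output token against a short prompt with a longer generation. A minimal sketch using the llama-cpp-python bindings; the model path, n_ctx, and n_gpu_layers values are placeholders for whatever setup is being tested.

    ```python
    # Rough prefill-vs-generation timing with llama-cpp-python.
    # Model path and loader options are placeholders; n_gpu_layers=-1 offloads
    # everything to the GPU (Metal on Apple Silicon, CUDA on a 4090).
    import time
    from llama_cpp import Llama

    llm = Llama(model_path="./mistral-7b-openorca.Q8_0.gguf",
                n_ctx=4096, n_gpu_layers=-1, verbose=False)

    long_prompt = "Summarize the following:\n" + ("Lorem ipsum dolor sit amet. " * 400)

    t0 = time.time()
    llm(long_prompt, max_tokens=1)           # time dominated by prompt processing
    prefill_s = time.time() - t0

    t0 = time.time()
    llm("Write one sentence about llamas.", max_tokens=128)  # dominated by generation
    gen_s = time.time() - t0

    print(f"prefill of long prompt: ~{prefill_s:.1f}s, 128-token generation: ~{gen_s:.1f}s")
    ```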

    • SomeOddCodeGuy@alien.topB

I imagine it would take a lot of work, but I can’t imagine it’s impossible. It’s probably just not something folks are working on.

I don’t particularly mind, because the quality difference between exl2 and gguf is hard for me to look past. Just last night I was trying to run the NeuralChat 7B everyone is talking about on my Windows machine as an 8bpw exl2, and it was SUPER fast, but the model was very easily confused. Before giving up on it, I grabbed the q8 gguf and swapped to it (with no other changes), and suddenly I saw why everyone says that model is so good.

I don’t mind losing speed if I get quality, but I can’t handle losing quality to gain speed. So for now I really don’t mind using only gguf; it’s perfect for me.

      • aikitoria@alien.topB

Hmm, I didn’t notice a major quality loss when I swapped from mistral-7b-openorca.Q8_0.gguf (running in koboldcpp) to Mistral-7B-OpenOrca-8.0bpw-h6-exl2 (running in text-gen-webui). Maybe I should try again. Are you sure you were using comparable sampling settings for both? I noticed, for example, that SillyTavern has entirely different presets per backend.
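        One way to take front-end presets out of the equation is to hit both backends’ HTTP APIs directly with identical sampler values. A rough sketch; the ports, endpoint paths, and some field names below are assumptions based on koboldcpp’s KoboldAI-style API and text-generation-webui’s OpenAI-compatible API, so verify them against your versions.

        ```python
        # Same prompt, same sampler values, both backends - any difference in output
        # should then come from the model files/quantization rather than the presets.
        import requests

        prompt = "Explain the difference between fp16 and q8_0 quantization in two sentences."
        samplers = {"temperature": 0.7, "top_p": 0.9, "top_k": 40}

        # koboldcpp serving the gguf (default port 5001, KoboldAI-style endpoint)
        kobold_text = requests.post("http://localhost:5001/api/v1/generate", json={
            "prompt": prompt, "max_length": 200, "rep_pen": 1.1, **samplers,
        }).json()["results"][0]["text"]

        # text-generation-webui serving the exl2 (OpenAI-compatible completions endpoint)
        webui_text = requests.post("http://localhost:5000/v1/completions", json={
            "prompt": prompt, "max_tokens": 200, "repetition_penalty": 1.1, **samplers,
        }).json()["choices"][0]["text"]

        print("koboldcpp/gguf:\n", kobold_text)
        print("\ntext-gen-webui/exl2:\n", webui_text)
        ```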

I still need to try the new NeuralChat myself; I was just going to go for the exl2, so this could be a good tip!