So I’m considering building a proper LLM rig, and the M2 Ultra seems like a good option for large memory, with much lower power usage and heat than two to eight 3090s or 4090s, albeit at lower speeds.

I want to know if anyone is using one and what it’s like. I’ve read that it’s less well supported by software, which could be an issue. Also, is it any good for Stable Diffusion?

Another question is about memory and context length. Does a large memory pool let you increase the context length with smaller models whose weights don’t fill the memory? I feel a big context would be useful for writing books and other long-form work.
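Roughly, yes: the weights take a fixed amount of memory, and the KV cache backing the context grows linearly with context length, so whatever memory the weights don’t use can go toward a longer context. A back-of-the-envelope sketch, assuming a hypothetical Llama-2-7B-like layout (32 layers, 32 KV heads, head dim 128, fp16 cache); models with grouped-query attention or quantized caches need much less:

```python
# Back-of-the-envelope KV-cache sizing: the cache grows linearly with context.
# Assumed layout (Llama-2-7B-like, fp16 cache): 32 layers, 32 KV heads, head dim 128.
# These numbers are illustrative, not measurements.

def kv_cache_gib(context_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V per layer
    return context_len * per_token / (1024 ** 3)

for ctx in (4_096, 16_384, 65_536):
    print(f"{ctx:6d} tokens -> ~{kv_cache_gib(ctx):5.1f} GiB of KV cache")
# 4096 -> ~2 GiB, 16384 -> ~8 GiB, 65536 -> ~32 GiB, on top of the weights
```

Note that the usable context window is ultimately a property of the model (and any RoPE scaling applied), so spare memory only helps if the model can actually make use of the longer context.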

Is there anything else to consider? Thanks.

  • aikitoria@alien.topB

Is it not possible to port ExLlamaV2 to Metal? At least on a 4090, it’s much (much) faster at processing the input than llama.cpp.
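    For the llama.cpp side, a crude way to see how much wall-clock time goes to prompt processing (prefill) versus token generation is to time a long prompt with a single output token against a short prompt with a longer generation. A minimal sketch using the llama-cpp-python bindings; the model path, n_ctx, and n_gpu_layers values are placeholders for whatever setup is being tested.

    ```python
    # Rough prefill-vs-generation timing with llama-cpp-python.
    # Model path and loader options are placeholders; n_gpu_layers=-1 offloads
    # everything to the GPU (Metal on Apple Silicon, CUDA on a 4090).
    import time
    from llama_cpp import Llama

    llm = Llama(model_path="./mistral-7b-openorca.Q8_0.gguf",
                n_ctx=4096, n_gpu_layers=-1, verbose=False)

    long_prompt = "Summarize the following:\n" + ("Lorem ipsum dolor sit amet. " * 400)

    t0 = time.time()
    llm(long_prompt, max_tokens=1)           # time dominated by prompt processing
    prefill_s = time.time() - t0

    t0 = time.time()
    llm("Write one sentence about llamas.", max_tokens=128)  # dominated by generation
    gen_s = time.time() - t0

    print(f"prefill of long prompt: ~{prefill_s:.1f}s, 128-token generation: ~{gen_s:.1f}s")
    ```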

    • SomeOddCodeGuy@alien.topB

I imagine it would take a lot of work, but I can’t imagine it’s impossible. It’s probably just not something folks are working on.

I don’t particularly mind, because the quality difference between exl2 and gguf is hard for me to look past. Just last night I was trying to run the NeuralChat 7B everyone is talking about on my Windows machine as an 8bpw exl2, and it was SUPER fast, but the model was very easily confused. Before giving up on it, I grabbed the q8 gguf and swapped to it (with no other changes), and suddenly I saw why everyone says that model is so good.

I don’t mind losing speed if I get quality, but I can’t handle losing quality to gain speed. So for now I really don’t mind using only gguf; it’s perfect for me.

      • aikitoria@alien.topB

Hmm, I didn’t notice a major quality loss when I swapped from mistral-7b-openorca.Q8_0.gguf (running in koboldcpp) to Mistral-7B-OpenOrca-8.0bpw-h6-exl2 (running in text-gen-webui). Maybe I should try again. Are you sure you were using comparable sampling settings for both? I noticed, for example, that SillyTavern has entirely different presets per backend.
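        One way to take front-end presets out of the equation is to hit both backends’ HTTP APIs directly with identical sampler values. A rough sketch; the ports, endpoint paths, and some field names below are assumptions based on koboldcpp’s KoboldAI-style API and text-generation-webui’s OpenAI-compatible API, so verify them against your versions.

        ```python
        # Same prompt, same sampler values, both backends - any difference in output
        # should then come from the model files/quantization rather than the presets.
        import requests

        prompt = "Explain the difference between fp16 and q8_0 quantization in two sentences."
        samplers = {"temperature": 0.7, "top_p": 0.9, "top_k": 40}

        # koboldcpp serving the gguf (default port 5001, KoboldAI-style endpoint)
        kobold_text = requests.post("http://localhost:5001/api/v1/generate", json={
            "prompt": prompt, "max_length": 200, "rep_pen": 1.1, **samplers,
        }).json()["results"][0]["text"]

        # text-generation-webui serving the exl2 (OpenAI-compatible completions endpoint)
        webui_text = requests.post("http://localhost:5000/v1/completions", json={
            "prompt": prompt, "max_tokens": 200, "repetition_penalty": 1.1, **samplers,
        }).json()["choices"][0]["text"]

        print("koboldcpp/gguf:\n", kobold_text)
        print("\ntext-gen-webui/exl2:\n", webui_text)
        ```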

I still need to try the new NeuralChat myself; I was just going to go for the exl2, so this could be a good tip!