• 0 Posts
  • 22 Comments
Joined 1 year ago
Cake day: October 30th, 2023

  • Apple Silicon Macs are great options for running LLMs, especially if you want to run a large LLM on a laptop. That said, there aren’t big performance differences between the M1 Max and M3 Max, at least not for text generation; prompt processing does show generational improvements. Maybe this will change in future versions of macOS if optimizations unlock better Metal Performance Shaders performance on later GPU generations, but for right now they are pretty similar.

    Apple Silicon Macs aren’t currently a great option for training/fine tuning models. There isn’t a lot of software support for GPU acceleration during training on Apple Silicon.


  • FlishFlashman@alien.top to LocalLLaMA · I have some questions

    If model size is a priority, Apple Silicon Macs (particularly used or factory-refurbished Mac Studio Ultras) provide good value in cost, available memory, and performance (e.g., 4,679 for 128GB, roughly 96GB of which is usable by the GPU for the model plus working data; see the sketch below). Workstation GPUs or multiple high-end consumer GPUs can be faster, but they are also more expensive, draw more power, need a bigger case, and are louder.

    Software options for doing training or fine-tuning on Macs using the GPU are limited at this point, but they will probably improve. This might also be something better done with a short-term rental of a cloud server.
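    As a rough illustration of the usable-memory math above, here’s a minimal Python sketch. The fractions are just the approximate limits reported in these comments (about 3/4 of RAM on larger machines, about 2/3 on smaller ones), and the 36GB cutoff is my guess, not anything from Apple documentation.

    ```python
    def approx_gpu_usable_gb(total_ram_gb: float) -> float:
        """Rough guess at how much unified memory macOS lets the GPU use.

        Assumed fractions (taken from these comments, not official docs):
        ~2/3 of RAM on smaller machines, ~3/4 on larger ones.
        """
        fraction = 0.75 if total_ram_gb > 36 else 2 / 3
        return total_ram_gb * fraction

    for ram in (32, 64, 128, 192):
        print(f"{ram:>3}GB RAM -> ~{approx_gpu_usable_gb(ram):.1f}GB usable by the GPU")
    ```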



  • What quantization are you using? Smaller tends to be faster.

    I get 30 tokens/s with a q4_0 quantization of 13B models on an M1 Max in Ollama (which uses llama.cpp). You should be in the same ballpark with the same software. You aren’t going to do much, if any, better than that. The M3’s GPU made some significant leaps for graphics, and little to nothing for LLMs.

    Allowing more threads isn’t going to help generation speed, though it might improve prompt processing. It’s probably best to keep the number of threads at the number of performance cores.
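    To sanity-check generation speed on your own machine, here’s a minimal sketch using the llama-cpp-python bindings. The model path is a placeholder, n_threads is set to a performance-core count as suggested above, and the timing includes prompt processing, so treat the result as a rough number.

    ```python
    import time

    from llama_cpp import Llama  # pip install llama-cpp-python

    # Placeholder path to a q4_0-quantized 13B GGUF file.
    llm = Llama(
        model_path="models/llama-2-13b.Q4_0.gguf",
        n_gpu_layers=-1,  # offload all layers to the Metal GPU
        n_threads=8,      # e.g. the 8 performance cores on an M1 Max
    )

    start = time.time()
    out = llm("Explain unified memory in one paragraph.", max_tokens=200)
    elapsed = time.time() - start

    generated = out["usage"]["completion_tokens"]
    print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/s")
    ```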


  • When I load yarn-mistral-64k in Ollama (which uses llama.cpp) on my 32GB Mac, it allocates 16.359GB for the GPU. I don’t remember how much the 128k-context version needs, but it was more than the 21.845GB macOS allows for the GPU’s use on a 32GB machine (rough context-memory math below). You aren’t going to get very far on a 16GB machine.

    Maybe if you don’t send any layers to the GPU and force it to use the CPU you could eke out a little more. On Apple Silicon, CPU-only inference seems to be about a 50% hit compared to GPU speeds, if I remember right.
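    Most of that allocation is the KV cache, which grows linearly with context length. Here’s a back-of-the-envelope sketch; the layer/head numbers are what I’d assume for a Mistral-7B-style model, and a real run adds the model weights and compute buffers on top of this.

    ```python
    def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
        """Approximate fp16 KV-cache size: keys + values for every layer and position."""
        total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
        return total_bytes / 2**30

    # Assumed Mistral-7B-style shape: 32 layers, 8 KV heads, head dim 128.
    for ctx in (8_192, 65_536, 131_072):
        print(f"{ctx:>7}-token context -> ~{kv_cache_gib(32, 8, 128, ctx):.0f} GiB of KV cache")
    ```

    Under those assumptions a 128k context alone wants roughly 16GiB of cache before the weights are counted, which is consistent with the 128k variant blowing past the GPU limit on a 32GB machine.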






  • I think part of the answer is that RAM uses more power than you’d think when it’s running near full tilt, as it is during generation. Micron’s advice is to figure 3W per 8GB for DDR4, and more than that for the highest-performance parts (rough numbers below). The fact that the RAM is on-package probably offsets that somewhat, but it’s still more than single-digit watts.

    Power consumption on my 24-core-GPU M1 Max is similar to yours, though somewhat lower, as you’d expect, according to both iStat Menus and Stats.app.

    There is also the question of how accurate they are.
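    To put that rule of thumb in numbers, a trivial sketch (the 3W-per-8GB figure is the DDR4 guideline mentioned above; on-package LPDDR5 will differ, so these are only ballpark values):

    ```python
    WATTS_PER_8GB = 3.0  # DDR4 rule of thumb cited above; LPDDR5 on package will differ

    for ram_gb in (32, 64, 96, 128):
        watts = ram_gb / 8 * WATTS_PER_8GB
        print(f"{ram_gb:>3}GB of RAM -> roughly {watts:.0f}W under heavy load")
    ```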



  • Apple is increasing differentiation among their chips. Previously the Pro and Max differed primarily in GPU cores. Now they are also differentiated in CPU cores and memory bandwidth.

    I was disappointed to see that the M3 Max’s memory bandwidth is, on paper, the same as the M2 Max’s, but I’m also mindful of the fact that no single functional unit was able to use all the available memory bandwidth in the first place, so I hope the M3 will allow higher utilization (rough bandwidth math below).

    We’ll see once people get their hands on them.
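    One reason the bandwidth number matters so much for LLMs: during generation the model weights have to be read once per token, so memory bandwidth sets a ceiling on tokens/s. A rough sketch, using figures that are my own assumptions (~400GB/s for a top-spec M2/M3 Max, ~7.4GB of weights for a q4_0 13B model):

    ```python
    def tokens_per_s_ceiling(bandwidth_gb_s: float, model_size_gb: float) -> float:
        """Upper bound on generation speed if the weights are streamed once per token."""
        return bandwidth_gb_s / model_size_gb

    # Assumed figures: ~400GB/s memory bandwidth, ~7.4GB of q4_0 13B weights.
    print(f"~{tokens_per_s_ceiling(400, 7.4):.0f} tokens/s ceiling")
    ```

    Against the ~30 tokens/s people actually see, that would mean the GPU only gets a bit over half the theoretical bandwidth, which fits the point about no single functional unit saturating it.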