It’s not going to help because the model data is much larger than the cache and the access pattern is basically long sequential reads.
A chip that won’t be available for ~6 months will be better than a chip that came out a year ago? Amazing ;)
GPT4All is similar to LM Studio, but includes the ability to load a document library and generate text against it.
180GB/s isn’t really all that fast.
It’s a lot cheaper than paying humans to spread propaganda.
Consider that the audience isn’t you, it’s people who lack discernment. It’s like those scam emails. People with good judgement delete them.
The other audience is engagement algorithms.
That’s probably the argument for all cloud architecture.
Long-term cost and risk might be persuasive, but they haven’t swayed IT managers away from cloud for non-LLM-specific infrastructure thus far.
It’s 2023. What are you talking about? Where have you been?
It doesn’t
Apple Silicon Macs are great options for running LLMs, especially if you want to run a large LLM on a laptop. That said, there aren’t big performance differences between the M1 Max and M3 Max, at least not for text generation; prompt processing does show generational improvements. Maybe this will change in future versions of macOS if optimizations unlock better Metal Performance Shaders performance on later GPU generations, but for right now they are pretty similar.
Apple Silicon Macs aren’t currently a great option for training/fine tuning models. There isn’t a lot of software support for GPU acceleration during training on Apple Silicon.
If model size is a priority, Apple Silicon Macs (particularly used or factory-refurbished Mac Studio Ultras) provide good value in cost, available memory, and performance: $4,679 for 128GB, with 96GB usable by the GPU for the model plus working data. A workstation GPU or multiple high-end consumer GPUs can be faster, but also more expensive, higher in power consumption, in a bigger case, and louder.
Software options for doing training or fine tuning on Macs using GPU are limited at this point, but will probably improve. This might also be something better done with short term rental of a cloud server.
What are you using to run them?
In any case, larger context models require *a lot* more RAM/VRAM.
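As a rough sketch of why, the KV cache grows linearly with context length. The model-shape numbers below are LLaMA-2-13B’s published config (40 layers, 40 heads, head dim 128), used here for illustration, and this assumes an fp16 cache with no grouped-query attention:

```python
# Rough KV-cache size estimate (assumes fp16 cache, no GQA).
# Model-shape numbers are LLaMA-2-13B's published config; illustrative only.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # 2x for the K and V tensors, one entry per layer/head/position.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

gib = 1024 ** 3
for ctx in (4096, 32768, 65536):
    size = kv_cache_bytes(n_layers=40, n_kv_heads=40, head_dim=128, ctx_len=ctx)
    print(f"{ctx:>6} tokens: {size / gib:.2f} GiB")
```

At 4k context that’s about 3 GiB on top of the weights; at 64k it balloons to roughly 50 GiB, which is why long-context variants need so much more memory.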
What quantization are you using? Smaller tends to be faster.
I get 30 tokens/s with a q4_0 quantization of 13B models on an M1 Max in Ollama (which uses llama.cpp). You should be in the same ballpark with the same software, and you aren’t going to do much, if any, better than that. The M3’s GPU made significant leaps for graphics, and little to none for LLMs.
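If you want apples-to-apples numbers, Ollama’s API reports `eval_count` and `eval_duration` (in nanoseconds) per request, and tokens/s falls out directly. The field names are Ollama’s; the sample values below are placeholders, not a real measurement:

```python
# Tokens/s from the eval_count / eval_duration fields Ollama reports
# (eval_duration is in nanoseconds). Sample values are placeholders.
def tokens_per_second(eval_count, eval_duration_ns):
    return eval_count / (eval_duration_ns / 1e9)

# e.g. 300 tokens generated over 10 seconds -> 30.0 tokens/s
print(tokens_per_second(300, 10_000_000_000))
```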
Allowing more threads isn’t going to help generation speed, though it might improve prompt processing. It’s probably best to keep the number of threads at or below the number of performance cores.
When I load yarn-mistral-64k in Ollama (which uses llama.cpp) on my 32GB Mac, it allocates 16.359GB for the GPU. I don’t remember how much the 128k-context version needs, but it was more than the 21.845GB macOS allows for the GPU’s use on a 32GB machine. You aren’t going to get very far on a 16GB machine.
Maybe if you don’t send any layers to the GPU and force it to use the CPU you could eke out a little more. On Apple Silicon, CPU-only inference seems to be about a 50% hit relative to GPU speeds, if I remember right.
What software are you using to run LLaMA and Stable Diffusion?
What version of the LLaMA model are you trying to run? How many parameters? What quantization?
It seems like this approach could also be useful in situations where the goal isn’t speed but rather “quality” (by a variety of metrics).
Please get specific. What’s “quite slow”? What’s “extremely quickly”? Use numbers, with units that include time.
What hardware are you running on? Without changing hardware, your best bet is a smaller model (in terms of parameters), a smaller quantization of a 13B model, or both.
A game’s GPU workload probably hits the cache. An LLM’s really won’t, since generating each token involves reading all of the model data.
I think part of the answer is that RAM uses more power than you might think when it’s running near full tilt, as it is during generation. Micron’s advice is to figure 3W per 8GB for DDR4, and more than that for the highest-performance parts. The fact that the RAM is on-package probably offsets that somewhat, but it’s still more than single digits.
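Back-of-the-envelope, using that ~3W-per-8GB DDR4 figure (treating the on-package LPDDR as comparable, which is an assumption, and likely an overestimate):

```python
# Rough DRAM power estimate from Micron's ~3 W per 8 GB DDR4 rule of thumb.
# On-package LPDDR likely draws less, so treat this as an upper-bound sketch.
def ram_watts(total_gb, watts_per_8gb=3.0):
    return total_gb / 8 * watts_per_8gb

for gb in (32, 64, 128):
    print(f"{gb:>3} GB: ~{ram_watts(gb):.0f} W")  # 32 -> 12 W, 128 -> 48 W
```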
Power consumption on my 24-core-GPU M1 Max is similar to yours, though somewhat lower, as you’d expect, according to both iStat Menus and Stats.app.
There is also the question of how accurate they are.
LLVM or LLM?
Apple is increasing differentiation among their chips. Previously the Pro and Max differed primarily in GPU cores; now they are also differentiated in CPU cores and memory bandwidth.
I was disappointed to see that the M3 Max’s memory bandwidth is, on paper, the same as the M2 Max’s. But I’m also mindful that no single functional unit was able to use all the available memory bandwidth in the first place, so I hope the M3 will allow higher utilization.
We’ll see once people get their hands on them.
On ≥64GB machines, 75% of RAM can be used by the GPU; on ≤32GB machines it’s ~66%. Not sure about the 36GB machines.
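Those ratios can be sketched as a quick helper. The 75% and ~2/3 splits match the numbers above (a 128GB machine gives the GPU 96GB; a 32GB machine gives it 21,845 MiB, i.e. the 21.845GB figure macOS reports); the 36GB tier is unknown to me, so it’s deliberately left unhandled:

```python
# macOS GPU working-set limit per the ratios above:
# >= 64 GiB RAM: 75% usable by the GPU; <= 32 GiB: ~2/3.
# The 36 GiB tier is unknown (to me), so it's deliberately unhandled.
def gpu_usable_mib(total_gib):
    total_mib = total_gib * 1024
    if total_gib >= 64:
        return total_mib * 3 // 4
    if total_gib <= 32:
        return total_mib * 2 // 3
    raise ValueError("ratio unknown for this memory size")

print(gpu_usable_mib(128))  # 98304 MiB = 96 GiB
print(gpu_usable_mib(32))   # 21845 MiB
```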