WhereIsYourMind@alien.top to LocalLLaMA • Macs with 32GB of memory can run 70B models with the GPU.
1 year ago

I can run Q4 Falcon-180B on my M3 Max (40 GPU) with 128GB RAM. I get 2.5 t/s, it’s crazy for a mobile chip.
I have the M3 Max with 128GB memory / 40 GPU cores.
You have to override a system limit to allocate more than 75% of the total SoC memory (128GB × 0.75 = 96GB) to the GPU. I increased it to 90% (~115GB) and can run Falcon-180B Q4_K_M at 2.5 tokens/s.
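For anyone wanting to try this: one commonly used way to raise the GPU wired-memory cap on recent macOS (Sonoma and later) is the `iogpu.wired_limit_mb` sysctl. This is a hedged sketch, not a recipe — the total-memory and percentage values below match the 128GB / 90% figures above, but verify the sysctl name and the safe headroom on your own machine, and note the setting does not persist across reboots:

```shell
# Sketch: compute a ~90% wired-memory limit for a 128GB machine
# and print the sysctl command to apply it (run it with sudo yourself).
TOTAL_MB=$((128 * 1024))           # 128GB of SoC memory, in MB
LIMIT_MB=$((TOTAL_MB * 90 / 100))  # allow the GPU to wire ~90% (~115GB)
echo "sudo sysctl iogpu.wired_limit_mb=${LIMIT_MB}"
```

Leaving some memory unwired matters: if the GPU wires nearly everything, macOS can become unresponsive under memory pressure.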
I run a code completion server that works like GitHub Copilot. I’m also working on a Mail labeling system using llama.cpp and AppleScript, but it is very much a work-in-progress.
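For anyone curious how a setup like this talks to the model: llama.cpp ships a built-in HTTP server with a `/completion` endpoint. A minimal sketch of a client (the default port 8080 and the field names are assumptions from llama.cpp's server docs; adjust to your own setup):

```python
import json
import urllib.request

def build_payload(prompt, n_predict=64):
    """Build the JSON body for llama.cpp's /completion endpoint."""
    return {"prompt": prompt, "n_predict": n_predict}

def complete(prompt, n_predict=64, host="http://127.0.0.1:8080"):
    """Send a completion request to a locally running llama.cpp server."""
    data = json.dumps(build_payload(prompt, n_predict)).encode()
    req = urllib.request.Request(
        f"{host}/completion",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]
```

A Copilot-style completion server is then mostly glue: take the editor's buffer up to the cursor as the prompt and stream the returned `content` back.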