If you’re using Metal to run your llms, you may have noticed the amount of VRAM available is around 60%-70% of the total RAM - despite Apple’s unique architecture for sharing the same high-speed RAM between CPU and GPU.
It turns out this VRAM allocation can be controlled at runtime using sudo sysctl iogpu.wired_limit_mb=12345
See here: https://github.com/ggerganov/llama.cpp/discussions/2182#discussioncomment-7698315
Previously, it was believed this could only be done with a kernel patch - and that required disabling a macos security feature … And tbh that wasn’t that great.
Will this make your system less stable? Probably. The OS will need some RAM - and if you allocate 100% to VRAM, I predict you’ll encounter a hard lockup, spinning Beachball, or just a system reset. So be careful to not get carried away. Even so, many will be able to get a few more gigs this way, enabling a slightly larger quant, longer context, or maybe even the next level up in parameter size. Enjoy!
≥64GB allows 75% to be used by GPU. ≤32 its ~66%. Not sure about the 36GB machines.