If you’re using Metal to run your LLMs, you may have noticed that the VRAM available to the GPU is only around 60%-70% of the total RAM - despite Apple’s unified memory architecture, which shares the same high-speed RAM between the CPU and GPU.

It turns out this VRAM allocation can be controlled at runtime with: sudo sysctl iogpu.wired_limit_mb=12345 (the value is the new limit in megabytes).

See here: https://github.com/ggerganov/llama.cpp/discussions/2182#discussioncomment-7698315
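
For example, a minimal sketch on a recent macOS release (the 57344 MB figure is just an illustration for a 64 GB machine, and a limit set at runtime like this won’t survive a reboot):

    # Show the current limit (0 appears to mean "use the built-in default")
    sysctl iogpu.wired_limit_mb

    # Raise the limit to 56 GB
    sudo sysctl iogpu.wired_limit_mb=57344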

Previously, it was believed this could only be done with a kernel patch - and that required disabling a macOS security feature, which tbh wasn’t a great trade-off.

Will this make your system less stable? Probably. The OS still needs some RAM - if you allocate 100% to VRAM, expect a hard lockup, a spinning beachball, or an outright system reset. So be careful not to get carried away. Even so, many people will be able to reclaim a few extra gigabytes this way - enough for a slightly larger quant, a longer context, or maybe even the next parameter size up. Enjoy!
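
If you want to script it, here’s a rough sketch that leaves a fixed amount of headroom for the OS (the 8 GB figure is a guess - tune it for your own workload):

    # Total RAM in MB (hw.memsize reports bytes)
    TOTAL_MB=$(( $(sysctl -n hw.memsize) / 1048576 ))

    # Give the GPU wired limit everything except ~8 GB
    sudo sysctl iogpu.wired_limit_mb=$(( TOTAL_MB - 8192 ))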

  • CheatCodesOfLife@alien.top · 1 year ago

    64GB M1 Max here. Before running the command, if I tried to load up goliath-120b: (47536.00 / 49152.00) - fails

    And after sudo sysctl iogpu.wired_limit_mb=57344 : (47536.00 / 57344.00)

    So I guess the default is: 49152
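
    For what it’s worth, the arithmetic lines up (back-of-the-envelope only, not an official figure):

        echo $(( 65536 * 75 / 100 ))   # 49152 - the default here is exactly 75% of 64 GB
        echo $(( 65536 - 57344 ))      # 8192  - the raised limit leaves 8 GB for the OS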

    • fallingdowndizzyvr@alien.top · 1 year ago
      So I guess the default is: 49152

      It is. To be clearer, llama.cpp tells you what the recommendedMaxWorkingSetSize is, which should match that number.

      • bebopkim1372@alien.top · 1 year ago

        Maybe 47536MB is just the net model size. For LLM inference, memory for the context (and an optional context cache) is also needed.
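
        As a rough illustration of that extra context memory (hypothetical model config, not goliath-120b’s actual numbers, fp16 KV cache):

            # 2 (K and V) * layers * ctx * kv_heads * head_dim * 2 bytes, converted to MB
            echo $(( 2 * 80 * 4096 * 8 * 128 * 2 / 1048576 ))   # ~1280 MB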