If you’re using Metal to run your LLMs, you may have noticed that the VRAM available to the GPU is only around 60-70% of total RAM - despite Apple’s unified memory architecture, which shares the same high-speed RAM between CPU and GPU.

It turns out this VRAM allocation can be controlled at runtime with sudo sysctl iogpu.wired_limit_mb=12345 (the value is in MB, as the name suggests).

See here: https://github.com/ggerganov/llama.cpp/discussions/2182#discussioncomment-7698315
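
If you want to sanity-check things first, you can read the current limit and your total RAM before changing anything (a minimal sketch; iogpu.wired_limit_mb applies to Apple Silicon on recent macOS, and a value of 0 reportedly means “use the built-in default”):

    # Current GPU wired-memory limit in MB (0 = built-in default):
    sysctl iogpu.wired_limit_mb

    # Total physical RAM in bytes, handy for picking a limit:
    sysctl -n hw.memsize

Being a runtime sysctl, the setting should reset to the default on reboot.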

Previously, it was believed this could only be done with a kernel patch - and that required disabling a macOS security feature, which honestly wasn’t great.

Will this make your system less stable? Probably. The OS still needs some RAM - and if you allocate 100% to VRAM, I predict you’ll encounter a hard lockup, a spinning beachball, or just a system reset. So be careful not to get carried away. Even so, many of you will be able to get a few more gigs this way, enabling a slightly larger quant, longer context, or maybe even the next parameter size up. Enjoy!
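
To act on the headroom advice concretely, here’s a sketch that sets the limit to total RAM minus a fixed reserve for macOS (the 8 GB reserve is my own arbitrary choice, not a tested recommendation):

    # Set the GPU wired limit to total RAM minus 8 GB of headroom for the OS.
    total_mb=$(( $(sysctl -n hw.memsize) / 1024 / 1024 ))
    headroom_mb=8192   # tune to taste; too little risks the lockups above
    sudo sysctl iogpu.wired_limit_mb=$(( total_mb - headroom_mb ))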

  • bebopkim1372@alien.top · 1 year ago

    My M1 Max Mac Studio has 64GB of RAM. Running sudo sysctl iogpu.wired_limit_mb=57344 did magic!
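
    (For the arithmetic: 57344 MB is 56 × 1024 MB, i.e. 56 of the 64 GB, presumably leaving about 8 GB for macOS.)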

    ggml_metal_init: allocating
    ggml_metal_init: found device: Apple M1 Max
    ggml_metal_init: picking default device: Apple M1 Max
    ggml_metal_init: default.metallib not found, loading from source
    ggml_metal_init: loading '/Users/****/****/llama.cpp/ggml-metal.metal'
    ggml_metal_init: GPU name:   Apple M1 Max
    ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
    ggml_metal_init: hasUnifiedMemory              = true
    ggml_metal_init: recommendedMaxWorkingSetSize  = 57344.00 MiB
    ggml_metal_init: maxTransferRate               = built-in GPU 
    

    Yay!

    • farkinga@alien.top (OP) · 1 year ago

      Yeah! That’s what I’m talking about. Would you happen to remember what it was reporting before? If it’s like the rest, I’m assuming it said something like 40 or 45 GB, right?

      • CheatCodesOfLife@alien.top · 1 year ago

        64GB M1 Max here. Before running the command, if I tried to load up goliath-120b, it failed: (47536.00 / 49152.00)

        And after sudo sysctl iogpu.wired_limit_mb=57344: (47536.00 / 57344.00)

        So I guess the default is: 49152
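
        (If so, that default of 49152 MB is 48 × 1024 MB: exactly 75% of the 64 GB total.)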

        • fallingdowndizzyvr@alien.top · 1 year ago

          So I guess the default is: 49152

          It is. To be more clear, llama.cpp tells you what the recommendedMaxWorkingSetSize is, which should match that number.

          • bebopkim1372@alien.top · 1 year ago

            Maybe 47536 MB is just the net model size. For LLM inference, you also need memory for the context, and optionally for the context cache.
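
            As a rough sketch of that context overhead (the figures below are hypothetical Llama-2-70B-style values, not numbers from goliath-120b):

                # Back-of-envelope f16 KV cache: 2 tensors (K and V) per layer
                # * context length * (kv_heads * head_dim) * 2 bytes per element.
                layers=80; ctx=4096; kv_heads=8; head_dim=128
                echo "$(( 2 * layers * ctx * kv_heads * head_dim * 2 / 1024 / 1024 )) MiB"

            That prints 1280 MiB on top of the weights, and it grows linearly with the context length.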