The title, pretty much.

I’m wondering whether a 70B model quantized to 4-bit would perform better than a 7B/13B/34B model at fp16. It would be great to get some insights from the community.

  • harrro@alien.topB
    1 year ago

    Using a Q3 quant, you can fit a 70B in 36GB of VRAM (I have a weird combo of an RTX 3060 with 12GB and a P40 with 24GB, and I can run a 70B at 3-bit fully on GPU).
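As a rough sanity check on those numbers, here is a back-of-the-envelope sketch of weight storage alone (parameters × bits per weight ÷ 8). It assumes idealized bit widths; real Q3/Q4 formats typically store somewhat more than their nominal bits per weight, and the KV cache and runtime overhead come on top, so treat the results as lower bounds. The function name `weight_gb` is just an illustrative helper.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes): params * bits / 8.

    Assumption: idealized bit width, no KV cache, no runtime overhead.
    """
    return params_billion * bits_per_weight / 8


# Model sizes and precisions mentioned in the thread.
configs = [
    ("70B", 70, 4),
    ("70B", 70, 3),
    ("34B", 34, 16),
    ("13B", 13, 16),
    ("7B", 7, 16),
]

for name, params, bits in configs:
    print(f"{name} @ {bits}-bit: ~{weight_gb(params, bits):.1f} GB")
```

By this estimate a 70B at 3-bit needs roughly 26 GB for weights, which is consistent with it fitting in 36GB with room left for context, while a 34B at fp16 would need about 68 GB.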

      • harrro@alien.topB
        1 year ago

        Yes, llama.cpp will automatically split the model across GPUs. You can also specify what proportion of the model should go on each GPU.

        Not sure about AMD support, but for NVIDIA it’s pretty easy to do.
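For anyone wanting to set the per-GPU proportions by hand, here is a minimal sketch that builds a llama.cpp command line from the VRAM size of each card. It assumes the `--tensor-split` and `--n-gpu-layers` options as found in recent llama.cpp builds, and that the binary is called `llama-cli` (older builds named it `main`); the model filename and the helper `split_args` are hypothetical. Check `--help` on your build before relying on it.

```python
def split_args(model_path: str, vram_gb: list[float]) -> list[str]:
    """Build an argument list that splits a model in proportion to each GPU's VRAM.

    Assumptions: binary name `llama-cli`; a large --n-gpu-layers value is used
    to request that all layers be offloaded to GPU.
    """
    # --tensor-split takes comma-separated proportions, one per GPU.
    ratios = ",".join(str(v) for v in vram_gb)
    return [
        "llama-cli",
        "-m", model_path,
        "--n-gpu-layers", "999",
        "--tensor-split", ratios,
    ]


# Example: the 12GB RTX 3060 + 24GB P40 setup described above.
# "model-70b-q3.gguf" is a placeholder filename.
args = split_args("model-70b-q3.gguf", [12, 24])
print(" ".join(args))
```

The resulting list can be passed to `subprocess.run(args)`. Splitting in proportion to VRAM is only a starting point; you may need to shift the ratio if one card also drives a display or holds more of the context buffers.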