Hi all, I bought a new PC last year, and after experimenting with LLMs over the last few months I have some doubts. I can run 7B, 13B and even 20B/30B models reasonably fast, but a 70B (I use the Q3 quantization in GGUF format) runs at 1 t/s on Windows 11. I'm thinking about how to upgrade my PC so I can get at least 2-3 t/s with a Q4 70B. My specs are:

-MSI PRO B760-P WIFI DDR4

-Intel 13700 CPU (NOT the K model, and it's a little undervolted)

-Nvidia 4080 16GB GPU

-2x16GB 3200 MHz CL16 RAM

-2 NVMe SSDs

-1 old HDD from my old computer

-Seasonic 850W gold psu

The options I thought of were:

a) Replace the old HDD with a bigger SATA SSD, make a partition and install a Linux distro that I would dual-boot and use only for LLMs.

b) Add a 3060 12GB or a 4060 Ti 16GB as a second GPU. I would only use the second GPU for LLMs.

c) Both?

So, what are the pros and cons? Are there other options? Can my PSU support a second GPU? Is there a difference in performance when running the models from an NVMe SSD compared to a SATA SSD? Would there be compatibility problems using the 4080 and a 3060 together, since those GPUs are from different generations? How much performance improvement can I expect?

Thanks a lot for the help!

  • Imaginary_Bench_7294@alien.topB
    1 year ago

    If your goal is to run the model locally, your best option is to increase your VRAM as much as you can. The main things to consider are the card's VRAM bandwidth and capacity. For a 70B 4-bit model you’re looking at needing somewhere around 35-40 GB of VRAM.

    The model alone will take roughly 35GB, the loader up to another 3GB, and then the full context length of 4096 could spill it over 40GB.
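
    The arithmetic behind those numbers can be sketched as follows. This is only a rough estimate: the 3 GB loader figure comes from the comment above, while the 2 GB context allowance and the function name are assumptions made up for illustration.

```python
def estimate_vram_gb(params_billion, bits_per_weight,
                     loader_overhead_gb=3.0, context_gb=2.0):
    """Rough VRAM estimate in GB: weights + loader overhead + context cache.

    loader_overhead_gb: "up to another 3GB" per the comment above.
    context_gb: ASSUMED placeholder for a full 4096-token context.
    """
    # billions of parameters * bits per weight / 8 bits per byte = GB of weights
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + loader_overhead_gb + context_gb


# Weights only for a 70B model at 4 bits per weight
weights_only = estimate_vram_gb(70, 4, loader_overhead_gb=0, context_gb=0)
# Weights plus loader overhead and the assumed context allowance
total = estimate_vram_gb(70, 4)

print(f"weights only: {weights_only:.1f} GB")
print(f"with loader and context: {total:.1f} GB")
```

    Swapping in a different quantization (for example 4.65 bits per weight) shows how quickly the requirement grows past what two 16 GB cards can hold.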

    I run LZLV 70B at 4.65 bits on 2x 3090s and get 4.5+ T/s using ExLlamaV2 and the EXL2 format. That is at full context length, in chat mode in Oobabooga.

    In the default/notebook modes I can get 7+ T/s at full context length.

    Now, your power supply may be on the low side for adding another card without putting power limits on things. I’ll use stock power settings as a reference.

    4080 is rated to hit 320 W

    13700 is rated at 65 W (base power)

    Let’s add in another 100 W for SSDs, HDDs, mobo and cooling solutions.

    So you’re looking at 485 W of draw. You should always shoot for a minimum of 10-15% overhead, which cuts the maximum draw you should plan for on an 850 W unit down to 722-765 W.

    That leaves you 237-280w of possible room to play with.
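
    The same power budget as a small script, in case you want to plug in other parts. The component wattages are the ones listed above; the function name is made up for illustration, and the 170 W figure for a 3060 is an assumption to check against the spec sheet of the exact card.

```python
def psu_headroom_w(psu_watts, loads_w, margin):
    """Watts left over after keeping `margin` (e.g. 0.15 = 15%) of the PSU in reserve."""
    usable = psu_watts * (1 - margin)
    return usable - sum(loads_w)


# Stock power figures from the comment above
loads = {
    "RTX 4080": 320,
    "i7-13700": 65,
    "drives, motherboard, cooling": 100,
}

current_draw = sum(loads.values())
low = psu_headroom_w(850, loads.values(), margin=0.15)
high = psu_headroom_w(850, loads.values(), margin=0.10)

print(f"current draw: {current_draw} W")
print(f"room for a second card: {low:.1f} W to {high:.1f} W")

# ASSUMED rating for a candidate second card; verify for the actual model
second_card_w = 170
print("fits with 15% margin:", second_card_w <= low)
```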

    So it’s possible to add another video card to the computer, but since a Q4 70B still won’t fit entirely in VRAM, you’ll have to use GGUF and llama.cpp to split the work between the video cards and the CPU. That will probably get you up to 2, maybe 3 T/s at the start, but I don’t know about the full 4096 context.
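
    To get a feel for how much of the model would land on the GPUs in that mixed setup, here is a rough sketch. It assumes a Llama 2 70B style model with 80 equally sized layers, a 35 GB Q4 file, and 2 GB kept free on each card for context and buffers; the reserve and the helper name are illustrative guesses, not measured values.

```python
def layers_that_fit(model_gb, n_layers, vram_gb_per_gpu, reserve_gb_per_gpu=2.0):
    """Estimate how many whole layers can be offloaded across the given GPUs.

    Treats every layer as the same size (an approximation) and keeps
    reserve_gb_per_gpu free on each card for context and scratch buffers.
    """
    per_layer_gb = model_gb / n_layers
    total = 0
    for vram_gb in vram_gb_per_gpu:
        usable_gb = max(vram_gb - reserve_gb_per_gpu, 0)
        total += int(usable_gb // per_layer_gb)  # whole layers only
    return min(total, n_layers)


MODEL_GB = 35   # 70B at 4 bits per weight, as estimated in the comment
N_LAYERS = 80   # ASSUMED: layer count of a Llama 2 70B style model

only_4080 = layers_that_fit(MODEL_GB, N_LAYERS, [16])
with_3060 = layers_that_fit(MODEL_GB, N_LAYERS, [16, 12])
with_4060ti = layers_that_fit(MODEL_GB, N_LAYERS, [16, 16])

print(f"4080 alone:          {only_4080} of {N_LAYERS} layers on GPU")
print(f"4080 + 3060 12GB:    {with_3060} of {N_LAYERS} layers on GPU")
print(f"4080 + 4060 Ti 16GB: {with_4060ti} of {N_LAYERS} layers on GPU")
```

    The resulting count is the kind of number you would pass to llama.cpp’s `--n-gpu-layers` option as a starting point, then adjust up or down depending on whether the cards actually run out of memory. Whatever does not fit stays in system RAM and runs on the CPU, which is why the speed-up is limited.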