As the title says, when combining a P40 and an RTX 3090, a few use cases come to mind and I wanted to know if they could be done? I'd greatly appreciate your help:
First, could you run larger models where they are computed on the 3090 and the P40 is just used for VRAM offloading, and would that be faster than system memory?

Second, could you compute on both of them in an asymmetric fashion, like putting more layers on the RTX 3090 and fewer on the P40?
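
Something like this asymmetric split is what I have in mind, if it is possible at all (just a sketch with llama-cpp-python; the model path and the split ratio are placeholders):

    # Sketch of an asymmetric layer split with llama-cpp-python (GGUF model).
    # The model path and the split ratio are placeholders; the idea is that the
    # 3090 (first entry) takes the bigger share and the P40 gets the rest.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/some-70b.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=-1,            # offload every layer to the GPUs
        tensor_split=[0.75, 0.25],  # ~3/4 of the tensors on GPU 0, the rest on GPU 1
    )

    out = llm("Q: Does this split work?\nA:", max_tokens=32)
    print(out["choices"][0]["text"])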

Lastly, and that one probably works, you could run two different instances of LLMs, for example a bigger one on the 3090 and a smaller one on the P40, I assume.
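
For that last case I picture something like the sketch below: two completely separate backends, each pinned to one card (the launch command, model paths and ports are placeholders for whatever backend one actually uses):

    # Sketch: two independent inference servers, each pinned to one GPU via
    # CUDA_VISIBLE_DEVICES so it only ever sees "its" card.
    # The launch command, model paths and ports are placeholders.
    import os
    import subprocess

    def launch(gpu: int, model_path: str, port: int) -> subprocess.Popen:
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        return subprocess.Popen(
            ["python", "server.py", "--model", model_path, "--port", str(port)],
            env=env,
        )

    big = launch(0, "models/big-model", 5000)      # 3090 gets the bigger model
    small = launch(1, "models/small-model", 5001)  # P40 gets the smaller one

    big.wait()
    small.wait()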

  • Noxusequal@alien.top (OP) · 1 year ago

    Okay, thank you guys. So this only really makes sense if I want to run different models on the different GPUs, or if I have something so big that I need the 48 GB of VRAM and can deal with the slower speeds :) Thanks for the feedback.

  • Tiny_Arugula_5648@alien.top · 1 year ago

    No, absolutely not… not how you described it. The issue isn't about RAM, it's about the number of calculations that need to be done. With GPUs you need to load the data into VRAM, and that data is only available for that GPU's calculations; it's not a shared memory pool. So if you load data into the P40, only the P40 will be able to use it for its calculations.

    Yes, you can run the model on multiple GPUs. If one of them is very slow but has lots of VRAM, then the layers you offload to that card will be processed slowly. No, there is no way to speed up the calculations; VRAM only keeps the weights readily available so you're not constantly loading and unloading them.
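
    Roughly what that split looks like with the Hugging Face transformers/accelerate loader, as a sketch only (the model id and the per-GPU memory caps are placeholders):

        # Sketch: split one model across two GPUs with explicit per-GPU caps.
        # device_map="auto" fills GPU 0 (the 3090) up to its cap and spills the
        # remaining layers onto GPU 1 (the P40); each GPU only computes the
        # layers that sit in its own VRAM, so the P40's share runs at P40 speed.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        name = "some-org/some-13b-model"  # placeholder model id
        tokenizer = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(
            name,
            torch_dtype=torch.float16,
            device_map="auto",
            max_memory={0: "22GiB", 1: "20GiB"},  # 3090 first, then the P40
        )

        inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
        print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))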

      • ReturningTarzan@alien.top · 1 year ago

        It’s not. If the only thing you’re using the P40 for is as swap space for the 3090, then you’re better off just using system RAM, since you’ll have to swap via system RAM anyway.

    • Hoppss@alien.top · 1 year ago

      This is not true. I have split two separate LLM models partially across a 4090 and a 3080 and had them both run inference at the same time.

      This can be done in oobabooga’s repo with just a little tinkering.
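
      Outside the webui, the same idea can be sketched directly with the transformers loader: each model gets its own per-GPU memory caps so both hold a slice of both cards (model names and cap sizes are placeholders):

          # Sketch: two models, each split across both GPUs with its own per-GPU
          # memory caps, loaded side by side so both can run inference.
          # Model names and cap sizes are placeholders.
          import torch
          from transformers import AutoModelForCausalLM, AutoTokenizer

          def load_split(name, caps):
              tok = AutoTokenizer.from_pretrained(name)
              mdl = AutoModelForCausalLM.from_pretrained(
                  name,
                  torch_dtype=torch.float16,
                  device_map="auto",  # spread layers over the GPUs listed in caps
                  max_memory=caps,
              )
              return tok, mdl

          # Each model gets a slice of GPU 0 and GPU 1 (placeholder sizes).
          tok_a, model_a = load_split("some-org/model-a", {0: "14GiB", 1: "6GiB"})
          tok_b, model_b = load_split("some-org/model-b", {0: "8GiB", 1: "4GiB"})

          for tok, mdl in ((tok_a, model_a), (tok_b, model_b)):
              ids = tok("Hello", return_tensors="pt").to(mdl.device)
              print(tok.decode(mdl.generate(**ids, max_new_tokens=16)[0]))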