This may interest anyone running models across two 3090s: in llama.cpp/koboldcpp there's a performance increase if your two GPUs support peering with one another (check with nvidia-smi topo -p2p r). Peering wasn't working on my particular motherboard, so I installed an NVLink bridge and got a bump in token generation speed: an extra 10-20% with a 70B model, and more with smaller models (though smaller models run much faster still if you can fit them on a single GPU).
I have no idea what the performance difference is between an NVLink bridge and peering over PCIe if your system supports the latter. I also tested exl2 and saw no difference; I don't think it implements any peer-to-peer optimisations.
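If you'd rather check peer access programmatically than via nvidia-smi, here's a minimal CUDA sketch using the standard cudaDeviceCanAccessPeer runtime call (it assumes your two GPUs sit at device indices 0 and 1; compile with something like nvcc p2p_check.cu -o p2p_check):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount < 2) {
        printf("Need at least two GPUs (found %d)\n", deviceCount);
        return 1;
    }

    // Peer access capability is per-pair and per-direction, so check both ways.
    // Device indices 0 and 1 are an assumption; adjust for your topology.
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    printf("GPU0 -> GPU1 peer access: %s\n", canAccess01 ? "yes" : "no");
    printf("GPU1 -> GPU0 peer access: %s\n", canAccess10 ? "yes" : "no");

    return 0;
}

If both directions report "no" even with an NVLink bridge installed, check that the bridge is seated properly and that your driver actually exposes the link (nvidia-smi nvlink -s).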