At work we have four A100 cards (0,1 NVLinked and 2,3 NVLinked), and I'm curious how to connect all four cards together. Also, when running on all four A100s, performance seems slower and token throughput is much lower than on the 4060 Ti I use at home. Why might that be? nvidia-smi shows the VRAM fully allocated, but volatile GPU utilization is not 100% on all four cards, usually something like 100, 70, 16, 16. (KVM passthrough, RHEL 8 server)
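In case it helps with diagnosing this, here is a minimal sketch of how you could dump the interconnect topology and per-GPU utilization from inside the guest. It just wraps standard nvidia-smi options via subprocess and assumes nvidia-smi is on the PATH; nothing here is specific to your setup.

```python
import subprocess

# Show how the four A100s are connected (NVLink vs. PCIe/PHB/SYS paths).
# In a KVM passthrough guest this may show only PHB/SYS links even when
# the host has NVLink bridges on pairs 0-1 and 2-3.
print(subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True).stdout)

# Per-GPU utilization and memory, sampled once; useful to confirm the
# 100/70/16/16 pattern while a model is actually generating.
print(subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,utilization.gpu,memory.used,memory.total",
     "--format=csv"],
    capture_output=True, text=True).stdout)
```

A utilization pattern like 100/70/16/16 usually points at layer-split (pipeline-style) sharding, where only one card is busy at a time, rather than true tensor parallelism across all four.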
Try a different version of the model.
What performance do you get from a GGUF q4-q6 quant on a single card?
Sorry for the late reply. I already tested that; it's better than the CodeLlama 13B model, but only about 30 tokens/s …
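For anyone wanting to reproduce that single-card number, here is a minimal sketch using llama-cpp-python; the model path, quant, prompt, and context size are placeholders, not values from this thread, and it assumes the package was built with CUDA support.

```python
import time
from llama_cpp import Llama  # assumes llama-cpp-python installed with CUDA support

# Pin the test to one GPU so the multi-card topology doesn't matter here
# (e.g. launch with CUDA_VISIBLE_DEVICES=0 set in the environment).
llm = Llama(
    model_path="model-q4_k_m.gguf",  # placeholder path and quant level
    n_gpu_layers=-1,                 # offload all layers to the single GPU
    n_ctx=4096,
)

prompt = "Write a Python function that reverses a string."
start = time.time()
out = llm(prompt, max_tokens=256)
elapsed = time.time() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```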