Hi all, I'm wondering if there's a way to spread the load of a local LLM across multiple hosts, instead of adding GPUs, to speed up responses. My hosts don't have GPUs since I want to be power-efficient, but they each have a decent amount of RAM (128 GB). Thanks for any ideas.
On another note, these GPU manufacturers need to get their heads out of their asses and start cranking out cards with much higher memory capacities. The first one to do it cost-effectively will gain massive market share and huge profits. Nvidia's A100 and the like don't qualify, since they're prohibitively expensive.