Was wondering if there's any way to use a bunch of old equipment like this to build an at-home crunch center for running your own LLM, and whether it would be worth it.
Hopefully the proposed S-LoRA will let us do more with less.
That series of Nvidia GPUs didn't have tensor cores yet; I believe those started with the 20xx series. I'm not sure how much that matters for inference vs. training/fine-tuning, but it's worth doing more research. From what I gathered, the answer is "no" unless you use a 10xx card for something like monitor output, TTS, or another smaller co-LLM task that you don't want taking VRAM away from your main LLM GPUs.
I wish they'd come up with some extendable tensor chips that could work with old laptops.
Currently, 7B is the only model size we can run comfortably. Even 13B is slower and needs quite a bit of adjustment.
Another consideration: I was told by someone with multiple cards that if you split your layers across multiple cards, they don't all process the layers simultaneously.
So, if you are on 3x cards, you don't get the parallel benefit of all cards working at the same time. It processes layers on card 1, then card 2, then card 3.
The slowest card will obviously drag everything down to its pace. Not sure what this does to model load times or your electricity bill, and you also need a system big enough to physically fit them all.
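For anyone wanting to try that layer-split setup, here's a minimal sketch using llama-cpp-python (my assumption, since the quants mentioned in this thread are GGUF); the model path and the three-way split ratios are placeholders, not a recommendation:

```python
from llama_cpp import Llama

# Minimal sketch, assuming a CUDA-enabled llama-cpp-python build and 3 GPUs.
# Layers get distributed across the visible cards in the given proportions,
# but each token still flows through card 1 -> card 2 -> card 3 in sequence,
# so you're pooling VRAM, not stacking speed.
llm = Llama(
    model_path="models/llama-2-70b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,               # offload all layers to the GPUs
    tensor_split=[1.0, 1.0, 1.0],  # even split across 3 cards; skew this toward the bigger cards
    n_ctx=4096,
)
print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```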
The ONLY Pascal card worth bothering with is the P40. It's not fast, but it's the cheapest way to get a whole bunch of usable VRAM. Nothing else from that generation is worth the effort.
I tried it. Something like 1.2 tokens/sec inference on Llama 70B with a mix of cards (but mainly 4x 1080s). The process would crash occasionally. Ideally every card would have the same VRAM.
Going to try it with 1660 Tis. I think they may be the "sweet spot" for power to price to performance.
Did you use some q3 gguf quant with this?
You might as well use the cards if you have them already. I'm currently getting around 5-6 tokens per second running nous-capybara 34b q4_k_m on a 2080 Ti 22GB and a P102 10GB (basically a semi-lobotomized 1080 Ti). The P102 does bottleneck the 2080 Ti, but hey, at least it runs at a near-usable speed! If I try running on CPU (I have an R9 3900) I get something closer to 1 token per second.
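Quick back-of-the-envelope check on why that combo fits (rough assumptions on my part, not measurements):

```python
# Rough VRAM estimate for a 34B model at q4_k_m, which averages
# roughly 4.8 bits per weight (approximate figure, not exact).
params = 34e9
bits_per_weight = 4.8
weight_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weight_gb:.0f} GB of weights")  # ~20 GB, so 22 GB + 10 GB leaves room for the KV cache/context
```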
How did you get your 2080 Ti to 22GB of VRAM?
Modded cards are quite easy to obtain in China.