China is retrofitting consumer RTX4090s with 2 slot blower for ML

--dany--@alien.top · 3 years ago

China is retrofitting consumer RTX4090s with 2 slot blower for ML

fallingdowndizzyvr@alien.top · 3 years ago

The used market only a year ago was flooded with things like mi25’s and above that were being liquidated.

The MI25 is finally getting the love it deserves. I wish I had bought more when they were $65-$70 a few months ago. But I was hoping they would go lower. Even last month or so, I think I saw that they were $90. Right now, I just checked before posting, the seller with the most is selling them for $160. Crazy.

By the way, the one I got is in really good shape. As in really good. If the seller told me they were new, I would believe it. There’s not a speck of dust on it. Like nowhere, and I looked deep into the fins of the heatsink. Even the fingers on the slot looked basically new.

The only upside compared to the Crypto boom I guess is that with AI based use cases is that PCIe bus speeds matter and this is stopping people buying anything and everything then slapping 8 GPU’s in an AI mining rig.

I don’t think that’s blanket true. I think it really depends what you do with it. I can think of a couple of uses off the top of my head where 8 GPUs sitting on janky PCIe 1x would be fine.

1. Use them as a team. Nothing says you can only use them to infer one large model. You can run 8 7b-13b models, one model per card. The 1x speed wouldn’t really matter in that case after the model is loaded. Having a team of small models run instead of 1 large model is a valid way to go.
2. Batch process 8 different prompts on a large model spread across the GPUs. Since inference is sequential, only 1 GPU is active at a time when processing a single prompt. The other 7 GPUs are idle. Don’t let them idle. Vectorize it. Process 8 or more prompts at the same time. Once the vector is full, all 8 GPUs will be running. Sure, the t/s for any one prompt won’t be fast, but the overall throughput t/s for all the prompts will be. It would be best to keep the prompts coming and thus the vector full to keep all GPUs running. So a good application for this is a server that is inferring multiple prompts from multiple users. Or multiple prompts from the same user. Or the same prompt 8 different times, since you can ask the same model the same question 8 times and get 8 different answers. Let it process it 8 times and pick the best answer.
3. There are techniques that allow inference to be parallelized. Those may run great on a mining rig with 8 GPUs.
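To put rough numbers on the batching idea, here is a toy simulation, not a benchmark. It assumes the model is split by layers across the GPUs, every GPU takes exactly one tick per token, and PCIe transfer time is ignored. The name `simulate_pipeline` and its parameters are made up for illustration and don't come from any real inference library.

```python
from collections import deque


def simulate_pipeline(num_gpus: int, num_prompts: int, tokens_per_prompt: int):
    """Idealized layer-split pipeline (hypothetical toy model).

    Each token has to pass through GPU 0..num_gpus-1 in order, one tick per
    GPU. Returns (utilization, ticks), where utilization is the fraction of
    GPU time slots that did work.
    """
    if min(num_gpus, num_prompts, tokens_per_prompt) < 1:
        raise ValueError("all arguments must be >= 1")

    # Each queue entry is the number of tokens a prompt still has to generate.
    waiting = deque([tokens_per_prompt] * num_prompts)
    # stages[i] holds the remaining-token count of the prompt on GPU i, or None.
    stages = [None] * num_gpus
    busy_slots = 0
    ticks = 0

    while waiting or any(s is not None for s in stages):
        # GPU 0 picks up the next prompt that is ready for a new token.
        if stages[0] is None and waiting:
            stages[0] = waiting.popleft()

        busy_slots += sum(s is not None for s in stages)
        ticks += 1

        # Everything moves one GPU down the line; the last GPU emits a token.
        finished = stages[-1]
        stages = [None] + stages[:-1]
        if finished is not None and finished > 1:
            waiting.append(finished - 1)  # go around again for the next token

    return busy_slots / (ticks * num_gpus), ticks


if __name__ == "__main__":
    for prompts in (1, 8):
        util, ticks = simulate_pipeline(8, prompts, 100)
        total_tokens = prompts * 100
        print(f"{prompts} prompt(s): utilization {util:.1%}, "
              f"{total_tokens / ticks:.3f} tokens/tick overall")
```

Under those assumptions, one prompt on 8 GPUs keeps the cards 12.5% busy, while 8 prompts in flight keep them about 99% busy. The per-prompt speed stays the same; only the combined throughput goes up, which is the whole point of keeping the vector full.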

So it’s far from useless to repurpose an old mining rig. You just have to be creative.

Sidestepping GPU ban, Chinese factories dismantle and transform Nvidia RTX 4090 gaming cards into AI accelerators