Currently I have 12+24GB of VRAM, and I get Out Of Memory errors all the time when I try to fine-tune 33B models. 13B is fine, but the outcome is not very good, so I would like to try 33B. I wonder if it’s worth replacing my 12GB GPU with a 24GB one. Thanks!
I think you may need to shard the optimizer state and gradients across your GPUs. I’ve been using DeepSpeed and have had good success with it. Here is a writeup that compares DeepSpeed stages 2 and 3: [RWKV-infctx] DeepSpeed 2 / 3 comparisons | RWKV-InfCtx-Validation – Weights & Biases (wandb.ai). Look at the bottom of the article for an accessible overview. I’m not the author, and I haven’t validated the findings. I think distributed training tools are becoming more and more necessary. I suppose the other option is quantization, but that may risk quality loss. Here is a discussion on that: https://www.reddit.com/r/LocalLLaMA/comments/153lfc2/quantization_how_much_quality_is_lost/
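To give a rough feel for why sharding matters, here is a back-of-the-envelope sketch. It assumes full fine-tuning with Adam in mixed precision, using the usual rule of thumb of about 16 bytes per parameter (2 for weights, 2 for gradients, 12 for optimizer state), and it ignores activations and framework overhead. Your setup may differ (parameter-efficient methods need far less), so treat the numbers as ballpark only. The helper names `per_gpu_gb` and `weights_gb` are made up for illustration, and the `ds_config` values are a starting point I have not tuned for your hardware.

```python
import json

# Rule-of-thumb bytes per parameter for mixed-precision Adam (an assumption,
# not a measurement; activations and buffers are NOT included).
BYTES_WEIGHTS = 2   # fp16/bf16 weights
BYTES_GRADS = 2     # fp16/bf16 gradients
BYTES_OPTIM = 12    # fp32 master weights + Adam momentum + variance (4 + 4 + 4)


def per_gpu_gb(n_params, n_gpus, stage):
    """Approximate model-state memory per GPU, in GB (1e9 bytes), for ZeRO stage 0-3."""
    weights = BYTES_WEIGHTS * n_params
    grads = BYTES_GRADS * n_params
    optim = BYTES_OPTIM * n_params
    if stage >= 1:   # stage 1 shards the optimizer state
        optim /= n_gpus
    if stage >= 2:   # stage 2 also shards the gradients
        grads /= n_gpus
    if stage >= 3:   # stage 3 also shards the weights themselves
        weights /= n_gpus
    return (weights + grads + optim) / 1e9


def weights_gb(n_params, bits):
    """Memory for the weights alone, in GB, at a given bit width (quantization)."""
    return n_params * bits / 8 / 1e9


# Illustrative DeepSpeed config: ZeRO stage 3 with CPU offload.
# The batch size and accumulation values are placeholders, not recommendations.
ds_config = {
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

if __name__ == "__main__":
    for stage in range(4):
        print(f"33B, 2 GPUs, ZeRO stage {stage}: {per_gpu_gb(33e9, 2, stage):.0f} GB per GPU")
    print(f"33B weights at 4-bit: {weights_gb(33e9, 4):.1f} GB")
    print(json.dumps(ds_config, indent=2))
```

Under these assumptions a 33B model needs roughly 528 GB of model state unsharded, and still about 264 GB per GPU with stage 3 across two GPUs. So on two consumer cards, sharding alone would not be enough for full fine-tuning; CPU offload or quantization would have to do most of the work.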
Thank you! It looks very deep to me, I will look into it.