Currently I have 12 + 24 GB of VRAM and I get Out Of Memory errors all the time when I try to fine-tune 33B models. 13B is fine, but the outcome is not very good, so I would like to try 33B. I wonder if it’s worth it to replace my 12GB GPU with a 24GB one. Thanks!

  • FullOf_Bad_Ideas@alien.topB
    1 year ago

    8-bit? 4-bit QLoRA? You can train 34B models on 24GB. You might need to set up DeepSpeed if you want to use both cards, or just train on the 24GB card alone. PSA if you are using axolotl: disabling sample packing is required to enable Flash Attention 2; otherwise, flash attention will simply not be enabled. This can spare you some memory. I can train a Yi-34B QLoRA with rank 16 and ctx 1100 (and maybe a bit more) on a 24GB Ampere card.
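
    For reference, a minimal QLoRA-style sketch with transformers + peft + bitsandbytes (not axolotl itself); the model id, target modules, and alpha/dropout values are assumptions, while rank 16, ~1100 context, and Flash Attention 2 mirror the comment:

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    model_id = "01-ai/Yi-34B"  # assumed Hugging Face repo id

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # 4-bit weights (QLoRA)
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        attn_implementation="flash_attention_2",  # needs flash-attn installed
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)

    lora_config = LoraConfig(
        r=16,                                   # rank 16, as in the comment
        lora_alpha=32,                          # assumed
        lora_dropout=0.05,                      # assumed
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Train with batch size 1 and sequences around 1100 tokens to stay inside 24 GB.
    ```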

  • a_beautiful_rhind@alien.topB
    1 year ago

    Should work on a single 24GB GPU with either QLoRA or alpaca_lora_4bit. You won’t get big batches or big context, but it’s good enough.

  • Aaaaaaaaaeeeee@alien.topB
    1 year ago

    Start with LoRA rank=1, 4-bit, flash-attention-2, context 256, batch size=1, and increase from there until you reach your maximum. QLoRA on a 33B definitely works on just 24GB; it worked a few months ago.
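
    A rough sketch of that grow-until-OOM probing, assuming a hypothetical run_training_step(ctx_len) callable that does one forward/backward pass at the settings above; only the context length is varied here:

    ```python
    import torch

    def find_max_context(run_training_step, start=256, step=128, limit=4096):
        """Raise the context length until a step hits CUDA OOM; return the last length that fit.

        run_training_step is a hypothetical callable doing one forward/backward pass
        at the given sequence length (batch size 1, LoRA rank 1, 4-bit, flash-attn-2).
        """
        last_ok = None
        ctx = start
        while ctx <= limit:
            try:
                run_training_step(ctx)
                last_ok = ctx
                ctx += step
            except torch.cuda.OutOfMemoryError:
                torch.cuda.empty_cache()
                break
        return last_ok
    ```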

  • kpodkanowicz@alien.topB
    1 year ago

    I have some issues with flash attention, and with 48GB I can go up to rank 512 with batch size 1 and max length 768. My last run was max length 1024, batch size 2, gradient accumulation 32, rank 128, and it gives pretty nice results.
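
    Those last-run settings map onto Hugging Face TrainingArguments roughly like this (a sketch; the output dir, precision, alpha, and epoch count are not in the comment and are assumptions):

    ```python
    from transformers import TrainingArguments
    from peft import LoraConfig

    # Effective batch = 2 (micro batch) x 32 (gradient accumulation) = 64 samples per update.
    training_args = TrainingArguments(
        output_dir="out",                  # assumed
        per_device_train_batch_size=2,     # "batch size 2"
        gradient_accumulation_steps=32,    # "gradient accumulation 32"
        bf16=True,                         # assumed precision
        num_train_epochs=1,                # assumed
    )

    lora_config = LoraConfig(r=128, lora_alpha=256, task_type="CAUSAL_LM")  # rank 128; alpha assumed
    # Tokenize with max_length=1024 to match the "max length 1024" setting.
    ```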

  • Updittyupup@alien.topB
    1 year ago

    I think you may need to try sharding the optimizer state and gradients. I’ve been using DeepSpeed and have had some good success. Here is a writeup that compares the different DeepSpeed iterations: [RWKV-infctx] DeepSpeed 2 / 3 comparisons | RWKV-InfCtx-Validation – Weights & Biases (wandb.ai). Look at the bottom of the article for an accessible overview. I’m not the author, and I haven’t validated the findings. I think distributed tools are getting more and more necessary. I suppose the other option is quantization, but that may risk quality loss. Here is a discussion on that: https://www.reddit.com/r/LocalLLaMA/comments/153lfc2/quantization_how_much_quality_is_lost/
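
    As a sketch of the optimizer/gradient sharding idea, here is a ZeRO stage 2 DeepSpeed config expressed as a Python dict, which the Hugging Face Trainer accepts via TrainingArguments(deepspeed=...); the batch sizes and the CPU offload choice are placeholders:

    ```python
    # ZeRO-2 shards optimizer state and gradients across GPUs; offloading the
    # optimizer to CPU RAM trades speed for VRAM. Values below are placeholders.
    ds_config = {
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {"device": "cpu"},
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
        "bf16": {"enabled": True},
        "train_micro_batch_size_per_gpu": 1,
        "gradient_accumulation_steps": 16,
    }

    # e.g. TrainingArguments(..., deepspeed=ds_config), then launch with `deepspeed`
    # or `accelerate launch` across both GPUs.
    ```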

  • kevdawg464@alien.topB
    1 year ago

    I’m a complete noob to LLMs. What is the “b” in a 33B model? And what would be the best place to start learning about building my own local models?

    • Sabin_Stargem@alien.topB
      1 year ago

      It isn’t practical for most people to make their own models; that requires industrial hardware. The “b” is billion, indicating the number of parameters and therefore the size and potential intelligence of the model. Right now, the Yi-34B models are the best at that size.
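
      A rough back-of-envelope for how that parameter count turns into memory; the 20% overhead factor is an assumed ballpark, not a measured number:

      ```python
      def rough_vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
          """Approximate memory for the weights alone, times a rough overhead factor."""
          weight_bytes = params_billion * 1e9 * bits_per_weight / 8
          return weight_bytes * overhead / 1e9

      # A 33B model: ~66 GB of weights in fp16, ~33 GB in 8-bit, ~16.5 GB in 4-bit.
      print(rough_vram_gb(33, 16))  # ≈ 79 GB with the assumed overhead
      print(rough_vram_gb(33, 4))   # ≈ 20 GB with the assumed overhead
      ```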

      I recommend a Mistral 7B as your introduction to LLMs. They are small but fairly smart for their size. Get your model from Hugging Face; something like Mistral Dolphin should do fine.

      I recommend KoboldCPP for running a model, as it is very simple to use. It uses the GGUF format, which lets you split a model across your GPU, RAM, and CPU. Other formats are GPU-only, offering greater speed but less flexibility.
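
      KoboldCPP handles this through its launcher UI, but the underlying idea (a GGUF file with some layers offloaded to the GPU and the rest kept in system RAM) looks roughly like this with llama-cpp-python; the file path and layer count are placeholders:

      ```python
      from llama_cpp import Llama

      # Load a quantized GGUF model, putting some transformer layers on the GPU
      # and running the rest on CPU/RAM. Path and n_gpu_layers are placeholders.
      llm = Llama(
          model_path="models/mistral-7b-instruct.Q4_K_M.gguf",
          n_gpu_layers=20,   # raise until VRAM is full; -1 offloads every layer
          n_ctx=4096,        # context window
      )

      out = llm("Explain what the 7B in a model name means.", max_tokens=128)
      print(out["choices"][0]["text"])
      ```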