There has been a lot of movement around and below the 13b parameter bracket in the last few months but it’s wild to think the best 70b models are still llama2 based. Why is that?

We have 13b models, like the 8-bit bartowski/Orca-2-13b-exl2 quant, approaching or even surpassing the best 70b models now.

  • obvithrowaway34434@alien.top
    1 year ago

    Mistral has already shown that it’s mostly about the data rather than the model. So why waste loads of money and time on training something that no average consumer can run locally?

  • a_beautiful_rhind@alien.top
    1 year ago

    What do you mean? Someone just posted 100b, 200b and 600b models, and several 120b models have been released in the past couple of weeks.

  • Markon101@alien.top
    1 year ago

    Google just released a 1.8T model that’s partially trained. You would need a ton of H100s just to run it though, forget training it lol.

  • extopico@alien.top
    1 year ago

    The problem with 70B is that it is incrementally better than smaller models, but is still nowhere near competitive with GPT-4, so it is stuck in no man’s land.

    Once we finally get an open source model or architecture that can spar even with GPT-4, let alone 5, there will be much more interest in large models.

    Regarding Falcon Chat 180B, it’s no better in my tests and for my use cases than fine tuned Llama 2 70B, which is a shame. It makes me think that there is something fundamentally wrong with Falcon, besides the laughably small context window.

  • candre23@alien.top
    1 year ago

    It’s adorable that you think any 13b model is anywhere close to a 70b llama2 model.

  • Antique_Elk9380@alien.top
    1 year ago

    Diminishing returns and cost of compute.

    If people saw better returns from larger models, there would be more.

  • thereisonlythedance@alien.top
    1 year ago

    I’ve been training a lot lately, mostly on RunPod: a mix of fine-tuning Mistral 7B and training LoRAs and QLoRAs on 34B and 70B models. My main takeaway is that the LoRA outcomes are just… not so great. Whereas I’m very happy with the Mistral fine-tunes.

    I mean, it’s fantastic we can tinker with a 70B at all, but it doesn’t matter how good your dataset is, you just can’t have the same impact as you can with a full finetune. I think this is why model merging/frankensteining has become popular: it’s an expression of the limitations of LoRA training.

    Personally, I have high hopes for a larger Mistral model (in the 13-20B range) that we can still do a full fine-tune on. Right now, between my own specific tunes of Mistral and some of the recent external tunes like Starling, I feel like I’m close to having the tools I want/need. But Mistral is still 7B. It doesn’t matter how well it’s tuned, it will still get a little muddled at times, particularly with longer-term dependencies.

    • Vilzuh@alien.top
      1 year ago

      I have been trying to learn about fine-tuning and LoRA training for the past couple of weeks, but I’m having trouble finding resources that are easy enough to learn from. Could you give me some pointers to what I can read to get started with finetuning llama2 or Mistral?

      I have tried training quantized models locally with oobabooga and llama.cpp and I also have access to runpod. Really appreciate any info!

    • Armym@alien.top
      1 year ago

      Do you think that finetuning models with more parameters requires more data to actually do something?

      • thereisonlythedance@alien.top
        1 year ago

        With a full finetune I don’t think so – the LIMA paper showed that 1000 high quality samples are enough with a 65B model. With QLoRA and LoRA, I don’t know. The number of parameters you’re affecting is set by the rank you choose. It’s important to get the balance between the rank, dataset size, and learning rate right. Style and structure are easy to impart, but other things not so much. I often wonder how clean the merge process actually is. I’m still learning.
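To make the point about rank concrete, here is a rough sketch of how the trainable parameter count of a LoRA adapter scales with rank for a single weight matrix. It assumes the standard LoRA setup of two low-rank factors per adapted matrix; the 4096 x 4096 shape and the function names are illustrative assumptions, not taken from any model discussed in this thread.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns two factors, A (rank x d_in) and B (d_out x rank),
    so the count is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated by a full finetune of the same matrix."""
    return d_in * d_out


# Illustrative (assumed) shape for one square projection matrix.
d_in = d_out = 4096

for rank in (8, 64, 256):
    lora = lora_trainable_params(d_in, d_out, rank)
    fraction = lora / full_params(d_in, d_out)
    print(f"rank {rank}: {lora:,} trainable params "
          f"({fraction:.2%} of the full matrix)")
```

Even at fairly high ranks the adapter touches only a small share of what a full finetune updates, which is one plausible reading of why the two approaches behave differently. It does not by itself say how much data either one needs.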

  • __JockY__@alien.top
    1 year ago

    It took 3,311,616 GPU hours of training for the llama2 70b base model. At $1/hour for an A100 GPU you’d spend just over $3M, and on a single GPU it would take approximately 380 years to train the model.

    Scale that across 10,000 GPUs and you’re looking at 2 weeks and a couple of million dollars.

    Fine tune training is much, much faster and cheaper.
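The arithmetic behind those numbers can be sketched as below. This is a back-of-the-envelope estimate only: it assumes perfect linear scaling across GPUs and a flat hourly rate, neither of which holds exactly in practice. The function name is made up for illustration. The second figure is the roughly 1.7 million GPU hours that also comes up in this thread.

```python
def training_estimate(gpu_hours: float, usd_per_gpu_hour: float, num_gpus: int):
    """Return (total cost in USD, wall-clock days).

    Assumes perfect scaling: wall-clock time is GPU hours divided
    evenly across all GPUs, and cost does not depend on GPU count.
    """
    cost = gpu_hours * usd_per_gpu_hour
    wall_clock_days = gpu_hours / num_gpus / 24
    return cost, wall_clock_days


RATE = 1.00  # USD per GPU hour, the hypothetical rate used above

for gpu_hours in (3_311_616, 1_700_000):
    cost, days_single = training_estimate(gpu_hours, RATE, num_gpus=1)
    _, days_cluster = training_estimate(gpu_hours, RATE, num_gpus=10_000)
    print(f"{gpu_hours:,} GPU hours: ${cost:,.0f}, "
          f"{days_single / 365:.0f} years on one GPU, "
          f"{days_cluster:.1f} days on 10,000 GPUs")
```

Changing `RATE` shows how sensitive the total is to the hourly price: the cost scales linearly with it, while the wall-clock time depends only on the GPU count.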

      • toothpastespiders@alien.top
        1 year ago

        I’d like to know too if there’s one for exactly $1. Even half a buck or so difference builds up over time.

        But runpod’s close at least, at $1.69/hour.

      • __JockY__@alien.top
        1 year ago

        Yes, but you don’t have Meta’s purchasing power to rent 10,000 GPUs for a month. Economies of scale, my friend!

    • __JockY__@alien.top
      1 year ago

      I’ll reply to myself!

      It’s not just about GPU expense. You need a small team of ML data scientists. You need access to (or a way to scrape/generate) a mind-bogglingly broad dataset. You need to clean, normalize, and prepare the dataset. All of this takes a huge amount of expertise, time and money. I wouldn’t be at all surprised if the auxiliary costs surpassed the GPU rental cost.

      So the main answer to your question “Why is no one releasing 70b models?” is: it’s really, really, really expensive. Other parts of the answer are: lack of expertise, difficulty of generating a good dataset, and probably a hundred things I haven’t thought of.

      But mainly it just comes down to cost. I bet you wouldn’t see any change from $5,000,000 if you wanted to make your own new 70b base model.

  • arekku255@alien.top
    1 year ago

    No point in releasing a model that hardly anyone can run.

    13B and 7B can be run by the majority of users, 70B not so much…
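A rough way to see why is to estimate the memory needed just to hold the weights at a given quantization level. The sketch below is a lower bound under simple assumptions: it ignores the KV cache, activations and runtime overhead, and real quantization formats usually store slightly more than their nominal bits per weight. The function name is hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9.
    """
    return params_billion * bits_per_weight / 8


for size in (7, 13, 70):
    row = ", ".join(
        f"{bits}-bit: {weight_memory_gb(size, bits):.1f} GB"
        for bits in (16, 8, 4)
    )
    print(f"{size}B -> {row}")
```

Comparing the results against the VRAM of a typical consumer GPU gives a feel for which sizes fit on a single card and which need multiple GPUs or CPU offloading.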

  • WaterPecker@alien.top
    1 year ago

    Who pays for all this training on all these models we see knocking about? And I don’t mean the ones released by the big companies. Like, who has the resources to train a 70b model? One of the guys below said 1.7 million GPU hours, for example. That’s pretty friggin expensive, no?

  • ChiefBigFeather@alien.top
    1 year ago

    13b models magically being better than 70b models is a myth. Most of the 7b or 13b model headlines are just clickbait, with the models doing well on benchmarks because they were trained on benchmark data.

    Try Airo 70b 3.1.2, it is much, much better (for general purposes) than 99% of models out there. Yi based models are strong if you want the larger context.