• FairSum@alien.topB

    The main question is why price it so far below Davinci level, which is 175B?

    There’s still a lot of room for models to be trained on more data. Take a look at the Llama papers: at the point training was stopped, the loss was still going down. Mistral is on par with Llama 2 13B to Llama 1 30B, and it’s a measly 7B model. If GPT-4 really was trained on a dataset of ~13T tokens, the scaling law equations from the Chinchilla paper suggest that a 20B model trained on 13T tokens would reach a lower loss than a 70B model trained on 2T tokens (a rough sketch of that calculation is below). Llama 1 already showed that a 7B model could outperform earlier open source models (GPT-J-6B, Fairseq-13B, GPT-NeoX-20B, OPT-66B) just by virtue of training on more data, and that’s the reason the Llamas are so good in the first place.
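
    To make the 20B-vs-70B comparison concrete, here’s a minimal sketch plugging the numbers into the Chinchilla parametric loss fit, L(N, D) = E + A/N^α + B/D^β, assuming the fitted constants reported by Hoffmann et al. (2022) (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28). The exact values depend on the fit, but the comparison comes out the same way:

    ```python
    # Chinchilla-style loss prediction: L(N, D) = E + A / N**alpha + B / D**beta
    # Constants are the fitted values from the Chinchilla paper; treat them as
    # approximate. N = parameter count, D = training tokens.
    E, A, B = 1.69, 406.4, 410.7
    alpha, beta = 0.34, 0.28

    def chinchilla_loss(n_params: float, n_tokens: float) -> float:
        """Predicted pretraining loss for a model of n_params trained on n_tokens."""
        return E + A / n_params**alpha + B / n_tokens**beta

    print(chinchilla_loss(20e9, 13e12))  # ~1.91  (20B params, 13T tokens)
    print(chinchilla_loss(70e9, 2e12))   # ~1.92  (70B params,  2T tokens)
    ```

    Under that fit, the smaller model trained on far more tokens ends up with a slightly lower predicted loss than the 70B/2T setup.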

    Model size matters, sure, but it’s far from the only thing that matters when it comes to training a good model.