New Microsoft codediffusion paper suggests GPT-3.5 Turbo is only 20B, good news for open source models?

obvithrowaway34434@alien.top · 2 years ago

New Microsoft codediffusion paper suggests GPT-3.5 Turbo is only 20B, good news for open source models?

FairSum@alien.top · 2 years ago

The main question is why they would price it so far below Davinci, which is 175B.

There’s still a lot of room for models to be trained on more data. Take a look at the Llama papers: at the time training was stopped, the loss was still going down. Mistral is on par with anything from Llama 2 13B to Llama 1 30B, and it’s a measly 7B model. If GPT-4 truly has a dataset of 13T tokens, the scaling law equations from the Chinchilla paper predict that a 20B model trained on 13T tokens would reach a lower loss than a 70B model trained on 2T tokens. Llama 1 already showed that a 7B model could outperform previous open source models (GPT-J-6B, Fairseq-13B, GPT-NeoX-20B, OPT-66B) just by virtue of training on more data, and that’s the reason the Llamas are so good to begin with.
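To make the Chinchilla comparison concrete, here is a rough sketch of the arithmetic. It assumes the parametric loss fit from the Chinchilla paper (Hoffmann et al., 2022), L(N, D) = E + A/N^alpha + B/D^beta, with the constants reported there; the function and variable names are my own, and the 20B/13T and 70B/2T figures are just the hypothetical numbers from above, not confirmed specs for any real model.

```python
# Chinchilla parametric loss fit: L(N, D) = E + A / N**alpha + B / D**beta
# Constants are the fitted values reported in Hoffmann et al. (2022).
# They were fit on that paper's data and setup, so treat the outputs as
# a rough guide rather than a prediction for any specific model.
E = 1.69        # irreducible loss term
A = 406.4       # coefficient on the parameter-count term
B = 410.7       # coefficient on the token-count term
ALPHA = 0.34    # exponent on parameter count
BETA = 0.28     # exponent on token count


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Estimated loss for a model with n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


# Hypothetical configurations taken from the comment above.
loss_20b_13t = chinchilla_loss(20e9, 13e12)
loss_70b_2t = chinchilla_loss(70e9, 2e12)

print(f"20B params, 13T tokens: {loss_20b_13t:.3f}")
print(f"70B params,  2T tokens: {loss_70b_2t:.3f}")
```

Under this fit the 20B/13T configuration comes out slightly ahead (lower estimated loss) of the 70B/2T one, because the extra tokens shrink the data term by more than the smaller parameter count grows the model term.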

Model size is important, sure, but plenty of other things matter when it comes to training a good model.