Wondering what everyone thinks if this is true. It seems they’re already beating all open-source models, including Llama-2 70B. Is this all due to data quality? Will Mistral be able to beat it next year?
Edit: Link to the paper -> https://arxiv.org/abs/2310.17680
The main question is: why price it so far below Davinci, which is 175B?
There’s still a lot of room for models to be trained on more data. Take a look at the Llama papers: the loss was still going down when training was stopped. Mistral is on par with Llama 2 13B to Llama 1 30B, and it’s a measly 7B model. If GPT-4 truly has a dataset of 13T tokens, the scaling law equations from the Chinchilla paper predict that a 20B model trained on 13T tokens would reach a lower loss than a 70B model trained on 2T tokens. Llama 1 already showed that a 7B model could outperform previous open-source models (GPT-J-6B, Fairseq-13B, GPT-NeoX-20B, OPT-66B) just by virtue of training on more data, and that’s the reason the Llamas are so good to begin with.
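For anyone who wants to check the Chinchilla claim, here is a rough sketch that plugs both configurations into the paper's parametric loss fit, L(N, D) = E + A/N^alpha + B/D^beta. The constants are the fitted values reported by Hoffmann et al. (2022); the function name and the two configurations are just illustrative, and 13T tokens is well outside the range the fit was made on, so treat the numbers as a ballpark and not a prediction.

```python
# Chinchilla parametric loss fit: L(N, D) = E + A / N**alpha + B / D**beta
# Constants are the published fit from Hoffmann et al. (2022). Using them at
# 13T tokens is an extrapolation (assumption), so this is only a rough guide.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pretraining loss for n_params parameters and n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


# Hypothetical configurations taken from the comment above.
small_long = chinchilla_loss(20e9, 13e12)  # 20B params, 13T tokens
big_short = chinchilla_loss(70e9, 2e12)    # 70B params, 2T tokens

print(f"20B @ 13T tokens: {small_long:.3f}")
print(f"70B @ 2T tokens:  {big_short:.3f}")
```

Under this fit the 20B run comes out slightly lower (roughly 1.91 vs 1.92). The gap is small, and the fit says nothing about data quality or downstream benchmarks, only about predicted loss.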
Model size is important, sure, but a lot of other things matter too when it comes to training a good model.