Yeah, I’m saying that ChatGPT outputs are contained in internet posts from 2023, so simply training on 2023 internet data would end up training on ChatGPT data as a side effect.
Even if it can get just halfway between GPT-3.5 and GPT-4… that would be big in my opinion.
You gloss over “MoE just helps with FLOPS issues” as if that’s not a hugely important factor.
So many people have 16 or 24GB GPUs, or even 64GB+ MacBooks, that aren’t being fully utilized.
Sure, people can load a 30B Q5 model into their 24GB GPU, or a 70B Q5 model into the 48GB+ of memory in a MacBook, but the main reason we don’t is that it’s so much slower, because it takes so many more FLOPS…
People are definitely willing to sacrifice vram for speed and that’s what MoE allows you to do.
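For rough context, here’s a back-of-the-envelope check of those quantized footprints (assuming roughly 5.5 bits per weight for Q5-style quants; that figure is an approximation and the numbers ignore KV cache and runtime overhead):

```python
# Rough Q5-style memory footprints, assuming ~5.5 bits per weight.
# This is an approximation for illustration only; KV cache and
# runtime overhead are not counted.
BITS_PER_WEIGHT = 5.5

def q5_gigabytes(params_billion: float) -> float:
    total_bytes = params_billion * 1e9 * BITS_PER_WEIGHT / 8
    return total_bytes / 1e9

for size in (30, 70, 100):
    print(f"{size}B @ Q5 ≈ {q5_gigabytes(size):.0f} GB")
# 30B @ Q5 ≈ 21 GB   -> tight fit on a 24GB GPU
# 70B @ Q5 ≈ 48 GB   -> needs 48GB+ of unified memory
# 100B @ Q5 ≈ 69 GB  -> fits in a 96GB MacBook Pro
```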
You can have a 100B-parameter MoE with 16 sub-networks loaded comfortably into a MacBook Pro with 96GB of memory at Q5, with the most useful 4 sub-networks (25B params) activated for any given token.
Done right, this would benchmark significantly higher than current 33B dense models and act much smarter than a 33B model, while running at around the same speed as a 33B model.
It’s all-around more smarts for the same speed, and the only downside is that it uses the extra VRAM you probably weren’t using before anyway.
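To make the routing concrete, here’s a minimal sketch of top-k expert routing in PyTorch (the class name, layer sizes, and expert count are illustrative assumptions, not any particular model’s architecture): all 16 expert MLPs stay resident in memory, but each token only runs through the 4 the router picks, so per-token FLOPs look like a ~25B model even though 100B parameters are loaded.

```python
# Minimal sketch of top-k expert routing (illustrative sizes, not any real
# model's architecture). All num_experts expert MLPs stay loaded in memory,
# but each token is only processed by the top_k experts the router selects,
# so per-token compute scales with top_k/num_experts of the expert parameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=16, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                          # x: (num_tokens, d_model)
        scores = self.router(x)                    # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # mix only the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e           # tokens routed to expert e in this slot
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

tokens = torch.randn(8, 512)
print(TopKMoE()(tokens).shape)                     # torch.Size([8, 512])
```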
Mistral 7B fine-tunes are already reaching parity with GPT-3.5 on most benchmarks.
I’d be very surprised if Llama-3 70B fine tunes don’t significantly outperform GPT-3.5 in nearly every metric.
Its referring to itself as a GPT could just come from the pre-training data, if it was trained on internet data from 2023.
So far I’ve only benchmarked HellaSwag and ARC-Challenge, but it’s significantly beating both WizardLM-13B and GPT4-X-Vicuna-13B on both benchmarks! These are not the latest SOTA models of course, but it’s amazing to see this 3B model surpassing the best 13B models of just 6 months ago.
I’ll see if we can have it benchmarked officially on the HF leaderboard this week so people can see how it compares with the latest models.
I can almost guarantee you that Capybara 3B and Obsidian 3B will perform significantly better than Orca Mini. The base model I’m using for the 3B training is the much newer StableLM 3B, trained on 4 trillion tokens, while Orca Mini’s base model is OpenLLaMA 3B, which was only trained on around 1-2 trillion tokens and performs significantly worse.
Predicting the loss is very different from predicting real-world abilities; they are able to do the former, not the latter.
Predicting the future loss once you’re already 10% into training is fairly trivial. Predicting the actual abilities, though, is not.
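As a toy illustration of why the loss side is the easier part (the power-law-plus-constant form and the synthetic training log below are assumptions for illustration, not any lab’s actual methodology): fitting the curve on just the first 10% of steps already extrapolates the final loss closely, while nothing in that fit tells you which downstream abilities will emerge.

```python
# Toy illustration: fit a power-law-plus-constant loss curve to the first 10%
# of a synthetic training run and extrapolate the final loss. The functional
# form and the fake "training log" are assumptions for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def loss_curve(steps, a, b, c):
    return a * steps ** (-b) + c

rng = np.random.default_rng(0)
steps = np.arange(1, 100_001)
true_loss = loss_curve(steps, a=8.0, b=0.3, c=1.8)
observed = true_loss + rng.normal(0, 0.01, steps.shape)   # noisy "training log"

early = steps <= 10_000                                    # first 10% of training
params, _ = curve_fit(loss_curve, steps[early], observed[early], p0=(5, 0.5, 1))

print(f"predicted final loss: {loss_curve(steps[-1], *params):.3f}")
print(f"actual final loss:    {true_loss[-1]:.3f}")
```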