There has been a lot of movement around and below the 13b parameter bracket in the last few months but it’s wild to think the best 70b models are still llama2 based. Why is that?
We have 13b models like 8bit bartowski/Orca-2-13b-exl2 approaching or even surpassing the best 70b models now
Look at the market share of video cards with more than 100GB of Vram.
Mistral has already shown that it’s mostly about the data rather than the model. So why waste loads of money and time on training something that no average consumer can run locally?
What do you mean? Someone just posted 100b, 200b and 600b models, and several 120b models have been released in the past couple of weeks.
Those models can’t be accessed, they say it’s “too dangerous to be released”
Google just released a 1.8T model that’s partially trained. Would need a ton of H100s though just to run it, forget training it lol.
The problem with 70B is that it is incrementally better than smaller models, but is still nowhere near competitive with GPT-4, so it is stuck in no man’s land.
Once we finally get an open source model or architecture that can spar even with GPT-4, let alone 5, there will be much more interest in large models.
Regarding Falcon Chat 180B, it’s no better in my tests and for my use cases than fine tuned Llama 2 70B, which is a shame. It makes me think that there is something fundamentally wrong with Falcon, besides the laughably small context window.
It’s adorable that you think any 13b model is anywhere close to a 70b llama2 model.
Oooh! Model fight! I’ll try it out and post results later.
Diminishing returns and cost of compute.
If people saw better returns from larger models, there would be more.
Qwen 72b is coming in 2 days 👍 Will be a real beast.
I heard, if it comes out then finally it might be worth exllama supporting it. I heard the 14b was fairly strong.
Yes, I also hope it gets exllamav2 support. Here is an issue regarding it: (Qwen model not supported) · Issue #160 · turboderp/exllamav2 (github.com)
Qwen 72b
I can’t seem to find anything about qwen 72b except two tweets from a month ago that said it was coming out. who makes it? what’s it trained on? any details?
Curiously, none of the previous comment’s upvoters has provided an answer to your question.
2 days? Bro if they said November and haven’t released it by now, it’s not two days.
I’ve been training a lot lately, mostly on RunPod, a mix of fine-tuning Mistral 7B and training LoRAs and QLoRAs on 34Bs and 70Bs. My main takeaway is that the LoRA outcomes are just… not so great. Whereas I’m very happy with the Mistral fine-tunes.
I mean, it’s fantastic we can tinker with a 70B at all, but it doesn’t matter how good your dataset is, you just can’t have the same impact as you can with a full finetune. I think this is why model merging/frankensteining has become popular, it’s an expression of the limitations of LoRA training.
Personally, I have high hopes for a larger Mistral model (in the 13-20B range) that we can still do a full fine-tune on. Right now, between my own specific tunes of Mistral and some of the recent external tunes like Starling, I feel like I’m close to having the tools I want/need. But Mistral is still 7B; it doesn’t matter how well it’s tuned, it will still get a little muddled at times, particularly with longer-term dependencies.
I have been trying to learn about fine-tuning and lora training for the past couple weeks but I’m having trouble finding easy enough resources to learn from. Could you give me some pointers to what I can read to get started with finetuning llama2 or mistral?
I have tried training quantized models locally with oobabooga and llama.cpp and I also have access to runpod. Really appreciate any info!
Do you think that finetuning models with more parameters requires more data to actually do something?
With a full finetune I don’t think so: the LIMA paper showed that 1000 high quality samples is enough with a 65B model. With QLoRA and LoRA, I don’t know. The number of parameters you’re affecting is set by the rank you choose. It’s important to get the balance between the rank, dataset size, and learning rate right. Style and structure are easy to impart, but other things not so much. I often wonder how clean the merge process actually is. I’m still learning.
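To put rough numbers on "the number of parameters you’re affecting is set by the rank", here is a minimal sketch. It only uses the standard LoRA count (a d_in x d_out weight gets two low-rank factors, so rank * (d_in + d_out) extra parameters). The layer count, the projection shapes, and the choice to adapt only q_proj and v_proj are assumptions for a 70B-class model, not a description of anyone’s actual training config.

```python
def lora_trainable_params(shapes, rank, num_layers):
    """Count LoRA parameters for a stack of identical layers.

    Each adapted (d_in, d_out) weight gets factors A (d_in x rank) and
    B (rank x d_out), so it adds rank * (d_in + d_out) trainable values.
    """
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers


# ASSUMPTIONS (illustrative only): 80 layers, hidden size 8192,
# adapting q_proj (8192 -> 8192) and v_proj (8192 -> 1024) in every layer.
SHAPES = [(8192, 8192), (8192, 1024)]
NUM_LAYERS = 80
TOTAL_PARAMS = 70e9  # nominal size of the base model

for rank in (8, 64, 256):
    n = lora_trainable_params(SHAPES, rank, NUM_LAYERS)
    share = n / TOTAL_PARAMS
    print(f"rank={rank:>3}: {n:,} trainable params ({share:.4%} of the base model)")
```

The count grows linearly with rank, and under these assumptions even a high rank touches a small fraction of the base weights, which may be part of why the impact feels limited compared with a full finetune.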
It took 3,311,616 GPU hours to train the whole llama2 family (1,720,320 of those for the 70b base model alone). At $1/hour for an A100 GPU you’d spend just over $3.3M on the full run, and on a single GPU it would take approximately 380 years.
Scale that across 10,000 GPUs and you’re looking at 2 weeks and a couple of million dollars.
Fine tune training is much, much faster and cheaper.
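If you want to play with the numbers yourself, here is a rough back-of-the-envelope sketch. It assumes perfect scaling across GPUs (no communication overhead, no failed runs), which real training never gets, and the hourly prices are just the ones quoted in this thread, not current market rates.

```python
HOURS_PER_YEAR = 24 * 365


def training_estimate(gpu_hours, num_gpus, usd_per_gpu_hour):
    """Return (wall-clock days, total cost in USD), assuming perfect scaling."""
    wall_clock_days = gpu_hours / num_gpus / 24
    cost_usd = gpu_hours * usd_per_gpu_hour
    return wall_clock_days, cost_usd


GPU_HOURS = 3_311_616  # the GPU-hour figure quoted above

print(f"single GPU: about {GPU_HOURS / HOURS_PER_YEAR:.0f} years")

# Hourly prices are the ones mentioned in this thread (assumptions, not quotes).
for price in (1.00, 1.69, 4.00):
    days, cost = training_estimate(GPU_HOURS, 10_000, price)
    print(f"10,000 GPUs at ${price:.2f}/hr: {days:.1f} days, ${cost:,.0f}")
```

The wall-clock time shrinks with more GPUs, but the bill only depends on total GPU hours and the hourly rate, so the price per hour is the number that really matters.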
How much would that be in H100s or H200s?
About tree fiddy
A bushel.
$1/hour for an A100? Where? I can barely get one in GCE and it’s almost $4/hr.
I’d like to know too if there’s one for exactly $1. Even half a buck or so difference builds up over time.
But runpod’s close at least, at $1.69/hour.
Yes, but you don’t have Meta’s purchasing power to rent 10,000 GPUs for a month. Economies of scale, my friend!
I’ll reply to myself!
It’s not just about GPU expense. You need a small team of ML data scientists. You need access to (or a way to scrape/generate) a mind-bogglingly broad dataset. You need to clean, normalize, and prepare the dataset. All of this takes a huge amount of expertise, time and money. I wouldn’t be at all surprised if the auxiliary costs surpassed the GPU rental cost.
So the main answer to your question “Why is no one releasing 70b models?” is: it’s really, really, really expensive. Other parts of the answer are: lack of expertise, difficulty of generating a good dataset, and probably a hundred things I haven’t thought of.
But mainly it just comes down to cost. I bet you wouldn’t see any change from $5,000,000 if you wanted to make your own new 70b base model.
No point to release a model that hardly anyone can run.
13B and 7B can be run by the majority of users, 70B not so much…
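A quick sketch of why that is, counting only the memory needed to hold the weights (parameters times bits per weight). It ignores the KV cache, activations and framework overhead, so treat the results as lower bounds, and the parameter counts are just the nominal model sizes.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts; real checkpoints differ slightly.
MODELS = (("7B", 7e9), ("13B", 13e9), ("70B", 70e9))

for name, num_params in MODELS:
    row = ", ".join(
        f"{bits}-bit: {weight_memory_gb(num_params, bits):.1f} GB"
        for bits in (16, 8, 4)
    )
    print(f"{name}: {row}")
```

Compare the output with the VRAM on your own card: the smaller models fit on a single consumer GPU once quantized, while a 70B needs tens of GB even at 4 bits.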
Who pays for all this training on all these models we see knocking about? And I don’t mean the ones released by the big companies. Like, who has the resources to train a 70b model? Like one of the guys below said, 1.7 million GPU hours for example, that’s pretty friggin expensive, no?
You need at least 4 A100 for inference
13b models magically being better than 70b models is a myth. Most of the 7b or 13b model headlines are just clickbait, the models being good at benchmarks because they were trained on benchmark data.
Try Airo 70b 3.1.2, it is much, much better (for general purposes) than 99% of models out there. Yi based models are strong if you want the larger context.
Orca still memeing strong.