I guess the question is: what order of magnitude are we talking about when it comes to needing to step up to more parameters? I understand it's measured in billions of parameters, and that they're basically the weights learned from the data the model was trained on, used to predict words (I think of it as a big weight map), so you can expect "sharp sword" more often than "aspirin sword."
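That "weight map" intuition can be sketched with a toy bigram counter. This is only an illustration with a made-up mini-corpus; real LLMs learn dense vectors over tokens, not literal co-occurrence counts:

```python
from collections import Counter

# Toy "weight map": bigram counts from a tiny hypothetical corpus
# stand in for learned weights over word pairs.
corpus = "the sharp sword cut deep . the sharp sword gleamed . he took aspirin".split()
bigrams = Counter(zip(corpus, corpus[1:]))

# "sharp" is followed by "sword" twice in this corpus; "aspirin" never is,
# so a model trained on it would prefer "sharp sword" over "aspirin sword".
print(bigrams[("sharp", "sword")])    # 2
print(bigrams[("aspirin", "sword")])  # 0
```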

Is there a limit to the amount of data used to train the model, to the point that you'll hit a plateau? Like, I imagine training against Shakespeare would be harder than Poe because of all the made-up words Shakespeare uses. I'd probably train Shakespeare with his works + wikis and discussions of his work.

I know that's kind of all over the place; I'm fumbling at the topic, trying to get a grasp so I can start prying it open.

  • lordpuddingcup@alien.topB · 1 year ago

You pick the biggest one; it's almost always the best unless it was a truly shitty trained model. A really well trained 30B with a 120B version? The 120B will be better, unless by "can run them all" you mean you can run a full-quant 7B and a q1_k_m 120B, lol.

  • CKtalon@alien.topB · 1 year ago

No one has figured out where the plateau is yet, since more data = longer training = more expensive. Currently it seems like you can keep training on more data. Companies are pretty much training on 'all of the internet' to get the LLM 'cleverness'. Not just Shakespeare.

About deciding the size of the model: there's the Chinchilla scaling law, which gives the compute-optimal point for a given compute budget, i.e., 2T tokens on a 7B vs 0.5T on a 13B, where the former would be better (made-up numbers). There's also the consideration of the cost of serving the model on top of the training cost, and the accuracy required.
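The trade-off above can be sketched with the common rules of thumb from the Chinchilla work: training compute scales roughly as 6 × parameters × tokens, and the compute-optimal mix is roughly ~20 tokens per parameter. The numbers below are illustrative back-of-envelope math, not exact figures:

```python
# Back-of-envelope sketch of the Chinchilla trade-off.
# Rules of thumb (approximate): train FLOPs ~ 6 * N * D, optimal D ~ 20 * N,
# where N = parameter count and D = training tokens.

def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs for a dense transformer."""
    return 6.0 * n_params * n_tokens

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Compute-optimal token count under the ~20 tokens/param heuristic."""
    return 20.0 * n_params

# Budget: what it costs to train a 13B on 0.5T tokens.
budget = train_flops(13e9, 0.5e12)

# The same budget spent on a 7B buys far more tokens per parameter:
tokens_for_7b = budget / (6.0 * 7e9)
print(f"7B trained on {tokens_for_7b / 1e12:.2f}T tokens for the same budget")

# Both are still above the "optimal" point for a 7B (~0.14T tokens),
# so extra data keeps helping; it's just less compute-efficient.
print(f"Chinchilla-optimal for 7B: {chinchilla_optimal_tokens(7e9) / 1e12:.2f}T tokens")
```

In practice many released models are trained well past the Chinchilla-optimal token count, because a smaller model trained longer is cheaper to serve.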

    • creaturefeature16@alien.topB · 1 year ago

I have a similar question as OP. What if you wanted to train a model specifically on coding? And even more specifically on, say, just a particular library?

      • CKtalon@alien.topB · 1 year ago

You are probably talking about fine-tuning rather than (pre)training a model. There are models that were trained for coding, like CodeLlama and all its variants. You could probably train on the library's code, but I doubt you'd get much out of it. Perhaps the best way is to create some instruction data based on the library (either manually or synthetically) and fine-tune on that.
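That instruction data usually ends up as JSONL records of instruction/output pairs. A minimal sketch, assuming a hypothetical library called `somelib` and the common Alpaca-style field names (your fine-tuning framework may expect different keys):

```python
import json

# Hypothetical instruction-tuning pairs for a made-up library "somelib".
# Each record pairs a question a user might ask with the answer the
# fine-tuned model should learn to give.
examples = [
    {
        "instruction": "How do I open a connection with the somelib library?",
        "output": "Call somelib.connect(host, port); it returns a Connection object.",
    },
    {
        "instruction": "Write a snippet that closes a somelib connection safely.",
        "output": "Wrap usage in try/finally and call conn.close() in the finally block.",
    },
]

# Serialize as JSONL: one JSON object per line, the usual fine-tuning format.
jsonl = "\n".join(json.dumps(ex) for ex in examples)
print(jsonl)
```

Synthetic generation just means having a stronger model draft these pairs from the library's docs, then filtering them by hand.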

        • paradigm11235@alien.topOPB · 1 year ago

I'm glad I goofed in my question, because your response was super helpful, but I now realize I was missing the terminology when I posted. I was talking about fine-tuning an existing model with a specific goal in mind (re: poetry).

  • ThinkExtension2328@alien.topB · 1 year ago

The biggest one you can run at a usable rate. The larger models tend to have more nuance; granted, some new models are challenging this notion, but that's the general way to go about it.

  • rvitor@alien.topB · 1 year ago

For training, it's sometimes better to pick a small model to run some tests and get faster feedback; then you can train a larger model if you want to and see how it goes.