When training an LLM how do you decide to use a 7b, 30b, 120b, etc model (assuming you can run them all)?

paradigm11235@alien.top · 2 years ago

When training an LLM how do you decide to use a 7b, 30b, 120b, etc model (assuming you can run them all)?

CKtalon@alien.top · 2 years ago

No one has found the plateau yet, since more data means longer training, which means more expense. Currently it seems like you can keep training with more data and keep improving. Companies are pretty much training on ‘all of the internet’ data to get the LLM ‘cleverness’. Not just Shakespeare.

As for deciding the size of the model, there is the Chinchilla scaling law, which gives the compute-optimal trade-off between model size and training tokens for a given compute budget, e.g. 2T tokens on a 7B vs 0.5T tokens on a 13B, the former would be better (made-up numbers). There is also the cost of serving the model to consider, together with the training cost and the accuracy required.
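To make the Chinchilla idea concrete, here is a rough sketch, not the paper's fitted law. It assumes two commonly quoted approximations: training compute is about 6 × parameters × tokens FLOPs, and the compute-optimal ratio is about 20 training tokens per parameter. The function names and the example budget are just for illustration.

```python
import math


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, assuming C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a compute budget into (params, tokens).

    Assumes D = r * N with r ~= 20 (a rule of thumb, not an exact constant),
    so C = 6 * N * (r * N) and therefore N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Example budget: the compute it would take to train a 7B model on 2T tokens.
budget = training_flops(7e9, 2e12)
n, d = chinchilla_optimal(budget)
print(f"budget: {budget:.2e} FLOPs")
print(f"compute-optimal: {n / 1e9:.1f}B params on {d / 1e9:.0f}B tokens")
```

Under these assumptions the same budget would be "optimally" spent on a model of roughly 26B parameters trained on roughly 0.5T tokens. In practice people often train smaller models for longer than this suggests, precisely because of the serving cost mentioned above.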

creaturefeature16@alien.top · 2 years ago

I have a similar question as OP. What if you wanted to train a model specifically on coding? And even more specifically in say, just a particular library?

CKtalon@alien.top · 2 years ago

You are probably talking about fine-tuning rather than (pre)training a model. There are models that were trained for coding, like codellama and all its variants. You could probably train on the library’s code, but I doubt you will get much out of it. Perhaps the best way is to create some instruction data based on the library (either manually or synthetically) and fine-tune on that.
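For a sense of what "instruction data" could look like, here is a minimal sketch that builds a few records and serializes them as JSONL. The `instruction` / `input` / `output` field names follow the Alpaca-style convention, which is an assumption here; your fine-tuning tool may expect a different schema. The library `mylib` and its functions are hypothetical placeholders.

```python
import json


def make_record(instruction: str, output: str, context: str = "") -> dict:
    """One instruction-tuning example (Alpaca-style keys, assumed schema)."""
    return {"instruction": instruction, "input": context, "output": output}


# Hand-written examples about a hypothetical library called `mylib`.
records = [
    make_record(
        "How do I open a connection with mylib?",
        "Call `mylib.connect(url)` and keep the returned handle.",
    ),
    make_record(
        "Fix the bug in this snippet.",
        "`mylib.connect` needs a URL argument: `mylib.connect(url)`.",
        context="conn = mylib.connect()",
    ),
]


def to_jsonl(rows: list) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


jsonl_text = to_jsonl(records)
print(jsonl_text)
```

Synthetic data would follow the same shape, with the records generated by another model from the library's docs or source rather than written by hand.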

paradigm11235@alien.top · 2 years ago

I’m glad I goofed in my question because your response was super helpful, but I now realize I was missing the terminology when I posted. I was talking about fine-tuning an existing model with a specific goal in mind (re: poetry).