I guess the question is: at what scale do you need to step up to more parameters? I understand it’s measured in billions of parameters, and that they’re basically the weights learned from the training data and used to predict words (I think of it as a big weight map), so you can expect “sharp sword” more often than “aspirin sword.”
Is there a limit on how much training data helps before you hit a plateau? Like, I imagine training on Shakespeare would be harder than Poe because of all the made-up words Shakespeare uses. I’d probably train on Shakespeare’s works plus wikis and discussions of his work.
I know that’s kind of all over the place, I’m kind of fumbling at the topic trying to get a grasp so I can start prying it open.
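To make the “weight map” intuition concrete, here is a toy sketch that just counts which word follows which in a tiny made-up corpus. This is only an analogy: a real LLM learns dense weights over tokens rather than a lookup table of counts, and the corpus and function names below are invented for illustration.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, purely for illustration.
CORPUS = (
    "the knight drew a sharp sword . "
    "the sharp sword cut deep . "
    "she took an aspirin tablet ."
)


def train_bigrams(text):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_prob(counts, prev, nxt):
    """Estimate P(nxt | prev) from the raw counts (0.0 if prev was never seen)."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0


if __name__ == "__main__":
    counts = train_bigrams(CORPUS)
    print("P(sword | sharp)   =", next_word_prob(counts, "sharp", "sword"))
    print("P(sword | aspirin) =", next_word_prob(counts, "aspirin", "sword"))
```

In this corpus “sharp” is always followed by “sword” and “aspirin” never is, which is the same kind of preference a trained model ends up encoding, just at a vastly larger scale.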
You pick the biggest one. It’s almost always the best unless it was a truly shitty trained model. Given a really well trained 30b and a 120b version of it, the 120b will be better. Unless by “can run them all” you mean you can run a full-precision 7b and a q1_k_m 120b lol
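For a rough feel of what “can run them all” means in memory terms, here is a back-of-the-envelope sketch. It only counts the weights (parameters times bits per weight) and ignores the KV cache, activations and runtime overhead, so real usage will be higher. The bits-per-weight values are illustrative assumptions, not the exact sizes of any particular quant format.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    # (label, parameters in billions, assumed bits per weight)
    configs = [
        ("7b at 16-bit", 7, 16),
        ("30b at ~4-bit", 30, 4),
        ("120b at ~4-bit", 120, 4),
        ("120b at ~2-bit", 120, 2),
    ]
    for label, params, bits in configs:
        print(f"{label}: ~{weight_memory_gb(params, bits):.1f} GB for weights")
```

The point of the joke above still stands: a heavily squeezed 120b may fit in memory, but fitting is not the same as keeping the quality of the bigger model.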
No one has figured out the plateau yet as more data = longer training = more expensive. Currently it seems like you can keep training with more data. Companies are pretty much training on ‘all of the internet’ data to get the LLM ‘cleverness’. Not just Shakespeare.
On deciding the size of the model, there is the Chinchilla scaling law, which gives the compute-optimal model size and token count for a given compute budget. E.g. 2T tokens on a 7B vs 0.5T tokens on a 13B: the former would be better (made-up numbers). There are also the costs of serving the model to consider, alongside the training cost and the accuracy required.
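A minimal sketch of the Chinchilla idea, assuming two widely quoted rules of thumb: training compute is roughly C ≈ 6·N·D FLOPs (N parameters, D tokens), and the compute-optimal point sits at roughly 20 tokens per parameter. Both are approximations and the function names are made up for this example, so treat the output as ballpark only.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: C ~= 6 * N * D
TOKENS_PER_PARAM = 20.0      # assumption: D ~= 20 * N at the optimum


def training_flops(n_params, n_tokens):
    """Approximate training compute for N parameters and D tokens."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def compute_optimal(flops_budget):
    """Solve C = 6 * N * (20 * N) for N, then derive D = 20 * N."""
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Budgets implied by the two made-up configurations in the post.
    for n, d in [(7e9, 2e12), (13e9, 0.5e12)]:
        budget = training_flops(n, d)
        opt_n, opt_d = compute_optimal(budget)
        print(f"{n / 1e9:.0f}B on {d / 1e12:.1f}T tokens: {budget:.2e} FLOPs, "
              f"rule-of-thumb optimum ~{opt_n / 1e9:.1f}B params "
              f"on ~{opt_d / 1e12:.2f}T tokens")
```

Note that "compute optimal" only covers training cost. If the model will be served a lot, training a smaller model on more tokens than this suggests can still be the cheaper choice overall.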
I have a similar question as OP. What if you wanted to train a model specifically on coding? And even more specifically in say, just a particular library?
You are probably talking about fine-tuning rather than (pre)training a model. There are models that were trained for coding, like codellama and all its variants. You could probably train on the library’s code directly, but I doubt you will get much out of it. Perhaps the best way is to create some instruction data based on the library (either manually or synthetically) and fine-tune on that.
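As a rough idea of what “instruction data based on the library” could look like, here is a sketch that writes a few hand-made examples as JSONL. The library name `mylib` and its functions are hypothetical, and the `instruction` / `input` / `output` field names follow a common Alpaca-style convention; the exact format you need depends on the fine-tuning tool you use.

```python
import json

# Hand-written examples for a hypothetical library called "mylib".
EXAMPLES = [
    {
        "instruction": "Load a CSV file with mylib and print the first row.",
        "input": "",
        "output": "import mylib\n\ntable = mylib.read_csv('data.csv')\nprint(table.rows[0])",
    },
    {
        "instruction": "Explain what this mylib call does.",
        "input": "mylib.read_csv('data.csv', header=False)",
        "output": "It reads data.csv and treats the first line as data, not as column names.",
    },
]


def to_jsonl(examples):
    """Serialize examples as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)


if __name__ == "__main__":
    print(to_jsonl(EXAMPLES))
```

A few hundred to a few thousand examples like these, covering how the library is actually used, is usually a more useful starting point than dumping the raw source code into training.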
I’m glad I goofed in my question because your response was super helpful, but I now realize I was missing the terminology when I posted. I was talking about fine-tuning an existing model with a specific goal in mind (re: poetry).
Biggest one you can run at a usable rate. The larger models tend to have more nuance. Granted, some new models are challenging this notion, but that’s the general way to go about it.
For training, it’s sometimes better to pick a small model to do some tests and get faster feedback; then you can train a larger model if you want to and see how it goes.
Biggest one.
If I can run them all I will just pick the biggest one.