I guess the question is: what order of magnitude are we talking about when it comes to needing to step up to more parameters? I understand it's measured in billions of parameters, and that they're basically the weights learned from the data the model was trained on, used to predict words (I think of it as a big weight map), so you can expect "sharp sword" more often than "aspirin sword."
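That "weight map" intuition can be sketched with a toy bigram counter. This is only an illustration with a made-up mini-corpus; real LLMs learn dense vectors over tokens, not literal co-occurrence counts:

```python
from collections import Counter

# Toy "weight map": bigram counts from a tiny hypothetical corpus
# stand in for learned weights over word pairs.
corpus = "the sharp sword cut deep . the sharp sword gleamed . he took aspirin".split()
bigrams = Counter(zip(corpus, corpus[1:]))

# "sharp" is followed by "sword" twice in this corpus; "aspirin" never is,
# so a model trained on it would prefer "sharp sword" over "aspirin sword".
print(bigrams[("sharp", "sword")])    # 2
print(bigrams[("aspirin", "sword")])  # 0
```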

Is there a limit to the amount of data used to train the model, to the point that you'll hit a plateau? Like, I imagine training against Shakespeare would be harder than Poe because of all the made-up words Shakespeare uses. I'd probably train Shakespeare with his works + wikis and discussions of his work.

I know that's kind of all over the place; I'm fumbling at the topic, trying to get a grasp so I can start prying it open.

  • lordpuddingcup@alien.topB · 1 year ago

You pick the biggest one; it's almost always the best unless it was a truly shitty trained model. A really well trained 30B with a 120B version? The 120B will be better, unless by "can run them all" you mean you can run a full-quant 7B and a q1_k_m 120B, lol.

  • CKtalon@alien.topB · 1 year ago

No one has figured out where the plateau is yet, since more data = longer training = more expensive. Currently it seems like you can keep training on more data. Companies are pretty much training on 'all of the internet' to get the LLM 'cleverness'. Not just Shakespeare.

About deciding the size of the model: there's the Chinchilla scaling law, which gives the compute-optimal point for a given compute budget, i.e., 2T tokens on a 7B vs 0.5T on a 13B, where the former would be better (made-up numbers). There's also the consideration of the cost of serving the model on top of the training cost, and the accuracy required.
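The trade-off above can be sketched with the common rules of thumb from the Chinchilla work: training compute scales roughly as 6 × parameters × tokens, and the compute-optimal mix is roughly ~20 tokens per parameter. The numbers below are illustrative back-of-envelope math, not exact figures:

```python
# Back-of-envelope sketch of the Chinchilla trade-off.
# Rules of thumb (approximate): train FLOPs ~ 6 * N * D, optimal D ~ 20 * N,
# where N = parameter count and D = training tokens.

def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs for a dense transformer."""
    return 6.0 * n_params * n_tokens

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Compute-optimal token count under the ~20 tokens/param heuristic."""
    return 20.0 * n_params

# Budget: what it costs to train a 13B on 0.5T tokens.
budget = train_flops(13e9, 0.5e12)

# The same budget spent on a 7B buys far more tokens per parameter:
tokens_for_7b = budget / (6.0 * 7e9)
print(f"7B trained on {tokens_for_7b / 1e12:.2f}T tokens for the same budget")

# Both are still above the "optimal" point for a 7B (~0.14T tokens),
# so extra data keeps helping; it's just less compute-efficient.
print(f"Chinchilla-optimal for 7B: {chinchilla_optimal_tokens(7e9) / 1e12:.2f}T tokens")
```

In practice many released models are trained well past the Chinchilla-optimal token count, because a smaller model trained longer is cheaper to serve.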

    • creaturefeature16@alien.topB · 1 year ago

I have a similar question as OP. What if you wanted to train a model specifically on coding? And even more specifically on, say, just a particular library?

      • CKtalon@alien.topB · 1 year ago

You are probably talking about fine-tuning rather than (pre)training a model. There are models that were trained for coding, like CodeLlama and all its variants. You could probably train on the library's code, but I doubt you'd get much out of it. Perhaps the best way is to create some instruction data based on the library (either manually or synthetically) and fine-tune on that.
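That instruction data usually ends up as JSONL records of instruction/output pairs. A minimal sketch, assuming a hypothetical library called `somelib` and the common Alpaca-style field names (your fine-tuning framework may expect different keys):

```python
import json

# Hypothetical instruction-tuning pairs for a made-up library "somelib".
# Each record pairs a question a user might ask with the answer the
# fine-tuned model should learn to give.
examples = [
    {
        "instruction": "How do I open a connection with the somelib library?",
        "output": "Call somelib.connect(host, port); it returns a Connection object.",
    },
    {
        "instruction": "Write a snippet that closes a somelib connection safely.",
        "output": "Wrap usage in try/finally and call conn.close() in the finally block.",
    },
]

# Serialize as JSONL: one JSON object per line, the usual fine-tuning format.
jsonl = "\n".join(json.dumps(ex) for ex in examples)
print(jsonl)
```

Synthetic generation just means having a stronger model draft these pairs from the library's docs, then filtering them by hand.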

        • paradigm11235@alien.topOPB · 1 year ago

I'm glad I goofed in my question, because your response was super helpful, but I now realize I was missing the terminology when I posted. I was talking about fine-tuning an existing model with a specific goal in mind (re: poetry).

  • ThinkExtension2328@alien.topB · 1 year ago

The biggest one you can run at a usable rate. The larger models tend to have more nuance; granted, some new models are challenging this notion, but that's the general way to go about it.

  • rvitor@alien.topB · 1 year ago

For training, it's sometimes better to pick a small model to run some tests and get faster feedback; then you can train a larger model if you want to and see how it goes.