  • Much simpler than GPT-4 – the person above seems to be referring to gradient accumulation (since they mentioned minibatches), where you add up the gradients from several minibatches until you reach the target batch size, then apply them all at once (rough sketch at the end of this comment). That is effectively equivalent to training on a larger batch.

    Actually training on small batches with a low learning rate and applying the gradients immediately, however, is definitely not equivalent to a bigger batch with a bigger learning rate, especially if you’re in a particularly unstable part of parameter space, where a large learning rate might overshoot. On the other hand, tiny batches tend to make the direction the model moves somewhat random, which might be good, might be bad.

    Whether this actually does what OP wants is really just an empirical question. If they tried it and it worked better than bigger batches on the same data, then I guess it helped (at least for this model and this data), haha
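
    Here’s a minimal PyTorch sketch of the gradient-accumulation idea, just to make the mechanics concrete; the model, data, and accumulation_steps are placeholders I made up, not anything from OP’s setup:

```python
import torch
from torch import nn

# Toy model and data, purely for illustration.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

accumulation_steps = 8  # 8 minibatches of 4 ~= one "real" batch of 32
minibatches = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(32)]

optimizer.zero_grad()
for step, (x, y) in enumerate(minibatches, start=1):
    loss = loss_fn(model(x), y)
    # Scale by accumulation_steps so the summed gradients match the
    # gradient of the mean loss over one large batch.
    (loss / accumulation_steps).backward()  # gradients accumulate in .grad

    if step % accumulation_steps == 0:
        optimizer.step()       # apply the accumulated gradient once
        optimizer.zero_grad()  # reset for the next "large batch"
```

    The non-equivalent alternative discussed above would be calling optimizer.step() after every single minibatch with a smaller learning rate, instead of accumulating.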


  • By having a separate translation module, you’re making the decision for the model about which parameters get used for translation and which get used for learning about the world (rough sketch of the two setups at the end of this comment).

    With an extremely small model (one that doesn’t even have the capacity to fully learn English), this would probably be reasonable. With any other size of model (100–200 million parameters and up, maybe?), it would be far, far more effective to let the model pick and choose how it allocates its parameters.

    Often, this leads to such a thorough meld of translation and world knowledge that we currently don’t even know how to tell whether a given neuron, or set of neurons, handles one task or the other. The most likely theory right now (in my opinion) is that most neurons are multitask.
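
    Purely to illustrate the distinction, here’s a toy PyTorch sketch of the two setups; the module names and layer sizes are invented for the example, not taken from anything above:

```python
import torch
from torch import nn

VOCAB, DIM = 1000, 64  # toy sizes, purely illustrative


class SeparateTranslation(nn.Module):
    """Option A: we draw the boundary ourselves. A dedicated translation
    module maps input tokens into a shared representation, and a separate
    'core' does everything else. The parameter split is fixed by us."""

    def __init__(self):
        super().__init__()
        self.translator = nn.Sequential(nn.Embedding(VOCAB, DIM), nn.Linear(DIM, DIM))
        self.core = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(), nn.Linear(DIM, VOCAB))

    def forward(self, tokens):
        return self.core(self.translator(tokens))


class EndToEnd(nn.Module):
    """Option B: one model trained on all languages at once. Nothing in the
    architecture says which parameters handle translation and which handle
    world knowledge; the optimizer allocates capacity however it likes."""

    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Embedding(VOCAB, DIM),
            nn.Linear(DIM, DIM), nn.ReLU(),
            nn.Linear(DIM, DIM), nn.ReLU(),
            nn.Linear(DIM, VOCAB),
        )

    def forward(self, tokens):
        return self.body(tokens)


tokens = torch.randint(0, VOCAB, (2, 16))   # a batch of token ids
print(SeparateTranslation()(tokens).shape)  # torch.Size([2, 16, 1000])
print(EndToEnd()(tokens).shape)             # torch.Size([2, 16, 1000])
```

    In option A the translator’s parameters can never be reused for anything else; in option B the same weights are free to serve both jobs, which is the trade-off being described above.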