Much simpler than GPT-4. The person above seems to be referring to gradient accumulation (since they mentioned minibatches), where you add up the gradients from several minibatches until you reach the target batch size, then apply them in a single update. Done that way (averaging the accumulated gradients), it's essentially equivalent to training on a larger batch.
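If it helps, here's a minimal sketch of what I mean by gradient accumulation (assuming PyTorch; the model, data, and sizes are all made up just to show the mechanics):

```python
import torch

# Toy setup purely for illustration -- swap in your real model/dataloader.
model = torch.nn.Linear(128, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
accum_steps = 8  # e.g. 8 minibatches of 4 -> effective batch size of 32

minibatches = [(torch.randn(4, 128), torch.randint(0, 2, (4,))) for _ in range(64)]

optimizer.zero_grad()
for step, (x, y) in enumerate(minibatches):
    loss = torch.nn.functional.cross_entropy(model(x), y)
    # Dividing by accum_steps makes the summed gradient equal the mean
    # gradient you'd get from one big batch.
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()       # one update with the accumulated gradient
        optimizer.zero_grad()
```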
Actually training on small batches with a low learning rate and applying the gradients immediately, however, is definitely not equivalent to a bigger batch with a bigger learning rate, especially if you're in a particularly unstable part of parameter space where a large learning rate might overshoot. On the other hand, the tiny batches make the direction the model moves at each step noisier, which might be good, might be bad.
Whether this actually does what OP wants it to do is really just an empirical question. If they tried it and it worked better than bigger batches on the same data, then I guess it helped (at least with this model and this data), haha
I find it odd that your chosen rewards went negative… Doesn't that imply the chosen samples became less likely than they were under the base model? You still get model improvement, since the rejected samples became even less likely (their rewards dropped even further), but it still feels odd. Any thoughts there?
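For reference, this is roughly how DPO's implicit rewards work under the standard formulation (a minimal sketch; the log-probs below are made-up numbers): the chosen reward is the log-ratio of the policy vs. the reference model on the chosen response, so a negative value literally means the policy now assigns the chosen response less probability than the reference did. The loss only depends on the margin, so both rewards can drift negative as long as chosen stays above rejected.

```python
import torch
import torch.nn.functional as F

beta = 0.1  # typical DPO beta; value chosen just for illustration

# Hypothetical summed log-probs of the chosen/rejected responses under the
# policy and the frozen reference model (these would come from the LM).
policy_chosen_logp, policy_rejected_logp = torch.tensor(-52.0), torch.tensor(-61.0)
ref_chosen_logp, ref_rejected_logp = torch.tensor(-50.0), torch.tensor(-55.0)

# DPO's implicit rewards are log-ratios against the reference model.
chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)        # -0.2 (negative!)
rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)  # -0.6

# Loss only sees the margin: it keeps decreasing as long as chosen > rejected,
# even while both rewards are negative.
loss = -F.logsigmoid(chosen_reward - rejected_reward)
print(chosen_reward.item(), rejected_reward.item(), loss.item())
```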