Much simpler than GPT-4. The person above seems to be referring to gradient accumulation (since they mentioned minibatches), where you add up the gradients from several minibatches until you reach the target batch size, then apply them in a single update. Done that way (averaging the accumulated gradients), it's essentially equivalent to training on a larger batch.
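If it helps, here's a minimal sketch of what I mean by gradient accumulation (assuming PyTorch; the model, data, and sizes are all made up just to show the mechanics):

```python
import torch

# Toy setup purely for illustration -- swap in your real model/dataloader.
model = torch.nn.Linear(128, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
accum_steps = 8  # e.g. 8 minibatches of 4 -> effective batch size of 32

minibatches = [(torch.randn(4, 128), torch.randint(0, 2, (4,))) for _ in range(64)]

optimizer.zero_grad()
for step, (x, y) in enumerate(minibatches):
    loss = torch.nn.functional.cross_entropy(model(x), y)
    # Dividing by accum_steps makes the summed gradient equal the mean
    # gradient you'd get from one big batch.
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()       # one update with the accumulated gradient
        optimizer.zero_grad()
```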
Actually training on small batches with a low learning rate and applying the gradients immediately, however, is definitely not equivalent to a bigger batch with a bigger learning rate, especially if you're in a particularly unstable part of parameter space where a large learning rate might overshoot. On the other hand, the tiny batches make the direction the model moves at each step noisier, which might be good, might be bad.
Whether this actually does what OP wants it to do is really just an empirical question. If they tried it and it worked better than bigger batches on the same data, then I guess it helped (at least with this model and this data), haha
I find it odd that your chosen rewards went negative… Doesn't that imply the chosen samples became less likely than they were under the base model? You still get model improvement, since the rejected samples became even less likely (their rewards dropped even further), but it still feels odd. Any thoughts there?
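For reference, this is roughly how DPO's implicit rewards work under the standard formulation (a minimal sketch; the log-probs below are made-up numbers): the chosen reward is the log-ratio of the policy vs. the reference model on the chosen response, so a negative value literally means the policy now assigns the chosen response less probability than the reference did. The loss only depends on the margin, so both rewards can drift negative as long as chosen stays above rejected.

```python
import torch
import torch.nn.functional as F

beta = 0.1  # typical DPO beta; value chosen just for illustration

# Hypothetical summed log-probs of the chosen/rejected responses under the
# policy and the frozen reference model (these would come from the LM).
policy_chosen_logp, policy_rejected_logp = torch.tensor(-52.0), torch.tensor(-61.0)
ref_chosen_logp, ref_rejected_logp = torch.tensor(-50.0), torch.tensor(-55.0)

# DPO's implicit rewards are log-ratios against the reference model.
chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)        # -0.2 (negative!)
rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)  # -0.6

# Loss only sees the margin: it keeps decreasing as long as chosen > rejected,
# even while both rewards are negative.
loss = -F.logsigmoid(chosen_reward - rejected_reward)
print(chosen_reward.item(), rejected_reward.item(), loss.item())
```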