Train Smarter, Not Harder? - MiniSymposium 7b

kindacognizant@alien.top · 2 years ago


vasileer@alien.top · 2 years ago

Multiple passes at lower learning rates aren’t supposed to produce different results.

Overfitting is not a technical challenge; it’s a mathematical property that undeniably exists whenever the training data is smaller than the full problem domain and, simultaneously, the learning rate (importantly, multiplied by the number of epochs!) would result in a higher specialization ratio of learned to unobserved data than would be expected based on the ratio of the learned to unobserved size.

Basically, if you learn 1-digit addition but half your training examples have 1 as the left number and none have 5 as the left number, then your model will likely treat 5 and 1 the same (since it’s so overtrained on examples with 1s).
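A rough way to see the gap between seen and unseen inputs in that addition example: the sketch below uses a plain lookup table as a stand-in for a fully overfit learner. Everything in it (the names `memory` and `predict`, the exact dataset split) is made up for illustration; it is not a neural network, and it only shows the train/unseen gap, not the specific "treats 5 like 1" behaviour.

```python
from itertools import product

digits = range(10)

# Skewed training set (hypothetical): left operand 5 never appears,
# and left operand 1 is repeated until it makes up half of the examples.
others = [(a, b) for a, b in product(digits, digits) if a not in (1, 5)]
ones = [(1, b) for b in digits]
train = others + ones * (len(others) // len(ones))

# "Model" that only memorizes what it saw: an extreme stand-in for overfitting.
memory = {(a, b): a + b for a, b in train}

def predict(a, b):
    """Return the memorized sum, or None for a pair never seen in training."""
    return memory.get((a, b))

share_of_ones = sum(a == 1 for a, _ in train) / len(train)
train_acc = sum(predict(a, b) == a + b for a, b in train) / len(train)

unseen = [(5, b) for b in digits]
unseen_acc = sum(predict(a, b) == a + b for a, b in unseen) / len(unseen)

print(share_of_ones, train_acc, unseen_acc)
```

A real network would not fail this cleanly, since it interpolates rather than looks things up, but the same split (perfect on seen pairs, unreliable on the missing left operand) is what the comment is describing.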

GPT-4:

The statement contains several inaccuracies:

1. Multiple passes at lower learning rates: It’s not entirely true that multiple passes with lower learning rates will produce identical results. Different learning rates can lead to different convergence properties, and multiple passes with lower learning rates can help in fine-tuning the model and potentially avoid overfitting by making smaller, more precise updates to the weights.
2. Overfitting as a mathematical property: Overfitting is indeed more of an empirical observation than a strict mathematical property. It is a phenomenon where a model learns the training data too well, including its noise and outliers, which harms its performance on unseen data. It’s not strictly due to the size of the training data but rather the model’s capacity to learn from it relative to its complexity.
3. Learning rate multiplied by the number of epochs: The learning rate and the number of epochs are both factors in a model’s training process, but their product is not a direct measure of specialization. Instead, it’s the learning rate’s influence on weight updates over time (across epochs) that can affect specialization. Moreover, a model’s capacity and the regularization techniques applied also significantly influence overfitting.
4. Example of learning 1 digit addition: The example given is somewhat simplistic and does not fully capture the complexities of overfitting. Overfitting would mean the model performs well on the training data (numbers with 1) but poorly on unseen data (numbers with 5). However, the example also suggests a sampling bias in the training data, which is a separate issue from overfitting. Sampling bias can lead to a model that doesn’t generalize well because it hasn’t been exposed to a representative range of the problem domain.

Overall, while the intention of the statement is to describe overfitting and the effects of learning rates, it conflates different concepts and could benefit from clearer differentiation between them.

kindacognizant@alien.top · 2 years ago

I am inclined to believe GPT-4, since it consistently claims this across both the API and your comment… but I’m not sure.

ganzzahl@alien.top · 2 years ago

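Much simpler than GPT-4: the person above seems to be referring to gradient accumulation (since they mentioned minibatches), where you add up gradients until you reach the target batch size, then apply them. This is mathematically equivalent to training on a larger batch (up to floating-point error, as long as the accumulated gradient is normalized over the full batch and nothing in the model, such as batch norm, depends on the per-step batch).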
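To make the gradient accumulation point concrete, here is a minimal sketch in plain Python for a toy 1-D linear model y = w * x with mean squared error. The function names (`grad_mse`, `accumulated_grad`) and the data are made up for illustration and do not come from any library.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error over `batch` with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, batch, micro_batch_size):
    """Accumulate micro-batch gradients without touching w in between.

    Each micro-batch gradient is weighted by its share of the full batch,
    so the total is normalized the same way as the large-batch gradient.
    """
    total = 0.0
    for start in range(0, len(batch), micro_batch_size):
        micro = batch[start:start + micro_batch_size]
        total += grad_mse(w, micro) * len(micro) / len(batch)
    return total

# Toy data, assumed for the example only.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5

full = grad_mse(w, data)                              # one batch of 4
accum = accumulated_grad(w, data, micro_batch_size=2)  # two micro-batches of 2
print(full, accum)
```

The two numbers agree up to floating-point rounding because the weights stay fixed while the gradients are summed; only the single update at the end changes w.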

Actually training on small batches with a low learning rate, however, and applying the gradients immediately, is definitely not equivalent to a bigger batch with a bigger learning rate, especially if you’re in a particularly unstable part of parameter space, where large learning rates might overshoot. On the other hand, the tiny batches would tend to make the direction your model moves somewhat random, which might be good, might be bad.
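A small sketch of why the two are not equivalent, again in plain Python on a toy 1-D linear model y = w * x. The learning rates, data, and helper names are assumptions chosen for illustration; the large-batch run uses a learning rate 4x the small-batch one to mimic "bigger batch, bigger learning rate".

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error over `batch` with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def sgd(w, batches, lr):
    """Apply one gradient step per batch, updating w immediately each time."""
    for batch in batches:
        w = w - lr * grad_mse(w, batch)
    return w

# Toy data, assumed for the example only.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w0 = 0.5

# Four immediate updates on batches of size 1 with a small learning rate.
w_small = sgd(w0, [[pair] for pair in data], lr=0.01)

# One update on the full batch of 4 with a 4x larger learning rate.
w_large = sgd(w0, [data], lr=0.04)

print(w_small, w_large)
```

The results differ because each small-batch gradient is evaluated at weights that have already moved, whereas the large batch evaluates every example at the starting weights. Which of the two ends up better on a real model is, as noted, an empirical question.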

Whether or not this actually does what OP wants it to is really just an empirical question. If they did it, and it worked better than bigger batches with the same data, then I guess it helped (in this case with this model and this data), haha

Revolutionalredstone@alien.top · 2 years ago

Good info, thanks for that!