Train Smarter, Not Harder? - MiniSymposium 7b

kindacognizant@alien.top · 2 years ago

Train Smarter, Not Harder? - MiniSymposium 7b

StaplerGiraffe@alien.top · 2 years ago

You are correct. Small learning rate allows to do fine adjustments to parameters and thereby learning subtle features. However, initially learning subtle features is useless, since you need to learn the coarse features first. That’s why learning rate schedulers go from large learning rate to small learning rate. The tricky bit is doing the minimal amount of training on a large learning rate. That is where various optimizers come in, which try do automate these kinds of things.

You could try to do this by hand by saving checkpoints periodically, and try to find the point where you go from undertrained to overtrained. Then pick a checkpoint which is slightly undertrained, and start training from there with a lower learning rate.

kindacognizant@alien.top · 2 years ago

Considering there’s an implementation of the cosine scheduler with warmup steps, is there any implementation of a scheduler that starts slow, then rapidly accelerates, and finally stabilizes to learn the subtle features (like a sigmoidal function?) To avoid starting too high in the first place.

https://preview.redd.it/qb1z0n7oci2c1.png?width=1200&format=png&auto=webp&s=15dbab7b3a18ab918defbbbe2ab6816aaa46b489

StaplerGiraffe@alien.top · 2 years ago

Honestly, no idea. I have more theoretical than practical understanding. But my idea of the warmup phase is to arrange the initial totally random weights of a network into something where you can optimize on. When finetuning you don’t start from randomness, you start from a trained checkpoint, so I expect that the warmup phase is pointless (at least for SGD, no idea if it helps adaptive optimizers). So believe you should go from high learning rate to low learning rate, unless somebody knows better.

Oh, and when training Loras, remember that changing alpha also changes the learning rate by the same factor if I remember right. So many tests about optimal alpha are probably invalid, because people didn’t adjust the learning rate.