https://huggingface.co/kalomaze/MiniSymposium-Demo
MiniSymposium is an experimental model that I built on top of Mistral 7B. I created it to test these goals:
- Demonstrate the untapped potential of using a small, focused dataset of handwritten examples instead of training on a large amount of synthetic GPT outputs, by lowering the learning rate and doing many passes over the small dataset
- Create a dataset that allows the model to explore different possible answers from multiple perspectives before reaching a final conclusion (‘Socratic prompting’?)
- Develop a model that performs well across various pseudo-markdown prompt formats, rather than overfitting to a specific kind of format such as ChatML, which should naturally benefit other general purpose use cases
The current trend in QLoRA/LoRA-based finetuning (and finetuning in general for local LLMs) is to use large synthetic datasets. These are typically GPT-generated, and models are trained on them with higher learning rates.
However, I believe there is a lot of potential in using small, hand-written datasets with low learning rates, even if it’s for general-purpose instruction following, as long as you train it for many epochs on a learning rate low enough to avoid overfitting.
This approach, I hypothesize, helps the model learn the deeper patterns of instruction following, including the small details. This should help avoid shallow data biases (like “As an AI made by OpenAI” and other GPT-isms) that are irrelevant to deeper instruction-following patterns, especially in long-context and multiturn scenarios.
My initial configuration for this QLoRA model used a constant learning rate of 1e-6 (0.000001), which resulted in obvious, massive overfitting after about 100 epochs. The model started reproducing the original dataset almost verbatim and generalized poorly across different prompt formats, with obvious hallucinations and, for some reason, Chinese-language outputs.
However, turning the learning rate down to 1/10th of that (1e-7, which is 0.0000001) significantly improved the model with the exact same small dataset. I trained for about 10 hours on my RTX 3060, up to 600 epochs; I think it’s still a little undertrained, but I encourage people to try the demo model out in the meantime.
It’s designed to be very adaptable to different prompt formats and playing roles, and I’ve gotten some fun and sometimes surprisingly good outputs so far.
A few samples of the training data are formatted like this to help avoid blatant overconfidence in its outputs, to serve as a sort of self-correction mechanism:
Let me know how this model goes. There’s lots of merges of models that are all sort of doing the same thing, so I figured a more experimental approach would be appreciated. I think there is still more optimization for LR/epoch balance, and I’ll probably add some more examples of specific tasks like Summarization in the dataset so that it’s not *too* small (but still lightweight enough to generalize well).
Oh yes it is. The whole point of gradient descent is to slowly explore the dimensions of the gradient. With smaller steps you have a totally different trajectory than with bigger steps. And every pass makes you move.
If you choose a learning rate that is too small, you will often just move more slowly along the same path, but a learning rate that is too big makes you skip entire paths.
OP seems to have run into that second case with their first attempt.
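As a toy illustration of that point (not the OP's training setup), here is a minimal sketch of plain gradient descent on the double-well function f(x) = (x^2 - 1)^2, which has minima at x = 1 and x = -1. The function, the starting point, and both step sizes are assumptions chosen only to make the effect visible.

```python
def grad(x):
    # Derivative of the toy loss f(x) = (x**2 - 1)**2.
    return 4.0 * x * (x * x - 1.0)


def descend(x0, lr, steps):
    # Plain gradient descent with a constant learning rate.
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


# Same starting point, two different constant learning rates.
small = descend(2.0, lr=0.01, steps=2000)
large = descend(2.0, lr=0.1, steps=200)

print(f"lr=0.01 ends near x = {small:.3f}")
print(f"lr=0.1  ends near x = {large:.3f}")
```

Starting from x = 2, the small step size slides into the nearby minimum at x = 1, while the first large step jumps across the barrier at x = 0 and the run settles in the other minimum at x = -1.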
So you’re saying my intuition isn’t wrong, necessarily, that slow training to learn the small subtle details could work as long as the dataset wasn’t *too* limited in scope?
You are correct. A small learning rate allows fine adjustments to the parameters, and thereby the learning of subtle features. However, learning subtle features at the start is useless, since you need to learn the coarse features first. That’s why learning rate schedulers go from a large learning rate to a small one. The tricky bit is doing the minimal amount of training at a large learning rate. That is where the various optimizers come in, which try to automate these kinds of things.
You could try to do this by hand by saving checkpoints periodically and finding the point where you go from undertrained to overtrained. Then pick a checkpoint that is slightly undertrained, and continue training from there with a lower learning rate.
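A minimal sketch of that manual procedure, assuming you log a validation loss on held-out examples for each saved checkpoint and treat its lowest point as the boundary between undertrained and overtrained. The function name `pick_resume_checkpoint` and the `back_off` parameter are hypothetical, and the loss values are made up for illustration.

```python
def pick_resume_checkpoint(val_losses, back_off=1):
    """Return the index of a slightly undertrained checkpoint.

    val_losses[i] is the validation loss of the i-th saved checkpoint.
    The checkpoint with the lowest validation loss is treated as the
    turning point; we step back `back_off` checkpoints from it.
    """
    if not val_losses:
        raise ValueError("need at least one checkpoint")
    best = min(range(len(val_losses)), key=val_losses.__getitem__)
    return max(0, best - back_off)


# Made-up losses purely for illustration, not measurements from a real run.
example_losses = [2.00, 1.50, 1.20, 1.10, 1.15, 1.30]
resume_index = pick_resume_checkpoint(example_losses)
print(resume_index)
```

Training would then resume from the checkpoint at `resume_index` with a lower learning rate.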
Considering there’s an implementation of the cosine scheduler with warmup steps, is there any implementation of a scheduler that starts slow, then rapidly accelerates, and finally stabilizes to learn the subtle features (like a sigmoidal function), so that you avoid starting too high in the first place?
https://preview.redd.it/qb1z0n7oci2c1.png?width=1200&format=png&auto=webp&s=15dbab7b3a18ab918defbbbe2ab6816aaa46b489
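One way to sketch the shape being asked about, assuming it means a learning rate that starts near zero, rises along a sigmoid, and then plateaus at a peak value (it does not decay afterward). The function name, `steepness`, `midpoint`, and the peak value are all hypothetical choices. In PyTorch, a function like this could be passed as the `lr_lambda` argument of `torch.optim.lr_scheduler.LambdaLR`, which multiplies the base learning rate by the returned factor; the sketch itself is plain Python.

```python
import math


def sigmoid_lr_factor(step, total_steps, steepness=12.0, midpoint=0.5):
    """Multiplier in (0, 1) to apply to a peak learning rate.

    The factor is close to 0 early on, climbs fastest around
    `midpoint` (as a fraction of total_steps), and flattens near 1.
    """
    progress = step / max(1, total_steps)
    return 1.0 / (1.0 + math.exp(-steepness * (progress - midpoint)))


peak_lr = 1e-6  # hypothetical peak learning rate
total = 1000    # hypothetical number of optimizer steps
schedule = [peak_lr * sigmoid_lr_factor(s, total) for s in range(total + 1)]

print(f"start: {schedule[0]:.2e}  middle: {schedule[total // 2]:.2e}  end: {schedule[-1]:.2e}")
```

A larger `steepness` makes the rise more abrupt, and `midpoint` shifts where in the run it happens.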
Honestly, no idea. My understanding is more theoretical than practical. But my idea of the warmup phase is that it arranges the initial, totally random weights of a network into something you can optimize on. When finetuning, you don’t start from randomness, you start from a trained checkpoint, so I expect that the warmup phase is pointless (at least for SGD, no idea if it helps adaptive optimizers). So I believe you should go from a high learning rate to a low learning rate, unless somebody knows better.
Oh, and when training LoRAs, remember that changing alpha also changes the learning rate by the same factor, if I remember right. So many tests of the optimal alpha are probably invalid, because people didn’t adjust the learning rate.
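For context, in the standard LoRA formulation the adapter's contribution to a layer's output is scaled by alpha / r, where r is the rank. The sketch below uses made-up small matrices to show that doubling alpha doubles that contribution for fixed adapter weights. It only illustrates the forward-pass scaling; how closely this matches a change in learning rate depends on the optimizer and is not shown here.

```python
def lora_delta(x, A, B, alpha, r):
    """Adapter contribution (alpha / r) * B @ A @ x, using plain lists.

    A has shape r x d_in and B has shape d_out x r.
    """
    scale = alpha / r
    hidden = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    return [scale * sum(b * h for b, h in zip(row, hidden)) for row in B]


# Made-up adapter weights and input, for illustration only (r = 2).
A = [[0.1, -0.2, 0.3],
     [0.0, 0.5, -0.1]]
B = [[1.0, 0.5],
     [-0.5, 2.0]]
x = [1.0, 2.0, 3.0]

delta_16 = lora_delta(x, A, B, alpha=16, r=2)
delta_32 = lora_delta(x, A, B, alpha=32, r=2)

print(delta_16)
print(delta_32)
```

With everything else fixed, `delta_32` is twice `delta_16`, which is why comparing alpha values without also accounting for this scale can be misleading.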
Good to know thank you