NeuralHermes-2.5: Boosting SFT models' performance with DPO

mlabonne@alien.top · 3 years ago

Yes, I’d say it would probably work better than the current approach. If you look at the reward plots on wandb, the problem seems too easy for the model, which would explain why the improvement is only slight.

https://preview.redd.it/xhuyiquojg3c1.png?width=2398&format=png&auto=webp&s=67725747e6cd9254e38728149fb6cea3ba85d71e

ganzzahl@alien.top · 3 years ago

I find it odd that your chosen rewards went negative… Doesn’t this imply that the chosen samples became less likely than they were under the base model? You still get model improvements, since the rejected samples became even less likely, but it still feels odd. Any thoughts there?
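
To make the question concrete, here is a minimal sketch of how the rewards and the loss relate, assuming the plotted values are the usual DPO implicit rewards, i.e. beta times the difference between the policy and reference log-probabilities of a completion. The function name, the log-probability values, and beta = 0.1 are made up for illustration and are not taken from the actual run.

```python
import math

def dpo_rewards_and_loss(policy_chosen_logp, policy_rejected_logp,
                         ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Implicit DPO rewards and loss for one preference pair (illustrative)."""
    # Reward = beta * (log-prob under the policy - log-prob under the reference).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # The loss only depends on the margin between the two rewards.
    margin = chosen_reward - rejected_reward
    loss = math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
    return chosen_reward, rejected_reward, loss

# Hypothetical numbers: at the start the policy equals the reference.
before = dpo_rewards_and_loss(-50.0, -60.0, -50.0, -60.0)
# Later both completions are less likely, the rejected one much more so.
after = dpo_rewards_and_loss(-55.0, -80.0, -50.0, -60.0)

print("before:", before)
print("after: ", after)
```

In this toy case the chosen reward ends up negative, meaning the chosen completion is less likely than under the reference model, yet the loss still drops because the rejected reward falls further. The objective only pushes on the margin, so nothing in it forces the chosen reward to stay above zero.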