NeuralHermes-2.5: Boosting SFT models' performance with DPO

mlabonne@alien.top · 3 years ago

NeuralHermes-2.5: Boosting SFT models' performance with DPO

onil_gova@alien.top · 3 years ago

New favorite model!

onil_gova@alien.top · 3 years ago

what does it feel like to generate tokens?

https://preview.redd.it/ypt1we1cmf3c1.png?width=681&format=png&auto=webp&s=c3a0fd98e41fbd2fffd725bd34124c2c7f887715

petitmottin@alien.top · 3 years ago

a_beautiful_rhind@alien.top · 3 years ago

Would be cool to see this in a 34b and 70b.

Informal-Ad-534@alien.top · 3 years ago

It holds up pretty well! What Mirostat Tau value would you recommend with it?

kpodkanowicz@alien.top · 3 years ago

Really cool! What do you think about using GPT-3.5 output as the rejected (worse) sample, in the hope of surfacing some extra edge?

mlabonne@alien.top · 3 years ago

Yes, I’d say it’d probably work better than the current approach. If you look at the reward plots on wandb, it feels like the problem is too easy for the model, hence the slight improvement.

https://preview.redd.it/xhuyiquojg3c1.png?width=2398&format=png&auto=webp&s=67725747e6cd9254e38728149fb6cea3ba85d71e
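For readers following the idea above, here is a minimal sketch of pairing a stronger answer with a weaker model's answer to form preference data. It assumes the common prompt/chosen/rejected record layout. The function name `build_preference_pairs` and the example strings are hypothetical, not taken from the actual NeuralHermes training setup.

```python
def build_preference_pairs(prompts, strong_answers, weak_answers):
    """Pair each prompt with a preferred and a dispreferred answer.

    strong_answers: answers treated as "chosen" (e.g. from a stronger model).
    weak_answers:   answers treated as "rejected" (e.g. from a weaker model).
    All names here are illustrative assumptions.
    """
    pairs = []
    for prompt, chosen, rejected in zip(prompts, strong_answers, weak_answers):
        # Identical answers carry no preference signal, so skip them.
        if chosen.strip() == rejected.strip():
            continue
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


# Toy usage with made-up strings.
pairs = build_preference_pairs(
    ["What is 2 + 2?", "Name a prime number."],
    ["2 + 2 equals 4.", "7"],
    ["It is probably 5.", "7"],
)
print(len(pairs))
```

The second prompt is dropped because both answers match. A harder rejected sample narrows the gap between chosen and rejected, which is the intuition behind making the problem less easy for the model.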

ganzzahl@alien.top · 3 years ago

I find it odd that your chosen rewards went negative… Doesn’t this imply that the chosen samples became less likely than they were under the base model? You still get model improvements, since the rejected samples became even less likely, but it still feels odd. Any thoughts there?
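To make the point about negative chosen rewards concrete, here is a small sketch of the implicit DPO reward, which is beta times the difference between the policy's and the reference model's log-probability of a completion. The log-probabilities and `beta=0.1` below are made-up illustrative numbers, not values from the wandb run.

```python
import math


def dpo_reward(policy_logp, ref_logp, beta=0.1):
    """Implicit DPO reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (policy_logp - ref_logp)


def dpo_loss(chosen_reward, rejected_reward):
    """Per-pair DPO loss: -log(sigmoid(chosen_reward - rejected_reward))."""
    margin = chosen_reward - rejected_reward
    return math.log1p(math.exp(-margin))


# Made-up sequence log-probabilities (assumptions for illustration only).
ref_chosen, ref_rejected = -50.0, -55.0   # reference (base) model
pol_chosen, pol_rejected = -52.0, -70.0   # policy after DPO training

r_chosen = dpo_reward(pol_chosen, ref_chosen)        # negative: chosen got less likely
r_rejected = dpo_reward(pol_rejected, ref_rejected)  # more negative: rejected dropped further
loss = dpo_loss(r_chosen, r_rejected)

print(r_chosen, r_rejected, loss)
```

Both rewards are negative here, yet the loss is lower than the zero-margin value of log 2, because the loss depends only on the margin between the two rewards and not on the sign of either one.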

perlthoughts@alien.top · 3 years ago

nice job!

actualopenai@alien.top · 3 years ago

Works really well! Any chance of getting it on the 16k version? https://huggingface.co/NurtureAI/OpenHermes-2.5-Mistral-7B-16k
Would it have to be a different dataset?

Creative_Bottle_3225@alien.top · 3 years ago

What is the difference between the normal and 16k versions?

mlabonne@alien.top · 3 years ago

It’s a good question, I can give it a try. Ideally, you’d want a 16k version of the preference dataset to make sure that DPO doesn’t ruin it. But considering the low number of training samples, it probably works fine.

Wonderful_Ad_5134@alien.top · 3 years ago

The improvement is so small it could be within the margin of error.