https://x.com/kylemarieb/status/1728281581306233036
New DeepMind paper just dropped.
Background: Direct Preference Optimization (DPO) is the simpler, more robust, higher-performing successor to RLHF; it's used in Zephyr, Intel's new model, and others.
Identity-PO (IPO) simplifies DPO further, removing its reliance on Elo-style preference scores (and the mathematical assumptions that come with them). The authors claim this fixes DPO's overfitting problem, which is huge if true.
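For intuition, here's a rough sketch of the difference (variable names like policy_chosen_logps are mine, not the paper's, and beta stands in for the paper's regularisation parameter τ): DPO puts a logistic, Bradley-Terry-style loss on the policy-vs-reference log-ratio margin, while IPO regresses that same margin towards a fixed target, which is what bounds the objective and curbs overfitting.

```python
import torch.nn.functional as F


def preference_losses(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      beta=0.1):
    """Sketch of the DPO vs. IPO losses on a batch of preference pairs.

    Each *_logps tensor holds the summed log-probs of a full response
    under the trained policy or the frozen reference model.
    """
    # Log-ratio margin between chosen and rejected completions.
    logits = (policy_chosen_logps - ref_chosen_logps) - (
        policy_rejected_logps - ref_rejected_logps
    )

    # DPO: logistic (Bradley-Terry / Elo-style) loss on the margin.
    dpo_loss = -F.logsigmoid(beta * logits).mean()

    # IPO: squared regression of the margin towards 1/(2*beta);
    # no Bradley-Terry assumption, and the target is bounded.
    ipo_loss = ((logits - 1 / (2 * beta)) ** 2).mean()

    return dpo_loss, ipo_loss
```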
The trend towards simpler solutions and sounder mathematical grounding in alignment is fun to watch. These inscrutable matrices are looking awfully controllable, and the failure modes of the old methods were things like wedding party collapse.
It's already available in Hugging Face's DPO trainer (TRL's DPOTrainer) too.
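If you want to try it, the switch is just a loss_type option. A minimal sketch, assuming the TRL 0.7-era API where beta and loss_type are passed straight to DPOTrainer (newer versions move them into a DPOConfig); "gpt2" and the toy dataset are placeholders, not anything from the paper:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

# Same pipeline as a normal DPO run; only loss_type changes.
model = AutoModelForCausalLM.from_pretrained("gpt2")
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# DPOTrainer expects prompt / chosen / rejected columns.
dataset = Dataset.from_dict({
    "prompt": ["What is 2 + 2?"],
    "chosen": ["2 + 2 is 4."],
    "rejected": ["2 + 2 is 5."],
})

trainer = DPOTrainer(
    model,
    ref_model,
    args=TrainingArguments(
        output_dir="ipo-out",
        per_device_train_batch_size=1,
        remove_unused_columns=False,  # keep the raw text columns for the collator
    ),
    train_dataset=dataset,
    tokenizer=tokenizer,
    beta=0.1,          # regularisation strength (the tau-like knob)
    loss_type="ipo",   # "sigmoid" is vanilla DPO; "ipo" uses the squared loss
)
trainer.train()
```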
Ty, that’s helpful to know.