Goliath-120B - quants and future plans

AlpinDale@alien.top · 2 years ago

randomfoo2@alien.top · 2 years ago

It depends on the use case. Each model may have its own strengths. I picked XWin and Airoboros as baseline 70B models for second-language conversational testing, and XWin outperformed (in human-evaluated testing with a native speaker) a 70B model that had been pre-trained on an additional 100B tokens of that second language. Shocking, to say the least.

a_beautiful_rhind@alien.top · 2 years ago

My test was logs of chats with characters, something that isn’t widely publicly available, so it can’t be gamed. Xwin has very bad perplexity on those: worse (higher) than even codellama-34b.

xwin: 4.876139163970947

Codellama: 4.689054489135742

Same quantization…

70b-base: 3.69110918045044

Euryale-1.3: 3.8607137203216553

Dolphin 2.2 did surprisingly badly: 4.39600133895874, but not as badly as xwin.

Obviously perplexity doesn’t track 100% with a model being good, but everything combined about xwin (refusals, the repetition issue, perplexity) put me off it in a big way.
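
For anyone unsure how to read the numbers above: perplexity is the exponential of the average negative log-likelihood per token, so lower means the model found the text less surprising. Here is a minimal sketch of that formula in plain Python. The `perplexity` helper and the toy log-probabilities are illustrative assumptions, not the script used for the test above; the `reported` values are just the scores quoted in this post, sorted best to worst.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    token_logprobs: natural-log probabilities the model assigned to each
    actual next token in the evaluation text (hypothetical input here).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Toy inputs (made up for illustration): a model that gives every correct
# token probability 0.5 versus one that gives it 0.25.
confident = [math.log(0.5)] * 4
unsure = [math.log(0.25)] * 4
print(perplexity(confident), perplexity(unsure))

# Scores quoted in the post above, ranked from lowest (best) to highest.
reported = {
    "xwin": 4.876139163970947,
    "Codellama": 4.689054489135742,
    "70b-base": 3.69110918045044,
    "Euryale-1.3": 3.8607137203216553,
    "Dolphin 2.2": 4.39600133895874,
}
ranked = sorted(reported.items(), key=lambda item: item[1])
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Comparisons like this only make sense when every model is scored on the same text with the same settings, which is why the same quantization matters.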