Training on the rephrased test set is all you need: 13B models can reach GPT-4 performance in benchmarks with no contamination detectable by traditional methods

Covid-Plannedemic_@alien.top · 2 years ago

its_just_andy@alien.top · 2 years ago

If you’re interested in running your own models for any reason, you really should build your own evaluation dataset for the scenarios you care about.

At this point, all the public benchmarks are such a mess. Do you really care whether the model you select has the highest MMLU score? Or do you only care that it’s the best-performing model for the scenarios you actually need?
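A rough sketch of what a private evaluation set could look like, in Python. Everything here is hypothetical and only for illustration: the `EVAL_SET` cases, the keyword-match scoring rule, and `dummy_model`, which stands in for whatever call you use to query your own model.

```python
from typing import Callable

# Hypothetical private eval set: prompts for the scenarios you care about,
# each with a keyword that an acceptable answer should contain.
EVAL_SET = [
    {"prompt": "What is 2 + 2?", "expect": "4"},
    {"prompt": "Name the capital of France.", "expect": "paris"},
]


def score(model: Callable[[str], str], eval_set=EVAL_SET) -> float:
    """Return the fraction of prompts whose answer contains the keyword."""
    hits = sum(
        case["expect"].lower() in model(case["prompt"]).lower()
        for case in eval_set
    )
    return hits / len(eval_set)


def dummy_model(prompt: str) -> str:
    # Stand-in for a real model call (an assumption, not a real API).
    return "The answer is 4." if "2 + 2" in prompt else "I don't know."


if __name__ == "__main__":
    print(f"dummy_model accuracy: {score(dummy_model):.2f}")
```

Keyword matching is a crude scoring rule; the point is only that the prompts stay private and reflect your own use case, so they can't have leaked into anyone's training data.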

Exios-@alien.top · 2 years ago

This seems to me like the most logical conclusion. I’m currently developing a set of moral/ethical dilemma scenarios to compare how models interpret different perspectives and response strategies. For my personal use cases (discussing topics, breaking them down into manageable levels, and then exploring the nuances), this approach is very effective. It seems far too broad a “use case” for one set of benchmarks to cover, unless that set is incredibly comprehensive and refined over and over as trends develop.

shibe5@alien.top · 2 years ago

With the abundance of models, most developers and users have to select a small subset of the available models for their own evaluation, and that selection has to be based on some already available data about the models’ performance. At that stage, selecting the models with, for example, the highest MMLU score is one way to go about it.

Catch me if you can! How to beat GPT-4 with a 13B model | LMSYS Org