shibe5@alien.top to LocalLLaMA • Training on the rephrased test set is all you need: 13B models can reach GPT-4 performance in benchmarks with no contamination detectable by traditional methods
1 year ago
With the abundance of models, most developers and users have to select a small subset of available models for their own evaluation, and that selection has to be based on already available data about the models’ performance. At that stage, selecting the models with, for example, the highest MMLU scores is one way to go about it.
Own web UI for experimenting.