minus-squareamroamroamro@alien.topBtoLocalLLaMA•Training on the rephrased test set is all you need: 13B models can reach GPT-4 performance in benchmarks with no contamination detectable by traditional methodslinkfedilinkEnglisharrow-up1·1 year ago When a measure becomes a target, it ceases to be a good measure linkfedilink