I noticed I never posted this before: while experimenting with various merges, after merging Phind v2, the Speechless finetune, and WizardCoder-Python-34B (33% each, averaged) and then adding the Airoboros PEFT adapter on top, I consistently get:
{'pass@1': 0.7926829268292683}
Base + Extra
{'pass@1': 0.7073170731707317}
Instruct prompt, greedy decoding, seed=1, 8-bit.
Phind and Wizard score around 72%, Speechless 75%, Airoboros around 60%.
(That would have been SOTA back then; it is also the current score of Deepseek-33B.)
The model is rather broken - it has not passed any of my regular test questions. In my opinion that would mean that, by a lucky stroke, I broke the model in a way that let some of the original benchmark data resurface. Let me know what you think.
If someone is very interested I can push it to HF, but it's a waste of storage.
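For anyone curious about the general recipe, here is a minimal sketch of a uniform merge plus PEFT adapter, assuming transformers and peft; the repo IDs and adapter path are placeholders, not necessarily the exact checkpoints I used:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Hypothetical source checkpoints -- treat these IDs as placeholders.
SOURCES = [
    "Phind/Phind-CodeLlama-34B-v2",
    "uukuguy/speechless-codellama-34b-v2.0",
    "WizardLM/WizardCoder-Python-34B-V1.0",
]

# Load all three models (needs a lot of RAM; a real run would merge
# shard-by-shard with dedicated tooling instead).
models = [
    AutoModelForCausalLM.from_pretrained(m, torch_dtype=torch.float16)
    for m in SOURCES
]
param_dicts = [dict(m.named_parameters()) for m in models]

# Uniform ("33% each") average of every parameter, written into the first model.
merged = models[0]
with torch.no_grad():
    for name, param in merged.named_parameters():
        param.copy_(torch.stack([pd[name] for pd in param_dicts]).mean(dim=0))

# Apply the Airoboros LoRA/PEFT adapter on top and bake it into the weights
# (adapter path is a placeholder).
merged = PeftModel.from_pretrained(merged, "path/to/airoboros-peft-adapter")
merged = merged.merge_and_unload()
merged.save_pretrained("phind-speechless-wizard-airoboros-merge")
```

In practice something like mergekit would do this shard-by-shard rather than holding all three 34B models in memory at once, but the math is the same.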
What do you mean by base + extra?
Merging models can be unpredictable; it isn't an established science yet. It can absolutely make a merge better at a particular benchmark than any of its components. I don't think it's evidence of anything, to be honest.
HumanEval is 164 function declarations with corresponding docstrings, and evaluation happens by running the generated code against a set of unit tests in Docker. Extra comes from HumanEvalPlus, which adds several unit tests per problem on top.
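For context, a rough sketch of how that pass@1 number gets produced with OpenAI's human-eval harness; the generation function is a placeholder for the actual model call (instruct prompt, greedy decoding, seed=1, 8-bit), not my exact script:

```python
from human_eval.data import read_problems, write_jsonl


def generate_completion(prompt: str) -> str:
    """Placeholder: call the merged model here and return the completed function body."""
    raise NotImplementedError


problems = read_problems()  # 164 function declarations + docstrings

samples = [
    {"task_id": task_id, "completion": generate_completion(problems[task_id]["prompt"])}
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)

# The completions are then executed against the unit tests (inside Docker /
# a sandbox) with the harness CLI:
#   evaluate_functional_correctness samples.jsonl
# which prints {'pass@1': ...}. The "Extra" score re-runs the same samples
# against the extended HumanEvalPlus test sets.
```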
Merging models might improve capabilities, but this one was not able to find an out-of-bounds access in a wrongly declared vector - there is no chance it magically became able to complete complex Python code at what is basically GPT-4 level.