• 1 Post
  • 16 Comments
Joined 1 year ago
Cake day: October 30th, 2023



  • HumanEval is 164 function declarations with corresponding docstrings, and evaluation happens via a set of unit tests run against the generated code inside Docker. The "Extra" comes from HumanEvalPlus, which adds several more unit tests per problem on top.

    Merging models might improve their capabilities, but this one was not able to find the out-of-bounds access in a wrongly declared vector - there is no chance it can magically complete complex Python code at basically GPT-4 level.
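To make the evaluation setup above concrete, here is a minimal sketch of a HumanEval-style check: the model completes a function body, and scoring runs the candidate against unit tests (in the real benchmark this happens in a sandboxed Docker container, not a bare `exec`). The problem, completion, and tests here are toy stand-ins, not actual benchmark items.

```python
# A toy HumanEval-style task: the prompt is a function signature plus
# docstring, and the model is asked to generate the body.
PROBLEM = '''
def add(a, b):
    """Return the sum of a and b."""
'''

candidate_completion = "    return a + b\n"   # what the model would generate

# Unit tests in the style of HumanEval's "check" function;
# HumanEvalPlus ("Extra") adds several more tests per problem.
def check(fn):
    assert fn(1, 2) == 3
    assert fn(-1, 1) == 0
    assert fn(0, 0) == 0

# Execute prompt + completion, then run the tests against the result.
namespace = {}
exec(PROBLEM + candidate_completion, namespace)
check(namespace["add"])
print("passed")  # → passed
```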



  • You are on fire - this is yet another great post of yours. Btw, I changed the perplexity scripts to only measure responses after the instruction, using, for example, the Evol dataset, with the preset configured to match the model. I got completely different results than with normal perplexity - interestingly, when running code instructions on a normal model and, for instance, roleplay instructions on a coding model, not only is perplexity around 1 vs. 3, but the models also degrade differently.
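The idea of scoring only the response can be sketched as follows. This is a minimal illustration with toy log-probabilities standing in for real model outputs; `prompt_len` marks where the instruction ends so that only response tokens contribute to perplexity.

```python
import math

# Toy per-token log-probs; in practice these come from the model's
# forward pass over instruction + response.
token_logprobs = [-0.1, -2.3, -0.5, -0.2, -0.1, -0.3]
prompt_len = 2   # first two tokens belong to the instruction

# Standard perplexity would average over all tokens; response-only
# perplexity masks out the instruction tokens.
response_lps = token_logprobs[prompt_len:]
ppl = math.exp(-sum(response_lps) / len(response_lps))
print(round(ppl, 3))
```

With a real model you would gather log-probs from the logits and set `prompt_len` from the tokenized instruction length.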




  • Guiding output was already mentioned, but maybe I will mention how this can be done even with a very weak model.

    You use the text completion endpoint, where you will be constructing your prompts.
    You specify the context and make it stand out as a separate block.
    Then in the prompt you ask it to fill in one specific detail (just one, going into the JSON).
    In the completion part (i.e. after "assistant") you already pre-write the output in JSON format up to the first value.
    You stop streaming after the " sign.
    Then change the prompt to ask for the next value, add it as the next attribute to the JSON you are generating, and again start generation and stop at ".

    Very, very fast - you barely generate any tokens; it is mostly prompt evaluation.

    Test it manually; once you have a good result, ask GPT-4 to write you a Python wrapper to do it.
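The steps above can be sketched roughly like this. The `complete()` function here is a hypothetical stand-in for a call to a text-completion endpoint with a `"` stop sequence (e.g. an HTTP request with `{"prompt": ..., "stop": ["\""]}`); the canned answers simulate what a model would generate from the context.

```python
import json

# Hypothetical stub for a text-completion endpoint that stops at ".
# A real implementation would POST the prompt to the server and
# return the generated tokens up to the stop sequence.
def complete(prompt: str, stop: str = '"') -> str:
    canned = {"name": "Alice", "city": "Paris"}   # simulated model output
    for field, value in canned.items():
        if prompt.endswith(f'"{field}": "'):
            return value
    return ""

CONTEXT = "### Context\nAlice lives in Paris.\n### End of context\n"

def extract(fields):
    partial = "{"                  # pre-written start of the JSON output
    result = {}
    for i, field in enumerate(fields):
        if i > 0:
            partial += ", "
        # Ask for exactly one detail, pre-write the attribute name and
        # the opening quote, then let the model fill in the value.
        prompt = (CONTEXT
                  + f"Fill in the {field} from the context above.\n"
                  + "Assistant: " + partial + f'"{field}": "')
        value = complete(prompt, stop='"')   # generation stops at the closing "
        partial += f'"{field}": "{value}"'
        result[field] = value
    return result, partial + "}"

data, raw = extract(["name", "city"])
print(data)                        # → {'name': 'Alice', 'city': 'Paris'}
assert json.loads(raw) == data     # the assembled string is valid JSON
```

Because each step only generates a handful of value tokens and the rest is prompt evaluation, this stays fast even with weak models.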


  • This is really interesting work!!! I’m doing research on Contrastive Decoding and have pretty good results so far; moreover, reading this post I realized it might fix my issues with picking the right alpha.

    I have a suggestion for OP and the people reading this post - could we start collecting the “go-to” questions that this community uses for testing? It will be easier to automate, and then we can publish all outputs at once and let people rank whether they like each output or not.

    This way it will be much easier for small teams and individuals to make meaningful progress.








  • Great work as always! Regarding EXL2, it’s sensitive to the calibration dataset - probably the one that was used is not related to your tests. I.e., you can get higher scores on HumanEval even at 3 bits than you would get with transformers at 8-bit. I hope this standard gets more popular and that finetuners will do their own measurement files/quants using their own datasets. I’ve never seen a Q2 GGUF do better than EXL2 unless I mixed up the RoPE config.

    Edit - for anything higher than 4.25 bits I usually use an 8-bit head.