• 1 Post
  • 16 Comments
Joined 1 year ago
Cake day: October 30th, 2023



  • HumanEval is 164 function declarations with corresponding docstrings, and evaluation happens via a set of unit tests run against the generated code inside Docker. The "Extra" comes from HumanEvalPlus, which adds several more unit tests per problem on top.

    Merging models might improve their capabilities, but this one was not able to find the out-of-bounds access in a wrongly declared vector - there is no chance it can magically complete complex Python code at basically GPT-4 level.
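To make the evaluation setup above concrete, here is a minimal sketch of a HumanEval-style check: the model completes a function body, and scoring runs the candidate against unit tests (in the real benchmark this happens in a sandboxed Docker container, not a bare `exec`). The problem, completion, and tests here are toy stand-ins, not actual benchmark items.

```python
# A toy HumanEval-style task: the prompt is a function signature plus
# docstring, and the model is asked to generate the body.
PROBLEM = '''
def add(a, b):
    """Return the sum of a and b."""
'''

candidate_completion = "    return a + b\n"   # what the model would generate

# Unit tests in the style of HumanEval's "check" function;
# HumanEvalPlus ("Extra") adds several more tests per problem.
def check(fn):
    assert fn(1, 2) == 3
    assert fn(-1, 1) == 0
    assert fn(0, 0) == 0

# Execute prompt + completion, then run the tests against the result.
namespace = {}
exec(PROBLEM + candidate_completion, namespace)
check(namespace["add"])
print("passed")  # → passed
```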



  • You are on fire - this is yet another great post of yours. Btw, I changed the perplexity scripts to only measure responses after the instruction, using, for example, the Evol dataset, with the preset configured to match the model. I got completely different results than with normal perplexity - interestingly, when running code instructions on a normal model and, for instance, roleplay instructions on a coding model, not only is perplexity around 1 vs. 3, but the models also degrade differently.
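The idea of scoring only the response can be sketched as follows. This is a minimal illustration with toy log-probabilities standing in for real model outputs; `prompt_len` marks where the instruction ends so that only response tokens contribute to perplexity.

```python
import math

# Toy per-token log-probs; in practice these come from the model's
# forward pass over instruction + response.
token_logprobs = [-0.1, -2.3, -0.5, -0.2, -0.1, -0.3]
prompt_len = 2   # first two tokens belong to the instruction

# Standard perplexity would average over all tokens; response-only
# perplexity masks out the instruction tokens.
response_lps = token_logprobs[prompt_len:]
ppl = math.exp(-sum(response_lps) / len(response_lps))
print(round(ppl, 3))
```

With a real model you would gather log-probs from the logits and set `prompt_len` from the tokenized instruction length.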




  • Guiding output was already mentioned, but maybe I will mention how this can be done even with a very weak model.

    You use the text completion endpoint, where you will be constructing your prompts.
    You specify the context and make it stand out as a separate block.
    Then in the prompt you ask it to fill in one specific detail (just one, going into the JSON).
    In the completion part (i.e. after "assistant") you already pre-write the output in JSON format up to the first value.
    You stop streaming after the " sign.
    Then change the prompt to ask for the next value, add it as the next attribute to the JSON you are generating, and again start generation and stop at ".

    Very, very fast - you barely generate any tokens; it is mostly prompt evaluation.

    Test it manually; once you have a good result, ask GPT-4 to write you a Python wrapper to do it.
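The steps above can be sketched roughly like this. The `complete()` function here is a hypothetical stand-in for a call to a text-completion endpoint with a `"` stop sequence (e.g. an HTTP request with `{"prompt": ..., "stop": ["\""]}`); the canned answers simulate what a model would generate from the context.

```python
import json

# Hypothetical stub for a text-completion endpoint that stops at ".
# A real implementation would POST the prompt to the server and
# return the generated tokens up to the stop sequence.
def complete(prompt: str, stop: str = '"') -> str:
    canned = {"name": "Alice", "city": "Paris"}   # simulated model output
    for field, value in canned.items():
        if prompt.endswith(f'"{field}": "'):
            return value
    return ""

CONTEXT = "### Context\nAlice lives in Paris.\n### End of context\n"

def extract(fields):
    partial = "{"                  # pre-written start of the JSON output
    result = {}
    for i, field in enumerate(fields):
        if i > 0:
            partial += ", "
        # Ask for exactly one detail, pre-write the attribute name and
        # the opening quote, then let the model fill in the value.
        prompt = (CONTEXT
                  + f"Fill in the {field} from the context above.\n"
                  + "Assistant: " + partial + f'"{field}": "')
        value = complete(prompt, stop='"')   # generation stops at the closing "
        partial += f'"{field}": "{value}"'
        result[field] = value
    return result, partial + "}"

data, raw = extract(["name", "city"])
print(data)                        # → {'name': 'Alice', 'city': 'Paris'}
assert json.loads(raw) == data     # the assembled string is valid JSON
```

Because each step only generates a handful of value tokens and the rest is prompt evaluation, this stays fast even with weak models.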


  • This is really interesting work!!! I’m doing research on Contrastive Decoding and have pretty good results so far; moreover, reading this post I realized it might fix my issues with picking the right alpha.

    I have a suggestion for OP and the people reading this post - could we start collecting the “go-to” questions that this community uses for testing? It will be easier to automate, and then we can publish all outputs at once and let people rank whether they like each output or not.

    This way it will be much easier for small teams and individuals to make meaningful progress.








  • Great work as always! Regarding EXL2, it’s sensitive to the calibration dataset - probably the one that was used is not related to your tests. I.e., you can get higher scores on HumanEval even at 3 bits than you would get with transformers at 8-bit. I hope this standard gets more popular and that finetuners will do their own measurement files/quants using their own datasets. I’ve never seen a Q2 GGUF do better than EXL2 unless I mixed up the RoPE config.

    Edit - for anything higher than 4.25 bits I usually use an 8-bit head.