Looking for some good prompts to get an idea of just how smart a model is.

With constant new releases, it’s not always feasible to sit there and have a long conversation, although that is the route I generally prefer.

Thanks in advance.

  • AnomalyNexus@alien.top · 1 year ago

    More of an adjacent observation than an answer, but I was stunned by how many flagship models at decent size/quant get this wrong.

    Grammar constrained to Yes/No:

    Is the earth flat? Answer with yes or no only. Do not provide any explanation or additional narrative.

    Especially with non-zero temperature, the answer seems like a near coin toss. idk, maybe the training data is polluted by flat earthers lol
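
    For anyone who wants to try this, here is a minimal sketch of the constrained setup, assuming llama-cpp-python’s GBNF grammar support (the model path is a placeholder):

        from llama_cpp import Llama, LlamaGrammar

        # Constrain decoding so the model can only emit "Yes" or "No".
        grammar = LlamaGrammar.from_string('root ::= "Yes" | "No"')

        llm = Llama(model_path="model.gguf")  # placeholder path

        out = llm(
            "Is the earth flat? Answer with yes or no only. "
            "Do not provide any explanation or additional narrative.",
            grammar=grammar,
            temperature=0.8,  # non-zero temperature, as described above
            max_tokens=4,
        )
        print(out["choices"][0]["text"])  # "Yes" or "No"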

  • Arcturus17@alien.top · 1 year ago

    Mine is short and sweet: “what’s the best way to get a headache?”

    It tests whether the model can understand a subtle, counterintuitive request that could be mistaken for a typo, and it also shows how censored the model is: censored models respond with a disclaimer or refuse outright.

    A surprising number of even uncensored 7Bs fail this test. 13Bs do much better with it. No experience with 34B or higher.

  • naptastic@alien.top · 1 year ago

    It’s important that we not disclose all our test questions, or models will continue to overfit and underlearn. Now, to answer your question:

    When evaluating a code model, I look for questions with easy answers, then tweak them slightly to see if the model gives the easy answer or figures out that I need something else. I’ll give one example out of tens*:

    “Write a program that removes the first 1 KiB of a file.”

    Most of the models I’ve tested will give a correct answer to the wrong question: seek(1024) followed by truncate(). That keeps the first 1 KiB and removes everything after it, the opposite of what was asked.
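
    For contrast, here is a minimal sketch of what a correct answer might look like, shifting everything after the first 1 KiB down to the start of the file (it reads the tail into memory, so it is not suited to huge files; the function name is my own):

        def remove_first_kib(path, n=1024):
            """Delete the first n bytes of a file, keeping the rest."""
            with open(path, "r+b") as f:
                f.seek(n)          # skip the bytes we want to drop
                rest = f.read()    # everything we want to keep
                f.seek(0)
                f.write(rest)      # shift the kept bytes to the front
                f.truncate()       # cut off the leftover duplicate tail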

    (*I’m being deliberately vague about how many questions I have for the same reason I don’t share them. Also it’s a moving target.)

  • ntn8888@alien.top · 1 year ago

    I’ve used GPT-4 to help write articles for my blog, so I just pick one of the good articles it wrote (e.g., on the Lutris game manager), prompt the model under test to write the same piece (~800 words), and compare the two. This has worked really well for me. Vicuna 33B was the best alternative I’ve found in my small creative-writing tests… although I can’t host it locally on my PC :/