Hello!

By popular demand, I am planning a fine-tune of https://huggingface.co/dreamgen/opus-v0-7b on top of Yi-34B, and I wonder whether to use the 200K variant as the base.

The regular Yi-34B seems slightly better than Yi-34B-200K on standard benchmarks, but I wonder how it “feels” and whether the loss of performance on short context is worth it, given that the regular version can be used up to 32K tokens.

(Yi-34B vs Yi-34B-200K)

Did anyone try an analysis of these 2 models on various sequence lengths (<4K, <8K, <16K, etc.)?
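
One way to frame such an analysis, as a minimal sketch: collect per-token negative log-likelihoods from each model on the same long documents, then compare perplexity per position range. The function name `perplexity_by_bucket` and the bucket edges are hypothetical and chosen only for illustration; getting the per-token losses out of a model is left out here.

```python
import math
from collections import defaultdict


def perplexity_by_bucket(token_nlls, bucket_edges):
    """Group per-token losses by position and return perplexity per bucket.

    token_nlls: per-token negative log-likelihoods in nats, where the list
        index is the token position in the sequence.
    bucket_edges: ascending exclusive upper bounds, e.g. [4096, 8192, 16384].
        A token at position p goes into the first bucket with p < edge.
        Tokens beyond the last edge are ignored.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pos, nll in enumerate(token_nlls):
        for edge in bucket_edges:
            if pos < edge:
                sums[edge] += nll
                counts[edge] += 1
                break
    # Perplexity is exp(mean NLL); skip buckets that received no tokens.
    return {
        edge: math.exp(sums[edge] / counts[edge])
        for edge in bucket_edges
        if counts[edge] > 0
    }
```

Running this once per model on identical text would show whether the 200K variant only loses ground in the early buckets or across the board. It measures perplexity only, so it says nothing about how the model "feels".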

  • m98789@alien.topB
    1 year ago

    Yi cannot be trusted on standard benchmarks, because they are easy to game by including them in the training data, and the LKF gang who built this is under heavy pressure to justify their 1 billion dollar valuation and continue to milk investors.

    The only way to really evaluate this is on some hidden benchmark never seen before and / or rigorous qualitative experiments.

    Until then, I’m not holding my breath.

    • wind_dude@alien.topB
      1 year ago

      I believe they said they’re going to release training data. We’ll see. That’s about the only way to easily verify what made it in.

    • mcmoose1900@alien.topB
      1 year ago

      I felt this too. It seems to “grab on” when you give it a longer context to continue though.

    • FullOf_Bad_Ideas@alien.topB
      1 year ago

      It’s supposed to be a base model and not an instruction-finetuned model. That’s how base models generally behave unless they are sold as base but actually finetuned (llama 2 base models).

  • DataLearnerAI@alien.topB
    1 year ago

    In most scenarios, models with extended context are optimized for long sequences. If the sequence is not very long, it is often recommended to use a regular model.

  • No-Link-2778@alien.topB
    1 year ago

    I have trained book3 for 1 day on a number of GPUs on the 200k, 34B & 6B, and it is total garbage.
    It is not a BASE model at ALL. It even identifies itself as GPT sometimes. It was an SFT model trained on the format of benchmarks.
    Try it before you do silly things. You will not notice it during SFT immediately, but you will sooner or later.

    • dogesator@alien.topB
      1 year ago

      It referring to itself as a GPT could just be from pre-training internet data if it was trained on internet data from 2023.

      • BlueMetaMind@alien.topB
        1 year ago

        It sounds rather like it was trained on ChatGPT output and they didn’t curate it enough to delete those “As a large language model trained by OpenAI…” category statements.

        It’s kinda like Shutterstock watermarks showing up in image generation.

        • dogesator@alien.topB
          1 year ago

          Yea, I’m saying that ChatGPT outputs are contained in internet posts from 2023, so simply training on 2023 internet data would end up training on ChatGPT data as a side effect.

          • BlueMetaMind@alien.topB
            1 year ago

            Yes, I understood you. My claim differs in that I think they DIRECTLY used a lot of GPT4 output through the API, which is very probable because a lot of LLM training is done that way. You ask GPT4 to generate examples of conversations with the properties you want your LLM to learn and then train on that.

            For self-identification as GPT, I don’t think that randomly crawled chat examples from the internet would be enough.

            I am not trying to make a strong claim on that, it’s just a thought. Maybe both.

  • mcmoose1900@alien.topB
    1 year ago

    I am running my story on 200K, feels the same as 4K to me (which I tried in the same setting before 200K was released).

    And honestly… Even if it is much worse (and I don’t think it is worse at all), the mega context is such a boon for storytelling.

    What I did not try was 4K stretched out with RoPE alpha or anything like that, but the 200K model does not need any stretching out to at least 42K.

  • Sabin_Stargem@alien.topB
    1 year ago

    Doctor Shotgun has made some Yi-34bs with Airoboros and LimaRP. You might want to talk with them and try out their version. Measure twice and cut once, as they say.

  • mcmoose1900@alien.topB
    1 year ago

    Random update on this, I did some more experimenting on the start of a story (with LimaRP and Petrol LoRAs), and the 4K model seems… fine? So does the 200K.

    I don’t know how to stretch out the base model. Their page claims it supports 32K, but it has a 4K context in the config and no RoPE scaling section. Just a high rope theta.

    The one difference I did notice is that the 200K model really likes to summarize and reference previous parts of the story. Maybe it was trained on retrieval or summarization examples.
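
On the "high rope theta" and "RoPE alpha" points in this thread, here is a rough sketch of why a larger theta by itself stretches the positional encoding. It uses the standard RoPE frequency definition and the commonly used NTK-aware alpha formula. The function names are hypothetical, the alpha formula is an assumption about what a given loader does, and the theta values in the usage note are illustrative rather than read from any Yi config.

```python
import math


def rope_wavelengths(head_dim, theta):
    """Wavelength in tokens of each rotary frequency pair.

    Standard RoPE uses inv_freq_i = theta ** (-2 * i / head_dim), so the
    wavelength of pair i is 2 * pi * theta ** (2 * i / head_dim).
    """
    return [
        2 * math.pi * theta ** (2 * i / head_dim)
        for i in range(head_dim // 2)
    ]


def ntk_scaled_theta(theta, alpha, head_dim):
    """Commonly used NTK-aware scaling: raise the base instead of squeezing positions.

    Assumption: the loader applies theta * alpha ** (head_dim / (head_dim - 2)).
    With this form, the longest wavelength grows by exactly a factor of alpha.
    """
    return theta * alpha ** (head_dim / (head_dim - 2))
```

For example, comparing `rope_wavelengths(128, 10_000.0)[-1]` with `rope_wavelengths(128, 5_000_000.0)[-1]` shows how much slower the lowest frequency rotates with a higher base, which may be why a model can go past its nominal context with no separate scaling section in the config.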