So RWKV 7B v5 is 60% trained now. I saw that the multilingual capabilities are better than Mistral's now, and the English capabilities are close to Mistral, except for HellaSwag and ARC, where it's a little behind. All the benchmarks are on the RWKV Discord, and you can google the pros/cons of RWKV, though most of them are about v4.

Thoughts?

  • MichalO19@alien.topB · 1 year ago

    If I am reading RWKV_v5_demo.py right, this is essentially a Retentive Network (so a linear transformer), but without the positional encoding, with the token shifts from previous RWKVs, and with trainable matrix-valued decay factors (instead of fixed decay factors like in RetNet).
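
    For intuition, here is a minimal numpy sketch of that kind of recurrence (a linear-attention state updated with per-channel decay). It is only meant to illustrate the shape of the computation, not to reproduce the actual RWKV_v5_demo.py code:

        import numpy as np

        # One recurrent step of a linear-attention memory with per-channel decay.
        # state: (d, d) matrix memory accumulating outer products k^T v
        # r, k, v: receptance/key/value vectors for the current token
        # w: decay factors in (0, 1), one per key channel (trainable in the real model)
        def linear_attn_step(state, r, k, v, w):
            state = w[:, None] * state + np.outer(k, v)  # decay old memory, write new entry
            out = r @ state                               # read the memory with the receptance
            return out, state

        d = 8
        rng = np.random.default_rng(0)
        w = 1.0 / (1.0 + np.exp(-rng.normal(size=d)))     # stand-in for trained decays
        state = np.zeros((d, d))
        for _ in range(5):                                # toy sequence of 5 tokens
            r, k, v = rng.normal(size=(3, d))
            out, state = linear_attn_step(state, r, k, v, w)
        print(out.shape)                                  # (8,)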

    Gotta say it’s a pretty clean architecture but I will believe it surpasses Mistral when I see it. I don’t think a linear transformer has a serious chance to beat a standard transformer with the same number of parameters.

    It might have a chance at general 0-shot question answering, but I expect it to be much worse at in-context learning/memory tasks, simply because softmax attention is a far more capable learning algorithm than linear attention: in theory it can learn any key -> value mapping in context, whereas linear attention can by definition only learn linear key -> value mappings (whatever that means in the embedding space), and it also risks double-writing into memory things it already knows.
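
    As a toy numpy illustration of that point (synthetic keys and values, not tied to any particular model): with random, non-orthogonal keys, a softmax read concentrates on the stored entry, while a linear-attention read (a sum of outer products) blends the other values in.

        import numpy as np

        rng = np.random.default_rng(0)
        d, n = 16, 8
        K = rng.normal(size=(n, d))          # stored keys
        V = rng.normal(size=(n, d))          # stored values
        q = K[3]                             # query exactly equal to one stored key

        # Softmax attention over the stored pairs: weights concentrate on entry 3.
        logits = K @ q / np.sqrt(d)
        w = np.exp(logits - logits.max()); w /= w.sum()
        softmax_read = w @ V

        # Linear-attention memory: S = sum_i k_i^T v_i, read with the same query.
        S = K.T @ V
        linear_read = q @ S

        cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        print(cos(softmax_read, V[3]))       # typically close to 1
        print(cos(linear_read, V[3]))        # typically noticeably lower: other values bleed in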

    But hey, let’s see.

    • cztomsik@alien.topB · 1 year ago

      I have my doubts too. RWKV-4 was great, but in practice it was always worse than any LLaMA. I think it might be because it's way more sensitive to sampling, since every token overwrites the previous state completely. So once it goes down a wrong path, it never recovers. This happens with other architectures too, but there all the data is still in the context and the model can recover; RWKV has no (previous) context to fall back on, so it can't.

      That said, RWKV is awesome and I am super excited about it. Either we can solve this problem with sampling, or we can just slap a small attention block on top of it and fine-tune them together. Either way, the future is bright in my opinion.

      Also, if you think about it, it's a miracle that such an architecture even works and manages to learn instruction following.

      Also, RWKV is great because you can "freeze" the state, save it, and then just restore it later and continue the conversation (or whatever). Together with the small memory requirements, that makes it very compelling for serving multiple users without occupying a lot of GPU memory, and instead of "engineering the prompt" you are really engineering the initial state. Obviously it's way more sensitive to fine-tuning; it will "revert" to its mood sooner.
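
      A sketch of the freeze/restore idea, assuming the interface of the reference `rwkv` pip package (`model.forward(tokens, state)` returning `(logits, state)`); the model path, strategy, and prompts below are placeholders, not anything from this thread:

          import copy, pickle
          # Assumes the reference `rwkv` pip package; adjust to your own serving code.
          from rwkv.model import RWKV
          from rwkv.utils import PIPELINE

          model = RWKV(model='RWKV-5-World-7B', strategy='cpu fp32')  # placeholder path/strategy
          pipeline = PIPELINE(model, "rwkv_vocab_v20230424")

          # Run the (possibly long) system prompt once and keep the resulting state.
          prompt_tokens = pipeline.encode("You are a helpful assistant.\n\n")
          _, saved_state = model.forward(prompt_tokens, None)
          with open("state.pkl", "wb") as f:           # persist the frozen state
              pickle.dump(saved_state, f)

          # Later, or for another user: restore the state and continue the conversation
          # without re-processing the prompt. Copy it so the saved state stays reusable.
          with open("state.pkl", "rb") as f:
              state = pickle.load(f)
          user_tokens = pipeline.encode("User: Hello!\n\nAssistant:")
          logits, state = model.forward(user_tokens, copy.deepcopy(state))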

    • vatsadev@alien.topOPB · 1 year ago

      Hmm, I will have to check this stuff with the people on the RWKV Discord server.

      V5 is stable at context usage, and V6 is trying to get better at using the context, so we might see improvement on this.

      • MichalO19@alien.topB · 1 year ago

        If I understood the original explanation for RWKV on GitHub correctly, BlinkDL agrees that softmax attention is very capable in theory, but he thinks Transformers are not using it to its full potential, so theoretically less capable architectures can beat them.

        This might be true, but I kind of doubt it. I played a bit with the 3B RWKV with a prompt like

        User: What is the word directly after "bread" in the following string "[like 20 random words]" 
        Assistant: The word directly after "bread" is "
        

        (note the ordering preferred for RWKV, with the question before the data, though I tested the other way around too) and unless the query word is very early in the string it gives me a random word. Even 1.3B transformer models seem to answer this correctly much more often (though not always).
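
        For anyone who wants to run this probe more systematically, a sketch along these lines would do; generate() is just a stand-in (here a random-word baseline) to be replaced with a call to whichever model is being tested:

            import random

            WORDS = ["apple", "river", "stone", "cloud", "tiger", "lamp", "chair",
                     "grass", "mirror", "train", "paper", "ocean", "candle", "forest",
                     "button", "silver", "meadow", "ladder", "pepper", "anchor"]

            def make_probe(rng):
                # Build one "what is the word directly after X" question.
                words = rng.sample(WORDS, k=len(WORDS))
                idx = rng.randrange(len(words) - 1)
                query, answer = words[idx], words[idx + 1]
                prompt = (f'User: What is the word directly after "{query}" in the '
                          f'following string "{" ".join(words)}"\n'
                          f'Assistant: The word directly after "{query}" is "')
                return prompt, answer

            def generate(prompt):
                # Stand-in model: answers with a random word (~5% baseline accuracy).
                # Replace this with a call to the RWKV/transformer model under test.
                return random.choice(WORDS)

            rng = random.Random(0)
            hits = 0
            for _ in range(50):
                prompt, answer = make_probe(rng)
                reply = generate(prompt)
                hits += reply.strip().strip('"').split()[0].lower() == answer
            print(f"accuracy: {hits}/50")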

    • Maykey@alien.topB · 1 year ago

      I don’t think a linear transformer has a serious chance to beat a standard transformer with the same number of parameters.

      I do. Transformers are not good at long-range tasks; they perform well there only when backed by better architectures, as in the case of MEGA.

    • nderstand2grow@alien.topB · 1 year ago

      Your comment is so insightful, thank you. If there are resources I can read/watch to learn about this stuff, I'd be happy if you could share them.

    • satireplusplus@alien.topB · 1 year ago

      The models are Apache 2.0 AFAIK; there are not that many base models that can be used commercially without restrictions.

    • _Lee_B_@alien.topB · 1 year ago

      The source is actually available (which is good), but sadly the dataset is not (which is bad, and makes it not truly open, since you can't reliably reproduce it).

        • _Lee_B_@alien.topB · 1 year ago

          “World = Some_Pile + Some_SlimPajama + Some_StarCoder + Some_OSCAR + All_Wikipedia + All_ChatGPT_Data_I_can_find”

          “some” as in customized.

      • Disastrous_Elk_6375@alien.topB · 1 year ago

        Not looking to start drama, but I feel we're moving the goalposts a bit here… Source available and under a permissive license is open source.

        I feel the discussion around training sets is too risky at this point. Everyone is doing at least gray-area stuff, using dubiously sourced material, and I feel like everyone wants to wait out some lawsuits before we can get truthful statements about datasets.

        • _Lee_B_@alien.topB · 1 year ago

          No, we’re not. Not really.

          You could call this "open source", yes, but only by a very narrow and worthless definition of the term, one which has always been controversially narrow and abusive. What people MEAN when they say open source is "like Linux". Linux is based on, and follows the principles of, Free Software:

          0) The freedom to run the program as you wish, for any purpose.
          1) The freedom to study how the program works, and change it so it does your computing as you wish. Access to the source code is a precondition for this.
          2) The freedom to redistribute copies so you can help others.
          3) The freedom to distribute copies of your modified versions to others
          -- gnu.org/philosophy
          

          When an LLM model’s weights are free, but it’s censored, you have half of freedom 0.

          When an LLM model gives you the weights, but doesn’t give you the code or the data, AND it’s an uncensored model, you have freedom 0, but none of the others.

          When you have the source code but no weights or data, you only have half of freedom 1 (you can study it, but not rebuild and run it, without a supercomputer and the data).

          When you have the source code, the weights, AND the data, you have all four freedoms, assuming that you have the compute to rebuild the weights, or can pool resources to rebuild them.

          • Disastrous_Elk_6375@alien.topB · 1 year ago

            So you list the gnu stuff, and then add “censored”, but that’s not goalpost moving? Come on.

            Freedoms 0, 1, 2 and 3 ALL apply with an Apache 2.0 license. Saying this is not open source at this point is being contrarian for the sake of being contrarian, and I have no energy to type on this subject.

            Quoting your own post from gnu.org: Take the source code, plug in C4 or RedPajama or whatever, pay for the compute, and you can get your own product, with the posted source code. I got nothing else.

        • Slimxshadyx@alien.topB · 1 year ago

          You are right, but I think a big part of open source is being able to modify it however you like.

          Without the original dataset, you can't really modify anything here except by fine-tuning.

    • EJBBL@alien.topB · 1 year ago

      I tested it. It understands Persian, but not so well, it also hallucinates people.

      • vasileer@alien.topB · 1 year ago

        it also hallucinates people

        and Mistral doesn’t?

        Keep in mind that the demo is of the 3B model, while the post is about the 7B, which I expect to be way better.

    • MoffKalast@alien.topB · 1 year ago

      Well, it seems a lot better at Slovenian than the LLaMAs or Mistral, especially for a 3B model, although it mostly just rambles about stuff that's vaguely related to the prompt and makes lots of grammatical mistakes. The 7B one ought to be interesting once it's done.

          • alchemist1e9@alien.topB · 1 year ago

          Will that make it a good translator? I remember seeing a 400+ language translation model somewhere, but it wasn't an LLM. I wonder what the best open-source, fast, high-quality many-language translation solutions might look like.

  • Aaaaaaaaaeeeee@alien.topB · 1 year ago

    Would the amount of RAM used at the end of a 16k or 32k context be less than Mistral's?

    Is the t/s the same speed as at the beginning?

    Looks like something to test in kobold.cpp later if nobody has done those tests yet.
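
    For a rough sense of what to expect before testing: an RWKV-style model keeps a fixed-size recurrent state, so memory and per-token speed should not grow with context, while a transformer's KV cache (and attention cost) grows with it. A back-of-envelope sketch in Python (the KV-cache numbers use Mistral-7B's published config; the RWKV-5 dimensions are assumed here and may not match the real 7B model):

        BYTES = 2  # fp16

        def transformer_kv_cache(seq_len, n_layers=32, n_kv_heads=8, head_dim=128):
            # K and V cached per token, per layer, per KV head (Mistral-7B GQA config).
            return 2 * n_layers * n_kv_heads * head_dim * seq_len * BYTES

        def rwkv_state(n_layers=32, d_model=4096, head_dim=64):
            # Matrix-valued memory: one (head_dim x head_dim) state per head per layer,
            # plus a few d_model-sized shift vectors; constant in sequence length.
            # These dimensions are assumptions, not the confirmed RWKV-5 7B config.
            n_heads = d_model // head_dim
            per_layer = n_heads * head_dim * head_dim + 3 * d_model
            return n_layers * per_layer * BYTES

        for n in (16_384, 32_768):
            print(f"{n:>6} tokens: KV cache ≈ {transformer_kv_cache(n) / 2**30:.1f} GiB, "
                  f"RWKV state ≈ {rwkv_state() / 2**20:.0f} MiB (independent of length)")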