I have my doubts too. RWKV4 was great, but in practice it was always worse than any LLaMA. I think it might be because it’s way more sensitive to sampling: every generated token overwrites the previous state, so once generation goes the wrong way, it never recovers. This happens with other architectures too, but there all the data is still in the context and the model can still recover; RWKV has no previous context to fall back on, so it can’t.
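To make that concrete, here is a minimal sketch in Zig of what an RWKV-style generation loop looks like (the functions rwkvStep and sampleToken are placeholders I made up, not real API): the only thing carried forward is the fixed-size state, so a badly sampled token gets folded into it and there is no token window to re-read.

    fn rwkvStep(token: u32, state: []f32, logits: []f32) void {
        // Placeholder for one forward step of the model: it consumes one
        // token and mutates the fixed-size recurrent state in place.
        _ = token;
        _ = state;
        _ = logits;
    }

    fn sampleToken(logits: []const f32) u32 {
        // Placeholder sampler (greedy, top-p, whatever).
        _ = logits;
        return 0;
    }

    pub fn generate(prompt: []const u32, state: []f32, logits: []f32, out: []u32) void {
        // Feed the prompt; everything the model "knows" about it ends up in `state`.
        for (prompt) |tok| rwkvStep(tok, state, logits);

        for (out) |*slot| {
            // Whatever gets sampled is fed back and baked into the state forever;
            // unlike a transformer, there is no context window to recover from.
            const tok = sampleToken(logits);
            slot.* = tok;
            rwkvStep(tok, state, logits);
        }
    }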
That said, RWKV is awesome and I am super excited about it. Either we solve this problem at the sampling level, or we just slap a small attention block on top of it and fine-tune them together. Either way, the future is bright in my opinion.
Also, if you think about it, it’s a miracle that such an architecture even works and manages to learn instruction following.
Also, RWKV is great because you can “freeze” the state, save it, and later just restore it and continue the conversation (or whatever). Together with the small memory requirements, this makes it very compelling for serving multiple users without occupying a lot of GPU memory, and instead of “engineering the prompt” you are really engineering the initial state. The downside is that it’s way more sensitive to fine-tuning: it will “revert” to its fine-tuned mood sooner.
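Roughly what I mean, as a sketch (the State struct and its layout are made up; the real state shape depends on the model): you run the shared prompt once, freeze the resulting state, and then hand every new conversation a cheap copy instead of re-processing the prompt.

    const std = @import("std");

    // Made-up container for the RWKV recurrent state; the real layout
    // (layers x channels) depends on the model. The point is that it is
    // small and trivially copyable.
    const State = struct {
        data: []f32,

        fn clone(self: State, allocator: std.mem.Allocator) !State {
            return .{ .data = try allocator.dupe(f32, self.data) };
        }
    };

    // Process the shared system prompt once, keep the resulting state
    // "frozen", and give every new conversation its own copy of it.
    pub fn newConversation(frozen: State, allocator: std.mem.Allocator) !State {
        return frozen.clone(allocator);
    }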
I am using Zig for my project (https://www.avapls.com/). Currently I only do inference, but so far it has been awesome and I would definitely do it again.
Zig plays nicely with anything that exposes a C interface. ATM I am using GGML, but I think that if I really needed to, I could use torch too.
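For example, importing the C header directly is basically all it takes (a sketch; the exact ggml_init_params fields and function set depend on the ggml revision you build against, so check the header you ship):

    // Zig can import the C header directly, no hand-written bindings needed.
    const c = @cImport({
        @cInclude("ggml.h");
    });

    pub fn main() void {
        // ggml_init/ggml_free have been around for a long time, but verify
        // against the ggml version you actually link.
        const params = c.ggml_init_params{
            .mem_size = 16 * 1024 * 1024,
            .mem_buffer = null,
            .no_alloc = false,
        };
        const ctx = c.ggml_init(params);
        defer c.ggml_free(ctx);
    }

You just point the build at ggml’s include dir and link the library (via build.zig or -I / -l flags) and the whole API shows up under c.*.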
I also did a node.js extension for quick prototypes and so far it has worked great. I intend to get back to it one day, but it’s currently on hold because I don’t have time.
https://github.com/cztomsik/ggml-js
BTW: some people also use Elixir for ML, e.g. https://github.com/livebook-dev/livebook