I see there is progress being made on smaller LLMs that have fewer parameters, but as I understand it, they are just trying to optimize how much information can fit in a given parameter count. Is there work being done on LLMs that are trained on less information? For example, say I want to chat with a PDF. I don’t care for my LLM to speak French, be able to write Python, or know that Benjamin Franklin wrote a paper on flatulence (all things RWKV v5 World 1.5B knows).
That is the use case for fine-tuning. Full fine-tuning is just training a little on the new dataset, but if you are not concerned with forgetting things from the original dataset, you can train more.
Also, if the new dataset is too small, you should use data augmentation techniques: have a larger LLM rephrase things, maybe train on translations, put things into QA format with another LLM, and so on. Training for many epochs with a large learning rate on a small dataset will lead to overfitting and pure exact memorization at the expense of understanding and reasoning.
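To make the QA-format idea concrete, here is a minimal sketch. `rephrase()` is a hypothetical stand-in for a call to a larger LLM; in a real pipeline you would prompt an actual model instead.

```python
def rephrase(sentence):
    # Hypothetical stub for a larger LLM's paraphrase. A real pipeline would
    # send a prompt like "Rephrase the following: ..." to a model API.
    return "In other words: " + sentence

def to_qa_pairs(sentences):
    """Turn raw document sentences into (question, answer) training pairs."""
    pairs = []
    for s in sentences:
        # Keep the original fact as the answer to a generic question.
        pairs.append(("What does the document say here?", s))
        # Add a paraphrased variant so the model sees the same fact in more
        # than one surface form, which discourages exact memorization.
        pairs.append(("Can you restate that?", rephrase(s)))
    return pairs

corpus = ["Benjamin Franklin wrote a satirical essay on flatulence."]
for q, a in to_qa_pairs(corpus):
    print(q, "->", a)
```

Each source sentence yields two training pairs here; a real augmentation pass would generate several paraphrases and question styles per fact.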
Skipping the pre-training on a large dataset, i.e. not starting from a base model, is going to give worse results. It is like trying to teach a newborn baby whatever you are trying to teach the LLM without first teaching it anything about the world (even to the point of locking it in a black box and only ever showing it your document). In fact it is likely worse: the baby's initialization has been fine-tuned by evolution, and we seem to pick things up with far less data than ANNs do.
“I want to chat with a PDF, I don’t care for my LLM to speak French, be able to write Python or know that Benjamin Franklin wrote a paper on flatulence (all things RWKV v5 World 1.5B knows).”
This is a prime use case for RAG: bring snippets in and make the model use them. The more knowledge the model has, the better it performs for your use case as well, since it knows more stuff.
Also, nice choice using RWKV v5. How's it working for you?
IBM with Granite?