Dear friends,
I decided to write because many of you are active on Hugging Face with your AI models.
I have been testing AI models continuously, 8 to 10 hours a day, for a year now. And when I say I test models, I don’t mean the way many do on YouTube for likes, with trivia-style tests: “Tell me the capital of Australia” or “Who was the tenth president of the United States?”. These tests depress me as much as they make me smile. Forty years ago my Commodore Vic 20 could already answer them in BASIC!
I test models very seriously. Being a history buff, my questions lean heavily toward history, culture, geography, and literature. My tests try, in every way, to extract accurate answers and summaries from the AI models.
Now I note with great sadness that models are trained on a lot of data, but not enough attention is paid to making sure the model can retrieve that data and return it to the user faithfully and coherently.
If we only want to use models to play with invented creative stories or poetry, everything is fine; but when we get serious, the Open Source models you can install locally seem very insufficient to me.
Furthermore, I note that the models are rarely accompanied by good configuration or preset data; the user often has to work these out through trial and error.
Another issue: the models are always generic, and there is no table of models with their actual capabilities.
More guidance is needed, for example: “This model is good for medicine”, “This one was trained on history data”, and so on.
Instead we find ourselves searching Hugging Face in an almost haphazard manner, not to say total disarray.
In plain words: since you work hard on filling models with data, please also make sure they can actually use that data and give it back to the user.
So let’s take a step forward and improve things.
Claudio from Italy
Hugging Face
Because the models you test are not built for that particular purpose, period. Though what you’re asking for could be done with your own embedding database. You should also work on your tone, and stop wasting your life.
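The “embedding database” suggestion above can be sketched in a few lines. This is a toy illustration, not any specific tool: it fakes embeddings with bag-of-words counts (a real setup would use a sentence-embedding model and a vector store), then retrieves the stored passage most similar to the question, which would be pasted into the model’s prompt as grounding context. All names and example texts here are made up.

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": bag-of-words term counts. A real pipeline would
    # replace this with a sentence-embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, documents, k=1):
    # Rank stored passages by similarity to the query; the top hits
    # become grounding context in the LLM prompt.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

history_notes = [
    "Canberra became the capital of Australia in 1913.",
    "The Vic 20 was an 8-bit home computer sold by Commodore.",
]
best = retrieve("what is the capital of Australia", history_notes)[0]
```

This is the core of what retrieval-augmented setups do: the facts live in the database you control, and the model only has to restate the retrieved passage rather than recall it from its weights.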
The base models are generic (which is a good thing; even 1000 base models wouldn’t cover the usage of every single person). The training on top of that can’t teach much new knowledge. It’s more a way of teaching the model how to use the knowledge it already has. If you want a model specialized for your usage, either you train your own or you hope that some random guy has the same usage and has already done it.
What value are you getting from testing models 10 hours a day? Unless you are greatly exaggerating those numbers?
When you do serious historical research, time passes so quickly that you don’t even notice. Unfortunately it takes time to get the model to give you the information it has stored. Even so, with effort I achieved notable results, for example the translation of a runic inscription, something that is difficult today even for specialists. This is why I say we need to spend more time making the algorithm understand its purpose. If we limit ourselves to making models just to play, it’s a shame.
disappointed by trainers
Trainers or those who do fine-tuning?
Also, they are doing it for free. It’s pointless to expect GPT-4-level performance from community models.
If you are old enough to recall Commodore programming, I would suggest a better use of your remaining time on this Earth. No, really: this tech is not yet at the level you seem to be desiring.
You should look at it as a baby: a baby doesn’t always listen to your commands, but it has other strong points.
Why do you invest so much time testing them? What is your ultimate goal?
I recommend following some fine-tuning tutorials to train a history-oriented model yourself. You can get decent results with a few megabytes of good-quality dataset about the history content you are interested in. It should be a much more interesting activity than testing models all day! If you want the model to recall intricate details, use higher-rank LoRAs or try a full fine-tune rather than parameter-efficient fine-tunes.
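The rank trade-off mentioned above falls straight out of the LoRA arithmetic. A minimal sketch, with made-up illustrative sizes (not a training recipe): LoRA freezes the base weight W and trains only two small matrices A and B, so the trainable-parameter count grows linearly with the rank r, while a full fine-tune touches every weight.

```python
import numpy as np

# Frozen base weight of one layer (the 512x512 size is illustrative).
d_out, d_in = 512, 512
W = np.zeros((d_out, d_in))

def lora_update(r, alpha=16):
    # LoRA trains B (d_out x r) and A (r x d_in); the effective weight
    # is W + (alpha / r) * B @ A. B starts at zero so training begins
    # exactly at the frozen base weight W.
    A = np.random.randn(r, d_in) * 0.01
    B = np.zeros((d_out, r))
    trainable = A.size + B.size
    return W + (alpha / r) * (B @ A), trainable

_, params_r8 = lora_update(r=8)    # low rank: cheap, limited capacity
_, params_r64 = lora_update(r=64)  # higher rank: more room for detail
full = W.size                      # full fine-tune updates every weight
```

Here rank 8 trains 8,192 parameters per layer versus 262,144 for a full fine-tune of the same layer, which is why higher rank (or a full fine-tune) helps when you need the model to absorb fine-grained facts.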
But as others have said, the open source models we have today are still far from GPT-4. Fine-tuning a small model also barely adds any new capability; it only “tunes” it to be knowledgeable in something else. These LLMs are pre-trained on trillions of tokens; a few tens of thousands more will not make them any smarter.
The beautiful thing about open source is that if you feel something is missing or can be improved, you have the power to change it. Try fine-tuning a model. Although you may find that LLMs in their current state will struggle to do what you’re aiming for. They’re not really designed to always be truthful or act as a database of historically accurate facts. They may be able to play a role in a larger system that is capable of those things, but not alone.
Maybe you overbought (like most of us) the “AI” idea. The models have, in some loose sense, compressed the internet, more or less, and then try to decompress it. As their own warnings say, you’re most likely to get out what was most often put in, so you’re only guaranteed the basics that are repeated a million times; everything else is a game of chance. The reason they have their various benchmarks is that they cannot really evaluate the way you’re trying to evaluate, with your brain. Nor can they predict how to make their models better, not even on their own benchmarks. I’d say it is common knowledge that the kind of “thinking” you’re looking for is something that has only just started to happen, with tools built on top of LLMs.
And one last thing the average consumer has not understood about the benchmarks: when their own tests move from, say, 74% to 75%, there’s no real pattern to how they do it. Maybe they tried 10 different times and 9 times it went to 73%, but they only show us the one lucky attempt. So when they climb higher and higher on their tests, they’re also committing the ancient sin of overfitting: this cycle of training and fine-tuning, rinse and repeat, ends up answering questions for the wrong reasons, but they don’t care as long as they can show their boss, or the press, a better percentage. So the models might move from 75% to 85% on their benchmarks and you might get even less of what you’re looking for. Implied in what I wrote: we need better tools to inspect explainable models, and to weed out the bad explanations with our own brains!
All models are designed in a very similar way (fine-tuned ones even more so in this regard). If you are expecting a Jarvis-like model, then I am sorry, because I think Tony Stark hasn’t been born yet 😅 Try taking another perspective on the models and their limitations, and you could actually start being comfortable with what open source gives our community.
Furthermore, I note that the models are rarely accompanied by good configuration or preset data; the user often has to work these out through trial and error.
Do you mean the prompt template? They are provided by the more popular makers of fine-tunes (the word “trainers” doesn’t sit well with me), but sometimes the documentation is lacking. When fine-tuning is as easy as it is right now, writing good documentation doubles the effort, so I understand that. I myself prefer spending time on generating datasets or tweaking fine-tuning settings rather than documenting things. It’s kind of a given that most people will prefer the fun stuff in their free time; for the vast majority of us this is a cool new hobby, not paid work, so the tedious stuff is left undone.
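For readers wondering what a prompt template actually is: here is a sketch of the common Alpaca-style format. It is only one convention among many (ChatML, [INST] tags, and others exist), and the example question is made up; always check the specific model card, because a fine-tune trained on one template often answers poorly when prompted with another.

```python
def alpaca_prompt(instruction, response=""):
    # Alpaca-style template: the model was fine-tuned to see its input
    # wrapped in these exact headers, and to generate text after
    # "### Response:". Leaving response empty asks it to complete.
    return (
        "### Instruction:\n"
        f"{instruction}\n\n"
        "### Response:\n"
        f"{response}"
    )

prompt = alpaca_prompt("Who unified Italy in 1861?")
```

Using the wrong template is one of the “calibrations” the original poster complains about: the weights are fine, but the model never sees its input in the shape it was trained on.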
As for the rest: hallucinations are a hard problem to solve. You can try something like veryLLM to reduce them a bit, but I don’t think there’s a real fix for this, or any major hobbyist community effort.
I have tried many models, from the most recent to the fastest. Lately I’ve been having the best results running utopia-13b.Q4_K_M.gguf, which offers me decent speed, a passionate tone with friendly dialogue, and above all is decidedly careful about trying to give accurate results.
What are your thoughts on the Llama 1 65B, Llama 2 70B, Mistral 7B, and Yi-34B models? I was never too fond of Llama 13B, either the first or second version, since you could always find better responses from bigger models.
I think the difference in GB between the 7B and 13B models is mainly due to how much data they can store and be queried for.
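The file-size intuition above can be put into rough numbers: a GGUF file is approximately parameter count × bits per weight. The ~4.5 bits/weight figure for Q4_K_M below is an approximation I am assuming for illustration, and the estimate ignores metadata and the mixed-precision layers that k-quants use.

```python
def gguf_size_gb(n_params_billion, bits_per_weight):
    # Rough file-size estimate: parameters x bits per weight, converted
    # to gigabytes. Real files differ somewhat because of metadata and
    # layers kept at higher precision.
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

size_7b = gguf_size_gb(7, 4.5)    # assumed ~4.5 bits/weight for Q4_K_M
size_13b = gguf_size_gb(13, 4.5)
```

So the jump from roughly 4 GB to roughly 7 GB between 7B and 13B at the same quantization is almost entirely the extra parameters, which is where any extra recallable knowledge would have to live.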