Anyone Tried Adding New Languages to Open Source AI Models? Need Advice!

nefarkederki@alien.top · 1 year ago

WaterdanceAC@alien.top · 1 year ago

I havent worked on this personally, but I like to keep an eye out for projects like this. Some resources/thoughts - dataset: https://huggingface.co/datasets/allenai/MADLAD-400 and the bilingual arabic/English project Jais found that training the model with some coding abilities proved helpful. https://www.cerebras.net/blog/jais-a-new-pinnacle-in-open-arabic-nlp Good luck!

AutomataManifold@alien.top · 1 year ago

I know there’s several projects for finetuning llama for Chinese. I haven’t worked on them but it might be worth looking in to what they did.

nefarkederki@alien.top · 1 year ago

Hey there! Thanks for the tip. I have made some research and found out this : https://github.com/ymcui/Chinese-LLaMA-Alpaca

So for those who are interested here is what I understand from what needs to be done :

If Llama2 tokenizer does not support your language, you need to expand that vocabulary first
You’ll need data for further fine tuning, and also for instruction tuning.
And you will need money :D For training