🐺🐦‍⬛ LLM Comparison/Test: 2x 34B Yi (Dolphin, Nous Capybara) vs. 12x 70B, 120B, ChatGPT/GPT-4

WolframRavenwolf@alien.top · 3 years ago

Perimeter666@alien.top · 3 years ago

Goliath is a masterpiece so far. Running it on 4x4090, the speed is OK, but still not the best.

For my taste it writes stories better than GPT4 itself, immersing deeper and avoiding useless watery poetic shit GPT4 is full of.

Just give the thing 16k context and with a 16x4096 setup it’ll be divine lol

mcmoose1900@alien.top · 3 years ago

I have… mixed feelings about Capybara’s storytelling, compared to base Yi 34B with the Alpaca LoRA?

I have been trying it with the full instruct syntax, but maybe it will work better with a hybrid instruct/chat syntax (where the whole story is in one big USER: block, and the instruction is to continue the story).

sophosympatheia@alien.top · 3 years ago

Another great contribution, Wolfram! I was pleased to see one of my 70b merges in there and it didn’t suck. More good stuff to come soon! I have a xwin-stellarbright merge I still need to upload that is hands down my new favorite for role play. I’m also excited to see what opus can do in the mix.

norsurfit@alien.top · 3 years ago

You’re doing the lord’s work, son…

metalman123@alien.top · 3 years ago

We learned that merging models absolutely works and that the 34b yi model appears to be the real deal.

(Maybe we should merge some yi fine tunes in the future)

FullOf_Bad_Ideas@alien.top · 3 years ago

I am not serious, but the results clearly suggest that what we should try next is to stack two different finetunes of Yi-34B onto each other, the same way it’s done in Goliath, and then quantize the result.

fab_space@alien.top · 3 years ago

I want to share my test with you for review and, hopefully, integration.

How does that sound?

Ok_Relationship_9879@alien.top · 3 years ago

That’s pretty amazing. Thanks for all your hard work!
Does anyone know if the Nous Capybara 34B is uncensored?

Inevitable-Start-653@alien.top · 3 years ago

God I love these posts! Thank you so much 🙏😊

drifter_VR@alien.top · 3 years ago

But why did SauerkrautLM 70B, a German model, fail so miserably on the German data protection training tests?

Does it write decent German, at least?

I ask because I tried another Llama-2-70B model fine-tuned to speak a language other than English (Vigogne-2-70b-chat), and I was disappointed by its poor writing style.

Maybe it’s my settings or the fine-tuning. Or maybe the base model is the issue (relatively small and trained mainly on English).

iChrist@alien.top · 3 years ago

I found that for a simple task like “list 10 words that end with the letters en” I get only wrong answers with the Dolphin 34B variant, while 13B Tiefighter gets it right. Am I doing something wrong with the template?

kindacognizant@alien.top · 3 years ago

> Deterministic generation settings preset

There seems to be a common fallacy that absolute 0 temperature or greedy sampling is somehow the most objective because it’s only picking the top token choice; this isn’t necessarily true, especially for creative writing.

Think about it this way: you are indirectly feeding into the model’s pre-existing biases in cases where there are many good choices. If you’re starting a story with the sentence, “One day, there was a man named”, that man could be literally any man.

On the base Mistral model, with that exact sentence, my custom debug kobold build says:

Token 1: 3.3%

Token 2: 2.4%

Token 3: 1.6%

Token 4: 1.6%

Token 5: 1.18%

Token 6: 1.15%

Token 7: 1.14%

Token 8: 1.03%

Token 9: 0.99%

Token 10: 0.98%

When the most confidence the model has in a token is 3.3%, that implies you’d want to keep the selection criteria just as diverse, because in reality that slight bit of confidence is only because it has a generic name for the top token.

Whatever the most likely token is, it is only the most likely choice for that one position given the past context window: a deterministic preset is not creating generations that are overall more coherent. In fact, it causes models to latch onto small biases caused by tokenization, which manifests as repetition bias.
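To make the point concrete, here is a rough sketch in plain Python that uses only the ten percentages quoted above. Everything else is an assumption for illustration: the rest of the vocabulary is lumped into a single "other" bucket, and sampling is done at temperature 1 with no truncation.

```python
import random

# Top-10 next-token probabilities quoted above, in percent.
top10 = [3.3, 2.4, 1.6, 1.6, 1.18, 1.15, 1.14, 1.03, 0.99, 0.98]
probs = [p / 100 for p in top10]

# Assumption: all remaining probability mass is lumped into one "other" bucket.
other = 1.0 - sum(probs)

def greedy(probs):
    """Greedy / temperature-0 choice: always the highest-probability index."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample(probs, other, rng):
    """One temperature-1 draw; -1 stands for 'some token outside the top 10'."""
    choices = list(range(len(probs))) + [-1]
    return rng.choices(choices, weights=probs + [other], k=1)[0]

rng = random.Random(0)
draws = [sample(probs, other, rng) for _ in range(10_000)]

# Share of sampled continuations that agree with the greedy pick.
top_share = draws.count(greedy(probs)) / len(draws)
print(f"top-10 mass: {sum(probs):.4f}, greedy pick sampled {top_share:.1%} of the time")
```

The ten listed tokens add up to only about 15% of the probability mass, so greedy decoding returns the same name every time even though the model itself would pick it in roughly 3 out of 100 draws.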

The Deterministic preset in ST also has a rather high repetition penalty of 1.18; this causes the model to subtly bias against things like asterisks and proper formatting, which are important to test for in a model.
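A minimal sketch of why that setting can hurt formatting tokens, treating the 1.18 value as a repetition penalty of the common kind (positive logits of already-seen tokens are divided by the penalty, negative ones multiplied). The logits and the idea that token 0 is an asterisk are made up for illustration; this is not the code of any particular backend.

```python
import math

def apply_repetition_penalty(logits, seen_token_ids, penalty=1.18):
    """Push down the logit of every token that already appeared in the context."""
    out = list(logits)
    for t in set(seen_token_ids):
        # Dividing a positive logit or multiplying a negative one both lower it.
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: token 0 is "*" (already used earlier for formatting),
# token 1 is an equally likely alternative, token 2 is a weaker candidate.
logits = [2.0, 2.0, 0.5]

before = softmax(logits)
after = softmax(apply_repetition_penalty(logits, seen_token_ids=[0]))

print("P('*') before penalty:", round(before[0], 3))
print("P('*') after penalty: ", round(after[0], 3))
```

Under greedy decoding the effect is sharper than the probabilities suggest: once the penalized asterisk drops below an alternative, it is never chosen at that position, even where the formatting calls for it.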

LosingID_583@alien.top · 3 years ago

Maybe the multiple choice questions are too easy at this point…

drifter_VR@alien.top · 3 years ago

u/WolframRavenwolf

Yet another potential benchmark :)
mirostat vs min-P

https://www.reddit.com/r/LocalLLaMA/comments/17vonjo/your_settings_are_probably_hurting_your_model_why/
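For anyone who hasn't tried min-P, here is a small sketch of the idea as it is usually described: keep every token whose probability is at least `min_p` times the top token's probability, then renormalize. It reuses the ten percentages quoted earlier in this thread; the function name and the `min_p` values are just illustrative, and real backends apply this over the full vocabulary.

```python
def min_p_filter(probs, min_p):
    """Keep tokens with prob >= min_p * max(prob); return {index: renormalized prob}."""
    threshold = min_p * max(probs)
    kept = {i: p for i, p in enumerate(probs) if p >= threshold}
    total = sum(kept.values())
    return {i: p / total for i, p in kept.items()}

# Top-10 probabilities from the "One day, there was a man named" example.
flat = [0.033, 0.024, 0.016, 0.016, 0.0118, 0.0115, 0.0114, 0.0103, 0.0099, 0.0098]

loose = min_p_filter(flat, min_p=0.1)   # flat distribution: everything survives
strict = min_p_filter(flat, min_p=0.5)  # only tokens within half of the top one

print(len(loose), "tokens kept at min_p=0.1")
print(len(strict), "tokens kept at min_p=0.5")
```

Because the cutoff scales with the top token's confidence, a flat distribution like this one keeps many candidates, while a peaked one would be trimmed to a handful.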

Kou181@alien.top · 3 years ago

While I’m only beginning to use dolphin2-yi-34b, it’s giving me good results, much more consistent and creative than any of the 7B or 13B models I’ve used so far! I’ll update this comment if I find something lacking in the future. Your reviews are really helping people like me who don’t have a beefy PC or the dedication to test tens of different models thoroughly, thank you.

Models tested:

Testing methodology

1st test series: 4 German data protection trainings

Observations:

Conclusion: