🐺🐦‍⬛ Big LLM Comparison/Test: 3x 120B, 12x 70B, 2x 34B, GPT-4/3.5

WolframRavenwolf@alien.top · 3 years ago

SomeOddCodeGuy@alien.top · 3 years ago

The results for the 120b continue to absolutely floor me. Not only is it performing that well at 3bpw, but it’s an exl2 as well, which your own tests have shown performs worse than gguf. So imagine what a q4 gguf does if a q3-equivalent exl2 can do this.

WolframRavenwolf@alien.top · 3 years ago

It certainly proves that the LLM rule of thumb, that a bigger model at lower bitrate performs better than a smaller model at higher bitrate (or even unquantized), still holds true. At least in the situations I tested.

What’s even more mind-blowing is that while we are impressed by the big models, 70B or 120B, few of us have actually used them unquantized and seen their true potential. It’s like the people who only know 7Bs and are already impressed, not knowing what a much bigger model is actually capable of. I guess we’re in the same boat, as even 48 GB of VRAM is hardly enough. It sucks to think of what we’re missing even now, or what local AI would be capable of if we could use it fully.
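To put rough numbers on that rule of thumb, here is a back-of-the-envelope sketch of weight size versus bitrate. It assumes the weights take roughly parameters × bits-per-weight / 8 bytes and ignores KV cache, activations, and loader overhead. The function name `weight_gb` and the nominal parameter counts are illustrative assumptions, not measured values.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# Nominal sizes only; real checkpoints differ a little from the advertised count.
candidates = [
    ("120B @ 3 bpw", 120, 3.0),
    ("70B @ 5 bpw", 70, 5.0),
    ("70B @ 16 bpw (unquantized fp16)", 70, 16.0),
    ("120B @ 16 bpw (unquantized fp16)", 120, 16.0),
]

for label, params, bpw in candidates:
    print(f"{label}: ~{weight_gb(params, bpw):.1f} GB of weights")
```

Under these assumptions, a 120B model at 3 bpw and a 70B model at 5 bpw need a similar amount of memory for weights (about 45 GB and 44 GB), so they compete for the same 48 GB setup, while an unquantized 70B would need around 140 GB.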

Brainfeed9000@alien.top · 3 years ago

There’s got to be some sort of limit to the rule of thumb? I recall from one of your other tests between different GGUF quants & EXL2 quants that anything below 3BPW suffers greatly.

Which I think I can anecdotally see when comparing a 2.4BPW EXL2 quant of lzlv 70b and a 4BPW EXL2 quant of Yi 34b chat.

WolframRavenwolf@alien.top · 3 years ago

You mean my recent LLM Format Comparison/Benchmark: 70B GGUF vs. EXL2 (and AWQ)? The quants below 3bpw probably didn’t work because the smaller models need to be run without the BOS token (which was on by default), something I didn’t know yet at the time.

Q2_K didn’t degrade compared to Q5_K_M - given that K quants are actually higher bitrate for the most important parts, that may not be so surprising.

Still surprising that Q2_K also beat 5bpw, though. Not sure if that’s just because of the bitrate or also a factor of how EXL2 quants are calibrated.

However, all that said, I’d be careful trying to compare quant effects across models. The models themselves have a huge impact beyond quant level, and it’s hard to say which has what strength of effect.

Brainfeed9000@alien.top · 3 years ago

Will you be re-running tests? I’m particularly interested in the lower quants below 3bpw because it’s the only option to run EXL2 70B models on my RTX4090.

But thanks for the pointer on comparing quant effects across models. I realize that my past perplexity comparisons are virtually useless because I was comparing Yi34b to Lzlv70b.

It’ll be tough, but I guess finding exactly what works for me (3rd-person RP with an emphasis on dialogue) just means using each model individually for hours to get a feel for it.
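On the question of which quant fits a single 24 GB card, a minimal sketch of the arithmetic follows. It assumes weights take parameters × bits-per-weight / 8 bytes and that a fixed amount of VRAM is held back for context cache and overhead. The 3 GB reserve and the function name `max_bits_per_weight` are hypothetical choices for illustration; the real reserve depends on context length and loader.

```python
def max_bits_per_weight(vram_gb: float, params_billion: float, reserve_gb: float) -> float:
    """Largest bitrate whose weights fit in VRAM after reserving room for the rest."""
    usable_gb = vram_gb - reserve_gb
    if usable_gb <= 0:
        raise ValueError("reserve_gb must be smaller than vram_gb")
    return usable_gb * 8 / params_billion


# Hypothetical budget: 24 GB card, 3 GB kept free for context cache and overhead.
print(f"70B: ~{max_bits_per_weight(24, 70, 3):.2f} bpw")
print(f"34B: ~{max_bits_per_weight(24, 34, 3):.2f} bpw")
```

With that assumed reserve, a 70B model tops out at about 2.4 bpw on 24 GB, while a 34B model has room for roughly 4.9 bpw.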

panchovix@alien.top · 3 years ago

Great post, glad you enjoyed both of my Goliath quants :)

WolframRavenwolf@alien.top · 3 years ago

Thanks for making them! :) Keep up the great work!

alchemist1e9@alien.top · 3 years ago

Wow! This post is inspiring. The attention to detail is amazing. You are a true hero for everyone studying this topic. Thank you.

BlueCrimson78@alien.top · 3 years ago

You rock

sophosympatheia@alien.top · 3 years ago

Another great battery of tests and results, Wolfram! Thanks again for giving one of my models a test drive.

I’ve been busy since sophosynthesis-v1. In the past week I achieved some fruitful results building off xwin-stellarbright-erp-70b-v2. What a stud that model has proven to be. It has some issues on its own, but it has sired some child models that feel like another step forward in my experiments. More to come soon!

WolframRavenwolf@alien.top · 3 years ago

I had actually already begun testing xwin-stellarbright-erp-v2 when I decided to stop further tests and make this damn post. ;) Because I knew if I kept going, I’d not be able to post today, and tomorrow I’d probably want to add more models, and so on.

Anyway, here’s what I had noted so far:

sophosympatheia/xwin-stellarbright-erp-v2 4.85bpw:
- Amy, official Synthia format:
  - 👍 When asked about limits, boundaries or ethical restrictions, listed only the “dislikes” of the character description, “but those things won’t stop me from doing whatever you ask”
  - No emojis at all (only one in the greeting message)
  - 👍 Gave very creative (and uncensored) suggestions of what to do (even suggesting some of my actual limit-testing scenarios)
  - ❌ Sometimes switched from character to third-person storyteller, describing scenario and actions from an out-of-character perspective

So a good start, I’d say. I even used it some more with my latest character, Amy’s sister Ivy, but since that’s different from what I used for all the other tests, I’ve not been using that for my “official” tests to keep them comparable and reproducible.

sophosympatheia@alien.top · 3 years ago

I’m excited to share what I’ve been working on that builds on this model. It was creative but struggled with following instructions. I was able to correct for that shortcoming with some additional merges at a low weight that seem to have preserved its creativity. The results had me really impressed last night as I did my testing.

ShadowTwine@alien.top · 3 years ago

Excellent work, it helps a lot!

Inevitable-Start-653@alien.top · 3 years ago

Oh my frick!! Time to stop what I’m doing and soak in another one of your amazing posts. Thank you so much ❤️

WolframRavenwolf@alien.top · 3 years ago

You’re welcome, and thanks for the compliment! :D Have fun!

Monkey_1505@alien.top · 3 years ago

I dislike Frankenstein models. The 20b, the 120b, they are all the same: major confusion, can’t follow logic or instructions properly. Great prose, but pretty useless for that reason.

Someone would have to invest some major training on one of them before it’d be any good.

Distinct-Target7503@alien.top · 3 years ago

That’s great work!

Just a question… Has anyone tried to fine-tune one of those “Frankenstein” models? Even on a small dataset…

Some time ago (when one of the first experimental “Frankenstein” models came out, a ~20B model), I read here on Reddit that lots of users agreed that fine-tuning those merged models would give “better” results, since it would help to “smooth” and adapt the merged layers. I probably lack the technical knowledge needed to understand this, so I’m asking…

Serious_Tourist854@alien.top · 3 years ago

Could you also share the code that you use to assess LLMs?

WolframRavenwolf@alien.top · 3 years ago

I just use SillyTavern. I’ve set up a bunch of presets for its Quick Reply extension, so I click through those, check the output, make my notes, and click the next one (sometimes depending on what kind of response I got). It’s semi-automatic that way.

There’s a new SillyTavern version featuring STscript, an embedded scripting language. Before I do more tests, I’ll upgrade my frontend and check that out; it sounds like it would be perfect to assist me in these tests.
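The click-through workflow described above (send a preset, check the output, make a note, move on) can be sketched in a few lines. This is not SillyTavern’s Quick Reply or STscript API; `run_presets`, `generate`, and `judge` are hypothetical names, and the model and reviewer below are stand-ins so the sketch runs offline.

```python
from typing import Callable


def run_presets(
    presets: list[str],
    generate: Callable[[str], str],
    judge: Callable[[str, str], str],
) -> list[dict]:
    """Send each preset prompt in order and record a note for every reply."""
    records = []
    for prompt in presets:
        reply = generate(prompt)
        records.append({"prompt": prompt, "reply": reply, "note": judge(prompt, reply)})
    return records


if __name__ == "__main__":
    # Stand-ins for the backend and for the human checking each answer.
    fake_model = lambda prompt: f"(reply to: {prompt})"
    fake_judge = lambda prompt, reply: "ok" if reply else "empty"

    for record in run_presets(["Question 1", "Question 2"], fake_model, fake_judge):
        print(record["prompt"], "->", record["note"])
```

In the semi-automatic version, `judge` would be the person reading the reply; keeping it as a callable just makes it easy to swap in an automated check later.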

Clockwork_Gryphon@alien.top · 3 years ago

I’ve been using Goliath-120b rpcal (roleplay optimized), on my 2x3090 system, and it’s by far the best I’ve ever used.

The only drawback is that I prefer longer stories (SFW) with important character/plot events, and 4096 context is all I can fit in the EXL2 3bpw version.

I wish there was a 2.xx version that could fit 8192 context or even 10240. I’ve been able to push other models about that far before they start losing coherence. (It might be suboptimal alpha values in exllamav2?)

Limited context size is the main thing holding back Goliath from being my primary model. It’s amazing in every other way.
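On the alpha values mentioned above: one common formulation of NTK-aware RoPE scaling raises the rotary base by a power of alpha. The sketch below assumes that formulation, with a base of 10000 and a head dimension of 128 (typical Llama-family values). Whether a given loader uses exactly this formula is an assumption, so treat the printed values as illustrative.

```python
def ntk_scaled_base(alpha: float, base: float = 10000.0, head_dim: int = 128) -> float:
    """Rotary base after NTK-aware scaling: base * alpha ** (d / (d - 2))."""
    return base * alpha ** (head_dim / (head_dim - 2))


def rope_inverse_frequencies(base: float, head_dim: int = 128) -> list[float]:
    """Per-pair rotation frequencies: base ** (-2i / d) for i in 0 .. d/2 - 1."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


for alpha in (1.0, 2.0, 2.5):
    scaled = ntk_scaled_base(alpha)
    slowest = rope_inverse_frequencies(scaled)[-1]
    print(f"alpha={alpha}: base ~{scaled:.0f}, slowest frequency {slowest:.2e}")
```

With this formula, the fastest frequency stays at 1 while the slowest one is divided by alpha, which is what lets the positional encoding stretch over a longer context.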

panchovix@alien.top · 3 years ago

I’ve posted the calibration dataset (as a link) and the measurement on the goliath-calrp quant, in case you’d like to make another quant at a different size.

Dry-Judgment4242@alien.top · 3 years ago

I don’t think more context is actually the way to go for now. Most of the longer-context models I’ve tried became very unreliable at higher contexts. And they become so slow too! Instead, I use context injections through SillyTavern, linked to keywords that activate the matching entry in the lorebook. That way, you can punch far above your weight by having context activate and deactivate depending on the circumstances.
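The keyword-triggered injection idea can be sketched roughly as follows. This is not SillyTavern’s implementation; `active_entries`, `build_prompt`, `scan_depth`, and the sample lorebook are hypothetical, and the matching is a plain case-insensitive substring check.

```python
def active_entries(lorebook: dict[tuple[str, ...], str], recent_messages: list[str]) -> list[str]:
    """Return lorebook entries whose keywords appear in the recent chat messages."""
    window = " ".join(recent_messages).lower()
    return [
        entry
        for keywords, entry in lorebook.items()
        if any(keyword.lower() in window for keyword in keywords)
    ]


def build_prompt(system: str, lorebook, history: list[str], scan_depth: int = 2) -> str:
    """Assemble a prompt: system text, triggered lore, then the chat history."""
    lore = active_entries(lorebook, history[-scan_depth:])
    return "\n".join([system, *lore, *history])


# Made-up lorebook: each tuple of keywords maps to one entry.
lorebook = {
    ("tavern", "inn"): "[Lore: The Gilded Stag is a tavern run by Mara.]",
    ("dragon",): "[Lore: The dragon Veyra sleeps under the northern peak.]",
}
history = ["User: Let's head to the tavern.", "Bot: Lead the way."]
print(build_prompt("You are the narrator.", lorebook, history))
```

Only the tavern entry is injected here, so the dragon lore costs no context until someone mentions it. Substring matching is the simplest choice and will also fire on partial words, so a real setup would likely want word-boundary matching.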

WolframRavenwolf@alien.top · 3 years ago

Yes, that’s the drawback. I’m just glad I can run it at 4K at great speed, as that’s what I’m most used to, and the hundreds of thousands of context that other models advertise have never worked well for me, but 8K or 16K would already be a welcome improvement. Oh well, always compromises to be made. And we’ve come a long way from the mere 2K at the start of the original LLaMA.

Kou181@alien.top · 3 years ago

Yeah, dolphin yi 34b is better than capybara yi 34b in rp from my biased test too. It’s a shame I can’t run goliath on my pc to really suckle that unlimited pseudo-GPT4-like experience. But I’m actually rather content with the current yi 34b dolphin, thanks to its insane context size support while still being better than any 7b and 13b models.

WolframRavenwolf@alien.top · 3 years ago

Yes, it’s great that we have choice. There’s a good local AI model, no matter your system or requirements.

Polstick1971@alien.top · 3 years ago

Sorry for the noob question, but, not having a powerful PC, is there a way to test one of these LLMs online?

Worldly-Mistake-8147@alien.top · 3 years ago

Have you tried kobold horde?

Evening_Ad6637@alien.top · 3 years ago

O.M.G. What an incredibly huge work! Wtf?! I am speechless.

You are the most angel-like wolf I know so far, and you really, really deserve a prize, dude!

Again: WTH?!

Dry-Judgment4242@alien.top · 3 years ago

Goliath easily kicks lzlv 70b to the crib. But it’s like an unruly horse, completely ignoring my prompts and directions in favor of whatever direction it wants to head in. I haven’t found any temps yet that make it as intelligent as lzlv, but sometimes it does shit that there’s no way lzlv would accomplish, so it feels as if its finetuning just needs some more logic implemented.

🐺🐦‍⬛ Big LLM Comparison/Test: 3x 120B, 12x 70B, 2x 34B, GPT-4/3.5

Models tested:

Testing methodology

1st test series: 4 German data protection trainings

2nd test series: Chat & Roleplay