I say:
- It has a performance hit, but it remains to be seen if going with a much larger model can compensate for that.
- The model needs to be trained from scratch; you cannot finetune an existing model for this, apparently…
I say:
I mean, you can jailbreak/browbeat ChatGPT/Claude into going against guardrails relatively easily, so I smash “X” for doubt that Grok is going to be any different. If it is, now THAT is going to be huge, if not in a way we’d like, I guess…
That explains why Goliath worked and yours - not so much, I guess…
“Prompt Template: Alpeca” Wut?
Looks like a scam, to be fair. I bet if you apply, you’ll get “Just send us $100 for access!”
Did you do post-merge training, and if so, how much?
10 s/tok and a couple kilowatts of power… OK, if it were as smart as Einstein and as unerring as an oracle it might make sense, but you can use it for free on Petals at 3 tok/sec and it is most certainly not…
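Quick back-of-envelope math on that (speed and power draw from the comment above; the electricity price and reply length are my own assumptions):

```python
# Energy cost of 10 s/token at ~2 kW. Price per kWh is an assumption.
power_w = 2000        # "a couple kilowatts"
sec_per_token = 10    # "10 s/tok"
usd_per_kwh = 0.15    # assumed electricity price

wh_per_token = power_w * sec_per_token / 3600    # ~5.6 Wh per token
kwh_per_reply = wh_per_token * 500 / 1000        # ~2.8 kWh per 500-token reply
print(f"{wh_per_token:.1f} Wh/token, ~{kwh_per_reply:.1f} kWh "
      f"(~${kwh_per_reply * usd_per_kwh:.2f}) per 500-token reply")
```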
Technically, you can somewhat automate the testing process by creating a script that makes the model answer a series of questions that are relevant to YOU and are unique (so they cannot be gamed by training for benchmarks), and then evaluate the answers yourself.
Make sure you experiment with different sampling methods and run several tests per question, due to the inherent randomness of the output.
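Something like this minimal sketch, for instance. It assumes a locally running OpenAI-compatible completion endpoint (llama.cpp server, KoboldCpp and the like expose one); the URL, questions and sampling settings are all placeholders to swap for your own:

```python
# Minimal personal-eval harness sketch. Assumes a local OpenAI-compatible
# completion endpoint; ENDPOINT, QUESTIONS and settings are placeholders.
import requests

ENDPOINT = "http://localhost:8080/v1/completions"  # assumed local server URL

# Questions that matter to YOU and appear nowhere public, so no amount of
# benchmark-targeted training can game them.
QUESTIONS = [
    "Rewrite this paragraph in the style I usually write in: ...",
    "Plan a weekend trip given my usual constraints: ...",
]

TEMPERATURES = [0.7, 1.0, 1.2]  # try different sampling settings
RUNS_PER_SETTING = 3            # several runs, since output is random

for question in QUESTIONS:
    for temp in TEMPERATURES:
        for run in range(RUNS_PER_SETTING):
            resp = requests.post(ENDPOINT, json={
                "prompt": question,
                "max_tokens": 256,
                "temperature": temp,
            })
            answer = resp.json()["choices"][0]["text"].strip()
            print(f"--- {question[:40]!r} | temp={temp} | run {run + 1} ---")
            print(answer)
# Then read the transcript and score the answers yourself: the judge (you)
# is the one part of the pipeline that can't be gamed.
```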
Please, dear Tzeentch, have someone leak GPT-4 in the general confusion; I MUST know if it is really 10 7b models in a trench coat :)
My name is Mensch. Uber Mensch.
He MUST become the CEO of Uber, too! :))))
Yea, I’ve had my “honeymoon effect” with some new/large models like, say, Falcon and even Claude: they are inherently random, and that affects quality too. I’ve had great outputs from Falcon, for instance (on Petals), but also long stretches of mediocre ones and some outright bad… and also sometimes really great and creative output from 7b Mistral, especially with enough prompt tinkering and the sampling set “just right”. Objective evaluation of LLMs is extremely hard and time-consuming!
Can we have some non-cherry-picked examples of writing?
It does not have to be highly nsfw/whatever, but a comparison of Goliath’s writing with output from its constituent models at the same settings and the same (well-crafted) prompts would be very interesting to see - preferably at least 3 examples per model, due to the inherent randomness of model output…
If you say the difference is “night and day”, it should be apparent… I’m not sceptical per se, but “writing quality” is highly subjective, and the model’s style may simply mesh better with your personal preferences?
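For what it’s worth, a blind version of that test is easy to script: same prompts, same settings, shuffle before rating. The model names and endpoint below are illustrative placeholders for whatever backend serves them:

```python
# Blind A/B sketch: same prompts, same settings, at least 3 samples per
# model, shuffled so you rate them without knowing the source.
import random
import requests

ENDPOINT = "http://localhost:8080/v1/completions"  # assumed backend URL
MODELS = ["goliath-120b", "constituent-model-a", "constituent-model-b"]
PROMPTS = ["Write the opening scene of a heist story set in ..."]
SAMPLES_PER_MODEL = 3

outputs = []
for model in MODELS:
    for prompt in PROMPTS:
        for _ in range(SAMPLES_PER_MODEL):
            resp = requests.post(ENDPOINT, json={
                "model": model,
                "prompt": prompt,
                "max_tokens": 300,
                "temperature": 0.9,  # identical settings for every model
            })
            outputs.append((model, resp.json()["choices"][0]["text"]))

random.shuffle(outputs)  # hide which model wrote what
for i, (_, text) in enumerate(outputs, 1):
    print(f"=== sample {i} ===\n{text.strip()}\n")
# Rate every sample, then unblind: if the difference really is night and
# day, it should survive this.
```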
There is no way it has “undiluted” 100k context. https://news.ycombinator.com/item?id=36374936
But yea, it IS impressive.
Given how good 7b Mistral is in my personal experience, the idea that a model 3x its size could BE GPT-3.5 Turbo no longer seems implausible.
EXTERMINATE!