Running full Falcon-180B under budget constraint

mrobo_5ht2a@alien.top · 3 years ago

Dead_Internet_Theory@alien.top · 3 years ago

That is absolutely impressive, but:

1. Is light quantization that bad? Couldn’t you run 99% of the same model for half the cost? Is running unquantized just a flex/exercise/bragging right?
2. Would quantized run faster? Slower? The same?
3. Isn’t Falcon-180B kinda… meh? I mean it’s pretty smart from size alone, but the lack of fine tuning by the community means it’s kind of like running LLaMA-70b by itself.
4. Would one of those new crazy good Threadrippers beat the GPUs? lol

mrobo_5ht2a@alien.top · 3 years ago

1. It’s not bad at all! I just wanted to see the full model. The approach can be applied to quantized models too; I just wanted the most extreme example in terms of model and context size. It only gets better from there! Light quantization + speculative decoding gets you close to real-time.
2. Quantized would run significantly faster, although I haven’t measured it extensively yet. That is because you avoid most of the data transfer, and the layers themselves take a lot less memory and run much faster.
3. The model is definitely not the best, but what was important for me was to see something that’s close to GPT-3.5 in terms of size. So I have a blueprint for running newer open source models of similar sizes.
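
To put rough numbers on point 2, here is a back-of-the-envelope sketch of the weight footprint at different precisions. It assumes a nominal 180e9 parameters (taken from the model name) and counts weights only; KV cache, activations and quantization metadata such as scales are ignored, so real figures will be somewhat higher. The names `PARAMS`, `BYTES_PER_PARAM` and `weight_gb` are made up for this illustration.

```python
# Rough weight-only memory footprint of a 180B-parameter model.
# Assumptions: nominal parameter count from the model name; no KV cache,
# activations, or quantization overhead (scales, zero points) included.

PARAMS = 180e9  # nominal parameter count (assumption)

# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {
    "fp16/bf16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Return the weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


footprints = {name: weight_gb(PARAMS, b) for name, b in BYTES_PER_PARAM.items()}

for name, gb in footprints.items():
    print(f"{name:>10}: {gb:6.0f} GB")
```

Every halving of the bytes per weight also halves the amount of data that has to be moved around for each layer, which is where most of the speedup would come from.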

JstuffJr@alien.top · 3 years ago

I bet you are really wishing OAI had gone ahead with their briefly considered idea of releasing GPT-3 open source on dev day.

mrobo_5ht2a@alien.top · 3 years ago

You got me there 😊

WinstonP18@alien.top · 3 years ago

As for point #3, have you tried Goliath-120B? If yes, how would you rate it against Falcon-180B?

mrobo_5ht2a@alien.top · 3 years ago

I haven’t run the full Goliath yet. Soon 😊

WinstonP18@alien.top · 3 years ago

I see. Please update us when you do, thanks in advance!

What is the goal