If I have multiple 7B models, where each model is trained on one specific topic (e.g. roleplay, math, coding, history, politics…), and I have an interface which decides, depending on the context, which model to use — could this outperform bigger models while being faster?
I believe this is what GPT-4 actually is.
I remember reading somewhere that it’s actually a mix of 8 different models, and it routes your question to one of them depending on its context.
Would be neat to implement on a local level though. Haven’t seen many people on the local side talk about doing this.
Lots of rumors, but tbh I think it’s highly unlikely they’re serving an MoE. MoE sparsity pays off at batch size = 1, but not at larger batch sizes: different requests get routed to different experts, so you end up keeping every expert in RAM and exercising most of them anyway, which misses the point of using an MoE.
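To put a number on the sparsity point: with top-2 routing over 8 experts, a single token only touches 2 experts, but a batch quickly touches nearly all of them. A rough sketch (assuming uniformly random, independent routing decisions, which real routers aren’t, but it shows the trend):

```python
# Rough sketch: expected number of distinct experts activated per MoE layer
# as batch size grows, assuming top-2 routing over 8 experts and (for
# simplicity) uniformly random, independent routing decisions.
from math import comb

NUM_EXPERTS = 8
TOP_K = 2

def expected_active_experts(batch_size: int) -> float:
    # Probability a given expert is NOT picked by one token's top-2 choice:
    p_not_picked = comb(NUM_EXPERTS - 1, TOP_K) / comb(NUM_EXPERTS, TOP_K)
    # Expected number of experts touched by the whole batch:
    return NUM_EXPERTS * (1 - p_not_picked ** batch_size)

for bs in (1, 2, 4, 8, 16, 32):
    print(f"batch={bs:3d}  ~{expected_active_experts(bs):.1f} of {NUM_EXPERTS} experts active")
```

So by a batch of 8–16 you’re effectively exercising every expert anyway; the per-token FLOPs stay low, but all the expert weights have to be resident, which is the RAM point.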
Lots of rumors…
Very true.
We honestly have no clue what’s going on behind ClosedAI’s doors.
I don’t know enough about MoEs to say one way or the other, so I’ll take your word on it. I’ll have to do more research on them.
Jondurbin made something like this with QLoRA.
The explanation that GPT-4 is an MoE model doesn’t make sense to me. The GPT-4 API is 30x more expensive than gpt-3.5-turbo. GPT-3.5 Turbo is 175B parameters, right? So if they had 8 experts of 220B each, it wouldn’t need to cost 30x more; for API use it would be maybe 20-50% more. There was also some speculation that 3.5 Turbo is 22B. In that case it also doesn’t make sense to me that GPT-4 would be 30x as expensive.
No, several sources, including Microsoft, have said GPT-3.5 Turbo is 20B. GPT-3 was 175B, and GPT-3.5 Turbo was about 10x cheaper on the API than GPT-3 when it came out, so that makes sense.
Yeah, if that’s the case, I can see GPT-4 requiring about 220-250B of loaded parameters to do token decoding.
Just to note: don’t read too much into OpenAI’s prices. They’re deliberately losing money as a market-capturing strategy, so it’s not guaranteed that there’s a linear relationship between what they charge for a given service and what their actual costs are.
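Pricing caveats aside, the “active parameters” arithmetic the MoE rumor implies is easy to write down. Everything below uses the speculated numbers from this thread (8x220B experts, 175B for 3.5), none of which is confirmed:

```python
# Back-of-envelope: per-token compute of a rumored 8x220B MoE with top-2
# routing vs. a dense model. All parameter counts are speculation from the
# thread above, not confirmed figures.
DENSE_PARAMS_B = 175          # rumored GPT-3.5 size (billions)
EXPERT_PARAMS_B = 220         # rumored size of one GPT-4 expert (billions)
NUM_EXPERTS = 8
TOP_K = 2                     # experts active per token

total_moe = NUM_EXPERTS * EXPERT_PARAMS_B
active_moe = TOP_K * EXPERT_PARAMS_B   # params actually used per token

print(f"total MoE params:   ~{total_moe}B (what has to be held in memory)")
print(f"active per token:   ~{active_moe}B")
print(f"compute ratio vs {DENSE_PARAMS_B}B dense: ~{active_moe / DENSE_PARAMS_B:.1f}x")
```

On those numbers the per-token compute is a low single-digit multiple of a 175B dense model, nowhere near 30x; the real cost of the MoE is having to keep all ~1,760B parameters loaded.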
Yes, this is done with a Mixture of Experts (MoE),
and we already have examples of this kind:
coding - deepseek-coder-7B is better at coding than many 70B models
answering from context - llama-2-7B is better than llama-2-13B on the OpenBookQA test
etc.
Does this use of mixture-of-experts mean that multiple 70B models would perform better than multiple 7B models?
The question was whether multiple small models can beat a single big model while also keeping the speed advantage, and the answer is yes. An example of that is MoE, which is a collection of small models all inside a single big model.
https://huggingface.co/google/switch-c-2048 is one such example
Thank you for sharing, I understand now
Big is an understatement. Please do correct me if I got it wildly wrong, but it appears to be a 3.6TB colossus.
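If you just want to see how it’s put together without touching the weights, pulling the config alone is enough (a quick sketch assuming a transformers version recent enough to know about SwitchTransformers; only config.json gets downloaded):

```python
# Inspect the Switch-C layout without downloading the multi-TB checkpoint.
# Field names follow the SwitchTransformers config in recent transformers.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("google/switch-c-2048")
print("model type:       ", cfg.model_type)    # "switch_transformers"
print("experts per layer:", cfg.num_experts)
print("d_model:          ", cfg.d_model)
print("encoder layers:   ", cfg.num_layers)
```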
Several days ago the WizardLM team promoted an interesting piece of work: https://x.com/WizardLM_AI/status/1727672799391842468?s=20. We could just utilize multiple models without re-training.
Yes. This is known as Mixture of Experts (MoE).
We already have several promising ways of doing this:
I can’t believe I hadn’t run into this. Would you indulge me on the implications for agentic systems like Autogen? I’ve been working on having experts cooperate that way rather than being combined into a single model.
This might be pedantic, but this is a field with so much random vocabulary and it’s better for folks to not be confused.
MoE is slightly different. An MoE is a single LLM with gated layers that “select” which experts to route embeddings/tokens to. It’s pretty difficult to scale and serve in practice.
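For anyone who hasn’t looked inside one, here’s a minimal toy sketch of a top-k gated MoE layer (made-up dimensions, PyTorch, not any particular model’s implementation):

```python
# Minimal sketch of a top-k gated MoE layer (toy sizes, illustration only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)   # the "gate"
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                               # x: (tokens, d_model)
        scores = self.router(x)                         # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

tokens = torch.randn(5, 64)
print(MoELayer()(tokens).shape)   # torch.Size([5, 64])
```

Note that every expert’s weights still have to be in memory even though each token only passes through a couple of them, which is a big part of why these are hard to serve.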
I think what you’re referring to is more like a model router. You can use a general LLM to “classify” a prompt and then route the entire prompt to a downstream LLM. It’s unclear if this would be faster than a 70B LLM, since you’d repeat the encoding phase and add some extra generation for the routing step, but it could certainly be better.
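A bare-bones model router along those lines might look like this; the specialist model names are just examples (the math one is a pure placeholder), and the zero-shot classifier is one arbitrary way to do the “classify” step:

```python
# Sketch of a model router: a small classifier decides which specialist
# model handles the prompt. Model names are illustrative; swap in whatever
# local models you actually run.
from transformers import pipeline

SPECIALISTS = {
    "coding":  "deepseek-ai/deepseek-coder-6.7b-instruct",
    "math":    "some-org/math-7b",                 # placeholder name
    "general": "meta-llama/Llama-2-7b-chat-hf",
}

# A small zero-shot classifier acts as the router.
router = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def route_and_answer(prompt: str) -> str:
    topic = router(prompt, candidate_labels=list(SPECIALISTS))["labels"][0]
    # In practice you'd keep these models loaded rather than rebuilding the
    # pipeline on every call; this is just the shape of the idea.
    generator = pipeline("text-generation", model=SPECIALISTS.get(topic, SPECIALISTS["general"]))
    return generator(prompt, max_new_tokens=128)[0]["generated_text"]

print(route_and_answer("Write a Python function that reverses a linked list."))
```

As noted above, you pay for the classification pass plus a full prefill on the downstream model, so it isn’t automatically faster than one big model, but the specialist can be better on its own turf.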
You can use a general LLM to “classify” a prompt and then route the entire prompt to a downstream LLM.
Why can’t you just train the “router” LLM to pick which downstream LLM to use and pass the activations to the downstream LLMs? Can’t you have “headless” (without an encoding layer) downstream LLMs? That way inference could use a (6.5B + 6.5B)-param setup with the generalizability of a 70B model.
Hmm, not sure I track what an encoding layer is. The encoding (prefill) phase involves filling the KV cache across the depth of the model, so I don’t think there’s an activation you could just pass across without model surgery + additional fine-tuning.
Take a look here:
and here to some extent, multimodal application:
Mixture of Experts where each expert has 7B parameters?
Maybe…
Still waiting for someone to use an actual ensemble: run inference over all the models and pick the max, or similar.
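A literal version of that could be as simple as averaging the next-token distributions across all the models and taking the argmax. A sketch below, with small stand-in models that share a tokenizer (any set of causal LMs with a common vocabulary would work the same way):

```python
# Sketch of logit-level ensembling: average next-token probabilities across
# several models that share a tokenizer, then greedily pick the argmax.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAMES = ["gpt2", "distilgpt2"]     # stand-ins for your specialist models

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAMES[0])
models = [AutoModelForCausalLM.from_pretrained(n).eval() for n in MODEL_NAMES]

@torch.no_grad()
def ensemble_generate(prompt: str, max_new_tokens: int = 20) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # Average the next-token distributions of all ensemble members.
        probs = torch.stack(
            [m(ids).logits[:, -1].softmax(dim=-1) for m in models]
        ).mean(dim=0)
        next_id = probs.argmax(dim=-1, keepdim=True)   # greedy "pick max"
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(ensemble_generate("The capital of France is"))
```

It’s slow, since every model runs on every token, but it’s the “inference over all models and pick max” idea, as opposed to the routing approaches above.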
I really like the idea, I think multiple 13B models would be awesome! Having them managed by a highly configurable routing model that is completely uncensored is something I want to do. I want to crowd-fund a host for this, DM me if you are interested!