Thinking about what people ask for in llama 3

vatsadev@alien.top · 3 years ago

Thinking about what people ask for in llama 3

dogesator@alien.top · 3 years ago

MoE

You gloss over “MoE just helps with FLOPS issues” as if that’s not a hugely important factor.

So many people have a 16 or 24GB GPU, or even 64GB+ MacBooks, that aren’t being fully utilized.

Sure, people can load a 30B Q5 model into their 24GB GPU or a 70B Q5 model into their 48GB+ of memory in a MacBook, but the main reason we don’t is that it’s so much slower, because it takes so many more FLOPs…

People are definitely willing to sacrifice VRAM for speed, and that’s what MoE allows you to do.

You can have a 16 sub-network MoE with 100B parameters loaded comfortably into a MacBook Pro with 96GB of memory at Q5, with the most useful 4 sub-networks activated (25B params) for any given token.

This would benchmark significantly higher than current 33B dense models when done right and act much smarter than a 33B model, while also being around the same speed as a 33B model.

It’s all-around more smarts for the same speed, and the only downside is that it uses the extra VRAM that you probably weren’t using before anyway.
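To make the numbers above concrete, here is a rough back-of-the-envelope sketch. It assumes every parameter lives in an expert (so the active share is simply 4/16) and that Q5 means exactly 5 bits per weight; real Q5 formats carry some extra overhead for scales, and real MoE models share attention layers across experts, so treat the output as a ballpark only. The function name and arguments are made up for illustration.

```python
def moe_footprint(total_params_b, n_experts, active_experts, bits_per_weight):
    """Rough weight memory and per-token active parameters for an MoE model.

    Simplifying assumption: all parameters sit inside the experts, so the
    active fraction is active_experts / n_experts.
    """
    # 1B parameters at 8 bits per weight is roughly 1 GB.
    weight_gb = total_params_b * bits_per_weight / 8
    # Only the selected experts are run for a given token.
    active_params_b = total_params_b * active_experts / n_experts
    return weight_gb, active_params_b


# The example from the comment: 100B params, 16 sub-networks, 4 active, Q5.
weight_gb, active_b = moe_footprint(100, 16, 4, 5)
print(f"weights: {weight_gb:.1f} GB, active per token: {active_b:.1f}B params")
# prints "weights: 62.5 GB, active per token: 25.0B params"
```

Under those assumptions the weights fit in 96GB with room to spare, while each token only pays for about 25B parameters of compute.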

thethirteantimes@alien.top · 3 years ago

Well, as I said in a comment in another thread, an LLM that doesn’t descend into complete gibberish after around 25 messages would be nice, as would one that consistently understands at least the system prompt. Until we can do at least those two things this will only ever be a toy IMO.

Monkey_1505@alien.top · 3 years ago

it takes up more vram than a dense model.

If you are using QLoRA, it’s not by much. The main issue is that you need another model to parse the prompt. But I could see this being useful sometimes. Maybe as an option though, rather than the default.

That’s useful, though it’s gonna be mixed with real data for model robustness.

I actually really don’t like synthetic data. It’s a great method for filtering large datasets, and perhaps augmenting them, but if you use purely synthetic data you are replicating inaccuracies and prose from the origin model that will only be exaggerated by the target model. I’d rather this was a quality control step, not a dataset producer.

Multimodality

I’m personally very eh about this. It has its uses, and I’ve used it. But LLM intelligence has a long way to go, and this could take focus away from that. Let that be a separate project IMO. I’m sure it has its uses and its fans, not knocking it. I just think open source is necessarily already behind proprietary models, and mixed focus could just make that worse.

Massive ctx len

Because of the accuracy issues involved, I’d rather they worked on smarter data retrieval like OpenAI has (it doesn’t really have the context sizes quoted, it grabs out the relevant bits). Generally speaking, for prompts, relevancy beats quantity.
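As a toy illustration of "grab the relevant bits" rather than stuffing everything into the context, here is a minimal sketch that ranks text chunks by word overlap with the query and keeps only the top few. This is not how OpenAI does it (the thread does not say); production systems typically use embedding similarity, and the function names here are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k_chunks(query, chunks, k=2):
    """Return up to k chunks sharing the most words with the query.

    Chunks with no overlap are dropped; ties keep the earlier chunk first.
    """
    query_words = tokenize(query)
    scored = []
    for index, chunk in enumerate(chunks):
        overlap = len(query_words & tokenize(chunk))
        # -index makes earlier chunks sort first among equal scores.
        scored.append((overlap, -index, chunk))
    scored.sort(reverse=True)
    return [chunk for overlap, _, chunk in scored[:k] if overlap > 0]


chunks = [
    "The router picks which experts run for each token.",
    "Quantization stores weights in fewer bits.",
    "Context length limits how many tokens the model sees.",
]
query = "how does the router choose experts"
best = top_k_chunks(query, chunks, k=1)
print(best)
```

Only the selected chunks would then be placed in the prompt, which keeps the context short and on topic.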

SlowSmarts@alien.top · 3 years ago

I made a mild wishlist in another thread - Cool things for 100k LLM

If I were making an expensive LLM from scratch, these would be some of my thoughts before spending the dough:

A very large percentage of people use OSS LLMs for roleplay or coding, might as well just bake it into the base
Most coding examples and general programming data are years old and lack knowledge of many new and groundbreaking projects and technologies; updated scrapes of coding sites need to be made
Updated coding examples need to be generated
Longer coding examples are needed that can deal with multiple files in a codebase
Longer examples of summarizing code need to be generated (like book summing, but for long scripts)
Fine tuning datasets need a lot of cleaning from incorrect examples, bad math, political or sexual bias/agendas injected by wackjobs
Older math datasets seem way more error prone than newer ones
GPT-4 is biased and that will carry through into synthetic datasets; anything from it will likely taint the LLM, however subtly, so more creative dataset cleaning is needed
Stop having datasets that contain stupid things like “As an AI…”
Excessive alignment is like sponsoring from birth a highly prized and educated genius, just to give them a lobotomy on graduation day
People regularly circumvent censorship and sensationalize “jailbreaking” it anyway, might as well leave the base model “uncensored” and advertise it as such
Cleaner datasets seem more important than maximizing the number of tokens trained
Multimodal and tool-wielding is the future, bake some cutting edge examples into the base
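The dataset-cleaning points in the list above (for example dropping samples that contain “As an AI…” boilerplate) can be sketched as a simple filter. This is a minimal illustration only: the `prompt`/`response` field names and the marker list are assumptions, and real cleaning pipelines would need far more than substring matching.

```python
# Hypothetical marker list; extend it with whatever boilerplate you find.
BOILERPLATE_MARKERS = ("as an ai",)


def is_clean(example):
    """True if the response contains none of the boilerplate markers."""
    text = example["response"].lower()
    return not any(marker in text for marker in BOILERPLATE_MARKERS)


def clean_dataset(examples):
    """Keep only the examples that pass the boilerplate check."""
    return [example for example in examples if is_clean(example)]


samples = [
    {"prompt": "Write a haiku about rain.", "response": "Soft rain on the roof"},
    {"prompt": "Tell me a joke.", "response": "As an AI, I cannot tell jokes."},
]
kept = clean_dataset(samples)
print(len(kept), "of", len(samples), "examples kept")
```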

Speaking of clean databases, have you checked out the new RedPajama-Data v2? There’s your 10T+ of clean dataset

Illustrious-Lake2603@alien.top · 3 years ago

I just want it to be better than GPT4 at coding. It can be totally dumb in everything else.

mcmoose1900@alien.top · 3 years ago

Massive ctx len

There is a happy middle ground between the current 4K context and 5000K context.

GPUs can handle ~32K-64K inference in the existing architecture just fine.
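For a feel of what 32K-64K costs in memory, here is a rough KV-cache estimate for a standard transformer. The layer and head counts below are the commonly published Llama 2 7B and 70B configurations as I understand them, and the 16-bit cache is also an assumption; plug in your own model’s values.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Approximate KV-cache size in GiB for one sequence.

    Each token stores one key and one value vector per layer and KV head.
    """
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_len / 2**30


# Assumed configs: 7B uses full multi-head attention, 70B uses grouped-query
# attention with 8 KV heads.
small = kv_cache_gib(n_layers=32, n_kv_heads=32, head_dim=128, context_len=32 * 1024)
large = kv_cache_gib(n_layers=80, n_kv_heads=8, head_dim=128, context_len=32 * 1024)
print(f"7B-style at 32K: {small:.1f} GiB, 70B-style (GQA) at 32K: {large:.1f} GiB")
# prints "7B-style at 32K: 16.0 GiB, 70B-style (GQA) at 32K: 10.0 GiB"
```

The cache grows linearly with context length, so doubling to 64K doubles these figures; fewer KV heads or a quantized cache shrink them.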

vatsadev@alien.top · 3 years ago

Well the 5 million was just an example of the OP stuff out there

Jean-Porte@alien.top · 3 years ago

Even 200m would be great (among others)

Feztopia@alien.top · 3 years ago

RWKV has its own weaknesses. I don’t think that Meta will go in that direction, and that’s good, because having different options is better.

vatsadev@alien.top · 3 years ago

It does have some, but the RWKV5 architecture is about as good as Llama 2

FPham@alien.top · 3 years ago

Not making it 180B so then I won’t be able to run it would be great for starters…

FaustBargain@alien.top · 3 years ago

if llama3 can get really really close to gpt4 after the best finetunes then I could really do some powerful autonomous agent stuff with it

dogesator@alien.top · 3 years ago

Even if it can get just halfway between GPT-3.5 and 4… that would be big in my opinion