So I was looking at some of the things people ask for in Llama 3, kind of judging them on whether they make sense or are feasible.

Mixture of Experts - Why? This is literally useless to us. MoE helps with FLOPs issues, but it takes up more VRAM than a dense model. OpenAI makes it work, but it isn’t naturally superior or better by default.
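For anyone unsure what the FLOPs vs VRAM trade-off looks like, here is a minimal sketch of top-k expert routing, assuming a toy setup where each "expert" is just a function and the router logits are given. The names `route_top_k` and `moe_forward` are made up for illustration, not taken from any library.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a short list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in order])
    return list(zip(order, weights))

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Four toy experts: all four sit in memory, but only two run per token.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]
chosen = route_top_k(logits, k=2)
output = moe_forward(1.0, experts, logits, k=2)
```

All experts have to be resident (the VRAM cost), while compute per token only scales with `k` (the FLOPs saving).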

Synthetic Data - That’s useful, though it’s gonna be mixed with real data for model robustness. The real issue I see here is collecting that many tokens. If they ripped anything near 10T from OpenAI, they would be found out pretty quick. I could see them splitting the workload over multiple accounts, also using Claude, calling multiple models (GPT-4, gpt-4-turbo), ripping data off third-party services, and using all the other data they’ve managed to collect.

More smaller models - A 1B and a 3B would be nice. TinyLlama 1.1B is really capable for its size, and better models at the 1B and 3B scale would be really useful for web and mobile inference.

More multilingual data - This is totally necessary. I’ve seen RWKV World v5, and it’s trained on a lot of multilingual data. Its 7B model is only half trained, and it already passes Mistral 7B on multilingual benchmarks. They’re just using regular datasets like SlimPajama; they haven’t even prepped the next dataset, which would actually use multilingual data like CulturaX and MADLAD.

Multimodality - This would be really useful, and probably a necessity if they want Llama 3 to “Match GPT-4”. The LLaVA work has proved that you can make image-to-text work with Llama. The Fuyu architecture has also simplified some things, considering you can just stuff modality embeddings into a regular model and train it the same way. It would be nice if you could feed multiple modalities in, as Meta already has experience with that from ImageBind and AnyMAL. It would be better than GPT-4 if it was multimodal in -> multimodal out.

GQA, sliding windows - Useful, these are the +1% architecture changes. Meta might add them if they feel like it.
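As a rough illustration of the sliding-window part, here is a sketch of a causal attention mask where each position can only look back a fixed number of tokens. The function name and the window size are assumptions for the example, not any particular model's settings.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Causal (j <= i) and limited to the last `window` positions (i - j < window).
    """
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(seq_len=5, window=3)
for row in mask:
    # 'x' marks an allowed position, '.' a masked one.
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed positions, so attention cost per token stops growing once the sequence is longer than the window.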

Massive ctx len - If they use RWKV, they can offer any ctx len they can scale to, but they might do it for a regular transformer too. Look at Magic.dev’s (not that messed up paper MAGIC!) LTM-1: https://magic.dev/blog/ltm-1, the model has a context len of 5,000,000.

Multi-epoch training, de Vries scaling laws - StableLM 3B 4E1T is still the best 3B base out there, and no other 3B base has caught up to it so far. Most people attribute it to the de Vries scaling law argument, i.e. far more data and compute than usual for the model size. Meta might have really powerful models if they followed the pattern.
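To put numbers on that, here is a back-of-the-envelope sketch using the common 6 * N * D rule of thumb for training compute. The round 3e9 parameter count is an assumption, and the epoch and token figures are simply read off the "4E1T" name.

```python
def training_flops(params, tokens_seen):
    """Rough training compute from the 6 * N * D approximation."""
    return 6 * params * tokens_seen

params = 3e9          # assumed round figure for a "3B" model
unique_tokens = 1e12  # the "1T" in the model name
epochs = 4            # the "4E" in the model name

tokens_seen = unique_tokens * epochs
flops = training_flops(params, tokens_seen)
tokens_per_param = tokens_seen / params

print(f"{flops:.2e} FLOPs, about {tokens_per_param:.0f} tokens per parameter")
```

That is on the order of a thousand tokens seen per parameter, far above the roughly 20 per parameter that compute-optimal recipes usually quote.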

Function calling / tool usage - If the models came with the ability to use some tools, and we instruction-tuned them to call any function through in-context learning, that could be really OP.
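A minimal sketch of what the calling side could look like, assuming the model is prompted to emit a JSON object with `name` and `arguments` keys. That format, the `TOOLS` registry and `run_tool_call` are all hypothetical, not an existing API.

```python
import json

# Hypothetical tool registry; names and signatures are illustrative only.
TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda text: text.upper(),
}

def run_tool_call(model_output):
    """Parse a JSON tool call emitted by the model and execute it."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return {"error": "output was not valid JSON"}
    if not isinstance(call, dict):
        return {"error": "tool call must be a JSON object"}
    name = call.get("name")
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    try:
        return {"result": TOOLS[name](**call.get("arguments", {}))}
    except TypeError as exc:
        # Wrong or missing arguments end up here.
        return {"error": str(exc)}

result = run_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}')
```

The error dictionaries can be fed back into the context so the model gets a chance to correct a malformed call.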

Different Architecture - RWKV is a good one to try, but if Meta has something better, they may shift away from transformers to something else.

  • dogesator@alien.topB
    1 year ago
    • MoE

    You gloss over “MoE just helps with FLOPS issues” as if that’s not a hugely important factor.

    So many people have a 16 or 24GB GPU, or even 64GB+ MacBooks, that aren’t being fully utilized.

    Sure, people can load a 30B Q5 model into their 24GB GPU or a 70B Q5 model into the 48GB+ of memory in a MacBook, but the main reason we don’t is that it’s so much slower, because it takes so many more FLOPs…

    People are definitely willing to sacrifice vram for speed and that’s what MoE allows you to do.

    You can have a 16-subnetwork MoE with 100B parameters loaded comfortably into a MacBook Pro with 96GB of memory at Q5, with the most useful 4 subnetworks activated (25B params) for any given token. This would benchmark significantly higher than current 33B dense models when done right, and act much smarter than a 33B model while running at around the same speed.

    It’s all-around more smarts for the same speed, and the only downside is that it uses the extra VRAM that you probably weren’t using before anyway.
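The arithmetic behind the 100B / 16 experts / 4 active example can be sketched like this, assuming a flat 5 bits per weight for Q5 (real Q5 formats carry some extra overhead) and ignoring any parameters shared across experts.

```python
def weights_gb(params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

total_params = 100e9
num_experts = 16
active_experts = 4
bits_per_weight = 5  # assumption; actual Q5 quantization uses slightly more

resident_gb = weights_gb(total_params, bits_per_weight)           # must fit in memory
active_params = total_params * active_experts / num_experts      # drives per-token FLOPs

print(f"{resident_gb:.1f} GB resident, {active_params / 1e9:.0f}B params active per token")
```

Under those assumptions the weights fit in a 96GB machine with room left for the KV cache, while per-token compute is that of a 25B model.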

  • thethirteantimes@alien.topB
    1 year ago

    Well, as I said in a comment in another thread, an LLM that doesn’t descend into complete gibberish after around 25 messages would be nice, as would one that consistently understands at least the system prompt. Until we can do at least those two things this will only ever be a toy IMO.

  • Monkey_1505@alien.topB
    1 year ago

    it takes up more vram than a dense model.

    If you are using QLoRA, it’s not by much. The main issue is that you need another model to parse the prompt. But I could see this being useful sometimes. Maybe as an option though, rather than the default.

    That’s useful, though its gonna be mixed with real data for model robustness.

    I actually really don’t like synthetic data. It’s a great method for filtering large datasets, and perhaps augmenting them, but if you use purely synthetic data you are replicating inaccuracies and prose from the origin model that will only be exaggerated by the target model. I’d rather this was a quality control step, not a dataset producer.

    Multimodality

    I’m personally very eh about this. It has its uses, and I’ve used it. But LLM intelligence has a long way to go, and this could take focus away from that. Let that be a separate project IMO. I’m sure it has its uses, and its fans, not knocking it - I just think open source is necessarily already behind proprietary models, and mixed focus could just make that worse.

    Massive ctx len

    Because of the accuracy issues involved, I’d rather they worked on smarter data retrieval like OpenAI has (it doesn’t really have the context sizes quoted, it grabs out the relevant bits). Generally speaking, for prompts, relevancy beats quantity.
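A toy sketch of the "grab the relevant bits" idea, assuming plain bag-of-words cosine similarity. Real retrieval setups typically use learned embeddings; the function names and sample chunks here are made up for illustration.

```python
import math
from collections import Counter

def bag_of_words(text):
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two word-count vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def top_chunks(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:k]

chunks = [
    "the cat sat on the mat",
    "llama models use rotary position embeddings",
    "context length is limited by the kv cache",
]
best = top_chunks("how long can the context length be", chunks, k=1)
```

Only the top-scoring chunks go into the prompt, so prompt size depends on `k` rather than on how much text is stored.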

  • SlowSmarts@alien.topB
    1 year ago

    I made a mild wishlist in another thread - Cool things for 100k LLM

    If I were making an expensive LLM from scratch, these would be some of my thoughts before spending the dough:

    • A very large percentage of people use OSS LLMs for roleplay or coding, might as well just bake it into the base
    • Most coding examples and general programming data are years old and lack knowledge of many new and groundbreaking projects and technologies; updated scrapes of coding sites need to be made
    • Updated coding examples need to be generated
    • Longer coding examples are needed that can deal with multiple files in a codebase
    • Longer examples of summarizing code need to be generated (like book summing, but for long scripts)
    • Fine tuning datasets need a lot of cleaning from incorrect examples, bad math, political or sexual bias/agendas injected by wackjobs
    • Older math datasets seem way more error prone than newer ones
    • GPT-4 is biased and that will carry through into synthetic datasets; anything from it will likely taint the LLM, however subtly; more creative dataset cleaning is needed
    • Stop having datasets that contain stupid things like “As an AI…”
    • Excessive alignment is like sponsoring a highly prized and educated genius from birth, just to give them a lobotomy on graduation day
    • People regularly circumvent censorship and sensationalize “jailbreaking” it anyway, might as well leave the base model “uncensored” and advertise it as such
    • Cleaner datasets seem more important than maximizing the number of tokens trained
    • Multimodal and tool-wielding is the future, bake some cutting edge examples into the base
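As a small sketch of the dataset-cleaning points above, here is a filter that drops fine-tuning examples whose response contains boilerplate phrases. The marker list and the `prompt`/`response` field names are assumptions; a real cleaning pass would need far more than substring matching.

```python
# Assumed marker list; extend it for whatever boilerplate shows up in the data.
BOILERPLATE_MARKERS = ("as an ai",)

def is_clean(example):
    """True when the response contains none of the boilerplate markers."""
    text = example["response"].lower()
    return not any(marker in text for marker in BOILERPLATE_MARKERS)

dataset = [
    {"prompt": "Sort a list in Python", "response": "Use sorted(items)."},
    {"prompt": "Tell me a story", "response": "As an AI language model, I cannot..."},
]
cleaned = [example for example in dataset if is_clean(example)]
```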

    Speaking of clean databases, have you checked out the new RedPajama-Data v2? There’s your 10T+ of clean dataset

  • mcmoose1900@alien.topB
    1 year ago

    Massive ctx len

    There is a happy middle ground between the current 4K context and 5000K context.

    GPUs can handle ~32K-64K inference in the existing architecture just fine.
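For a rough sense of the memory involved, here is a sketch of the KV cache size at 32K context. The model shape (32 layers, head dimension 128, fp16 cache) is an assumed 7B-class configuration, and the 8-KV-head variant shows what GQA would change.

```python
def kv_cache_gib(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Size of the key/value cache for one sequence, in GiB."""
    # Factor 2 covers keys and values; bytes_per_value=2 assumes fp16.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token / 2**30

context = 32 * 1024
# Assumed 7B-class shape: 32 layers, head_dim 128.
full_mha = kv_cache_gib(context, layers=32, kv_heads=32, head_dim=128)
with_gqa = kv_cache_gib(context, layers=32, kv_heads=8, head_dim=128)

print(f"32K context: {full_mha:.0f} GiB with 32 KV heads, {with_gqa:.0f} GiB with 8")
```

Under these assumptions the cache, not the weights, becomes the main memory cost at long context unless the number of KV heads is reduced.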

  • Feztopia@alien.topB
    1 year ago

    RWKV has its own weaknesses. I don’t think that Meta will go in that direction, and that’s good, because having different options is better.

  • FPham@alien.topB
    1 year ago

    Not making it 180B so then I won’t be able to run it would be great for starters…

  • FaustBargain@alien.topB
    1 year ago

    If Llama 3 can get really, really close to GPT-4 after the best finetunes, then I could really do some powerful autonomous agent stuff with it.

    • dogesator@alien.topB
      1 year ago

      Even if it can get just halfway between GPT-3.5 and 4… that would be big in my opinion.