https://arxiv.org/abs/2311.10770

“UltraFastBERT”, apparently a variant of BERT that uses only 0.3% of its neurons during inference, performs on par with comparable BERT models.
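
As far as I can tell from the paper, the trick is replacing the dense feedforward layers with tree-structured “fast feedforward” layers that only evaluate the neurons along one root-to-leaf path per token. Here is a minimal sketch of that conditional-execution idea; the hard routing rule, the class name, and the parameter shapes are my assumptions, not the paper’s exact formulation:

```python
import torch
import torch.nn as nn

class FastFeedForwardSketch(nn.Module):
    """Sketch of a conditionally executed feedforward layer: route each token
    down a binary tree and apply only the small leaf MLP it lands in, instead
    of the full dense hidden layer."""

    def __init__(self, d_model: int, depth: int, leaf_width: int):
        super().__init__()
        self.depth = depth
        num_leaves = 2 ** depth
        # one routing vector per internal tree node (2^depth - 1 of them)
        self.node_w = nn.Parameter(torch.randn(2 ** depth - 1, d_model) * 0.02)
        # one tiny MLP per leaf; only one of these runs per token at inference
        self.w_in = nn.Parameter(torch.randn(num_leaves, leaf_width, d_model) * 0.02)
        self.w_out = nn.Parameter(torch.randn(num_leaves, d_model, leaf_width) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, d_model)
        node = torch.zeros(x.shape[0], dtype=torch.long, device=x.device)  # start at the root
        for _ in range(self.depth):
            # hard left/right decision at the current node
            go_right = ((x * self.node_w[node]).sum(-1) > 0).long()
            node = 2 * node + 1 + go_right
        leaf = node - (2 ** self.depth - 1)  # index of the leaf each token reached
        # evaluate only the selected leaf's neurons (leaf_width of them per token)
        h = torch.relu(torch.einsum('bd,bkd->bk', x, self.w_in[leaf]))
        return torch.einsum('bk,bdk->bd', h, self.w_out[leaf])
```

If I’m reading the abstract right, the released model engages just 12 out of 4095 neurons per layer at inference, which is where the ~0.3% figure comes from.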

I hope that’s going to be available for all kinds of models in the near future!

  • MoffKalast@alien.topB · 1 year ago

    Would be interesting to see if this can help speed up CPU inference with regular RAM; after all, 128 GB of DDR5 only costs around $300, which is peanuts compared to getting anywhere close to that much VRAM.

    If it scales linearly, then one could run a 100B model at the speed of a 3B one right now.
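
    The linear-scaling intuition makes sense because CPU inference is mostly memory-bandwidth bound: tokens per second is roughly bandwidth divided by the bytes of weights actually read per token. A rough sketch of that estimate (the bandwidth and quantization figures below are assumptions for illustration, not measurements):

    ```python
    # Rough bandwidth-bound estimate: tokens/s ~= RAM bandwidth / bytes of weights read per token.
    ram_bandwidth_gb_s = 80        # assumed dual-channel DDR5 figure, for illustration only
    bytes_per_param = 1            # assuming ~8-bit quantized weights

    for active_params_b in (3, 7, 100):   # billions of parameters actually touched per token
        bytes_per_token = active_params_b * 1e9 * bytes_per_param
        tokens_per_s = ram_bandwidth_gb_s * 1e9 / bytes_per_token
        print(f"{active_params_b:>3}B active params -> ~{tokens_per_s:.1f} tokens/s")
    ```

    So if conditional execution means only a ~3B-sized slice of a 100B model's weights gets read per token, the 100B model really would run at roughly 3B speed.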

    • OnurCetinkaya@alien.topB · 1 year ago

      I am just gonna do some bad maths.

      For the price of a single 4090 you can get:

      • CPU + mainboard combo with 16 RAM slots: $1,320

      • 16 x 32 GB DDR4 RAM: $888

      • Total: 512 GB of RAM

      Mistral 7B runs at around 7 tokens per second on a regular CPU, which is about 5 words per second.

      With the above setup's 512 GB of RAM we could fit a 512B-parameter model. On the current architecture that would run at about 5 × 7 / 512 ≈ 0.068 words per second; if this new architecture actually works and gives a 78x speedup, that becomes about 5.3 words per second. The average person reads at around 4 words per second and speaks at around 2 words per second.
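
      The same arithmetic as a quick script (the 5 words/s baseline and the 78x figure are the ones above; fitting a 512B model in 512 GB assumes roughly one byte per parameter):

      ```python
      # Back-of-the-envelope check of the numbers above (all inputs are rough estimates).
      baseline_words_per_s = 5      # Mistral 7B on a regular CPU (~7 tok/s ~= 5 words/s)
      baseline_params_b = 7
      target_params_b = 512         # fits in 512 GB of RAM at roughly 1 byte per parameter
      claimed_speedup = 78          # the speedup figure quoted for this architecture

      dense = baseline_words_per_s * baseline_params_b / target_params_b
      sparse = dense * claimed_speedup
      print(f"512B, current architecture: {dense:.3f} words/s")   # ~0.068
      print(f"512B, with 78x speedup:     {sparse:.1f} words/s")  # ~5.3 (reading ~4, speaking ~2)
      ```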

      Fingers crossed this can put a small dent in Nvidia’s stock price.

      • MoffKalast@alien.topB · 1 year ago

        I doubt it; most of their leverage is in being the only suppliers of the hardware required for pretraining foundational models. This doesn’t really change that.

      • fallingdowndizzyvr@alien.topB · 1 year ago

        Fingers crossed this can put a small dent in Nvidia’s stock price.

        If it works that way, it will only be short term, since the only reason this doesn’t run well on GPUs is the conditional matrix ops. The GPU makers will just add support for them, and then they’ll be back on top with the same margins again.
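
        For anyone wondering what “conditional matrix ops” means here: each input row wants its own small slice of the weight matrix, which breaks the one big dense GEMM that current GPU kernels are built around. A toy numpy illustration (the shapes and the random routing are made up for the example):

        ```python
        import numpy as np

        batch, d_model, n_neurons, active = 4, 8, 64, 2
        x = np.random.randn(batch, d_model)
        W = np.random.randn(n_neurons, d_model)

        # Dense FFN: one big matmul, every neuron for every row -- what GPUs are great at.
        dense_out = x @ W.T                                    # (batch, n_neurons)

        # Conditional version: each row uses only its own tiny subset of neurons, so the
        # work turns into per-row gathers plus small matmuls instead of one large GEMM.
        chosen = np.random.randint(0, n_neurons, size=(batch, active))  # stand-in for learned routing
        cond_out = np.einsum('bd,bkd->bk', x, W[chosen])                # (batch, active)
        ```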

        Also, they say the speedup decreases with more layers, so the bigger the model, the less the benefit. A 512B model is much bigger than a 7B model, so the speedup will be much less, possibly none.