Found out about air_llm (https://github.com/lyogavin/Anima/tree/main/air_llm), which loads one layer at a time, so each layer only needs about 1.6GB for a 70b with 80 layers. There's about 30MB for the KV cache, and I'm not sure where the rest goes.
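
For anyone wanting to sanity-check the per-layer figure, here is a rough back-of-the-envelope sketch. It assumes fp16 weights (2 bytes per parameter) and that all 70B parameters are split evenly across the 80 layers, which ignores embeddings and the output head; the function name is just made up for illustration.

```python
def per_layer_bytes(n_params: float, n_layers: int, bytes_per_param: float = 2.0) -> float:
    """Rough weight size of one layer, assuming parameters are spread evenly."""
    return n_params * bytes_per_param / n_layers

# Assumption: 70B parameters, 80 layers, fp16 (2 bytes each).
layer_bytes = per_layer_bytes(70e9, 80)
layer_gb = layer_bytes / 1e9     # decimal gigabytes
layer_gib = layer_bytes / 2**30  # binary gibibytes

print(f"{layer_gb:.2f} GB per layer ({layer_gib:.2f} GiB)")
```

Under those assumptions it comes out to roughly 1.75 GB, or about 1.63 GiB, which is in the same ballpark as the 1.6GB quoted above.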

Apparently it works with HF out of the box too. The weaknesses appear to be context length, and it's going to be slow, but anyway, does anyone want to try Goliath 120B unquantized?

  • radianart@alien.topB
    1 year ago

    Is there a better way to use models bigger than what fits in RAM/VRAM? I'd like to try 70b or maybe even 120b, but I only have 32/8GB.

    • TheTerrasque@alien.topB
      1 year ago

      70b? Q4, llama.cpp, some layers on GPU.

      Might need to run Linux to get the system RAM usage low enough.
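
A rough budget for the 32GB RAM / 8GB VRAM case can be sketched like this. Everything here is an assumption for illustration: about 4.5 bits per weight for a Q4-style quant, 80 equally sized layers, and 1.5GB of VRAM held back for the KV cache and scratch buffers. The helper names are hypothetical and not part of llama.cpp.

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9

def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float,
                  reserve_gb: float = 1.5) -> int:
    """How many equally sized layers fit in VRAM after holding back reserve_gb."""
    per_layer_gb = model_gb / n_layers
    fit = int((vram_gb - reserve_gb) // per_layer_gb)
    return max(0, min(n_layers, fit))

# Assumptions: 70B parameters, ~4.5 bits per weight, 80 layers, 8GB VRAM.
model_gb = quantized_size_gb(70e9, 4.5)
n_gpu = layers_on_gpu(model_gb, 80, vram_gb=8.0)
ram_gb = model_gb - n_gpu * (model_gb / 80)

print(f"model ~{model_gb:.1f} GB, {n_gpu} layers on GPU, ~{ram_gb:.1f} GB left for system RAM")
```

With these numbers the part left for system RAM lands right around the 32GB mark, which fits the remark above about keeping system RAM usage as low as possible. The layer count is the kind of value you would pass to llama.cpp's `--n-gpu-layers` (`-ngl`) option, then adjust by trial.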