Is this accurate?

  • CardAnarchist@alien.topB · 1 year ago

    Can you offload layers with this, like with GGUF?

    I don’t have much VRAM/RAM, so even when running a 7B I have to partially offload layers.
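
    For reference, the GGUF-style partial offload being asked about looks roughly like this with llama-cpp-python; this is only a sketch, and the model path and layer count are placeholders:

    ```python
    # GGUF-style partial offload: keep n_gpu_layers on the GPU, the rest in system RAM.
    # Model path and layer count are placeholders; tune n_gpu_layers to your VRAM.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=20,  # layers offloaded to the GPU; remaining layers stay on CPU
        n_ctx=2048,
    )

    out = llm("Q: What does partial offloading do? A:", max_tokens=64)
    print(out["choices"][0]["text"])
    ```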

    • fallingdowndizzyvr@alien.topB · 1 year ago

      I’m the opposite. I shun everything LLM that isn’t command line when I can. Everything has its place: when dealing with media, a GUI is the way to go, but when dealing with text, the command line is fine. I don’t need animated pop-up bubbles.

    • ReturningTarzan@alien.topB · 1 year ago

      I’m a little surprised by the mention of chatcode.py, which was merged into chat.py almost two months ago. Also, it doesn’t really require flash-attn-2 to run “properly”; it just runs a little better that way. But it’s perfectly usable without it.

      Great article, though. Thanks. :)

      • mlabonne@alien.topB · 1 year ago

        Thanks for your excellent library! That makes sense, because I started writing this article about two months ago (chatcode.py is still mentioned in the README.md, by the way). I was getting very low throughput using ExLlamaV2 without flash-attn-2. Do you know if that’s still the case? I’ve updated these two points; thanks for your feedback.
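
        A quick way to check whether a flash-attn 2.x build is actually present in the environment (a small sketch; it only assumes the flash_attn package exposes __version__):

        ```python
        # Check for flash-attn and whether it is a 2.x build; the thread above
        # suggests ExLlamaV2 still runs without it, just with lower throughput.
        try:
            import flash_attn
            major = flash_attn.__version__.split(".")[0]
            print(f"flash-attn {flash_attn.__version__} (v{major})")
        except ImportError:
            print("flash-attn not installed")
        ```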

  • llama_in_sunglasses@alien.topB · 1 year ago

    I’ve tested pretty much all of the available quantization methods, and I prefer exllamav2 for everything I run on GPU: it’s fast and gives high-quality results. If anyone wants to experiment with different calibration parquets, I’ve taken a portion of the PIPPA data and converted it into various prompt formats, along with a portion of the Synthia instruction/response pairs, also converted into different prompt formats. I’ve only tested them on OpenHermes, but they did make coherent models that all produce different generation output from the same prompt.

    https://desync.xyz/calsets.html
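
    For anyone rolling their own calibration set, a sketch of the workflow: build a parquet of prompt-formatted text rows, then point exllamav2’s convert.py at it. The "text" column and the convert.py flags are my assumptions based on the wikitext-style parquets the converter uses by default, so double-check them against the repo’s README:

    ```python
    # Build a small calibration parquet from prompt-formatted samples.
    # The "text" column name is an assumption (wikitext-style layout).
    import pandas as pd

    samples = [
        "### Instruction:\nSummarize the passage below.\n### Response:\n...",
        "### Instruction:\nWrite a short poem about GPUs.\n### Response:\n...",
    ]
    pd.DataFrame({"text": samples}).to_parquet("my_calibration.parquet")

    # Then quantize with something like (shell, flags assumed from the repo):
    #   python convert.py -i /path/to/fp16-model -o /tmp/exl2-work \
    #       -c my_calibration.parquet -b 4.0
    ```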

    • randomfoo2@alien.topB · 1 year ago

      I think ExLlama (and ExLlamaV2) is great. EXL2’s ability to quantize to arbitrary bpw and its incredibly fast prefill processing generally make it the best real-world choice for modern consumer GPUs. However, from testing on my workstations (5950X CPU, 3090/4090 GPUs), llama.cpp actually edges out ExLlamaV2 for inference speed (with a q4_0 even beating a 3.0bpw), so I don’t think it’s quite so cut and dried.

      For those looking for maximum batch=1 performance, I’d highly recommend running your own benchmarks on your own system to see what works best (and pay attention to prefill speed if you often use long contexts)!

      My benchmarks from a month or two ago: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYpb63e1ZR3aePczz3zlbJW-Y4/edit#gid=1788227831
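
      If anyone wants to reproduce a rough batch=1 number themselves, here is a timing sketch against exllamav2’s basic generator API; the model directory is a placeholder, and the API may have moved since this was written:

      ```python
      # Rough batch=1 generation-speed check with exllamav2's base generator.
      # The model directory is a placeholder; tokens/s here includes prompt
      # processing, so treat it as a ballpark rather than a precise benchmark.
      import time
      from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
      from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

      config = ExLlamaV2Config()
      config.model_dir = "/models/openhermes-exl2-4.0bpw"  # placeholder
      config.prepare()

      model = ExLlamaV2(config)
      cache = ExLlamaV2Cache(model, lazy=True)
      model.load_autosplit(cache)
      tokenizer = ExLlamaV2Tokenizer(config)

      generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
      settings = ExLlamaV2Sampler.Settings()
      generator.warmup()

      new_tokens = 256
      t0 = time.time()
      output = generator.generate_simple("Once upon a time,", settings, new_tokens)
      print(f"{new_tokens / (time.time() - t0):.1f} tokens/s")
      ```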

      • tgredditfc@alien.topB · 1 year ago

        Thanks for sharing! I have been struggling with the llama.cpp loader and GGUF (using oobabooga and the same model): no matter how I set the parameters or how many layers I offload to the GPUs, llama.cpp is way slower than ExLlama (v1 and v2), not just a bit slower but an order of magnitude slower. I really don’t know why.

  • JoseConseco_@alien.topB · 1 year ago

    So how much VRAM would be required for a 34B or a 14B model? I assume no CPU offloading, right? With my 12 GB of VRAM, I guess I could only fit 14-billion-parameter models, and maybe not even that.
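
    As a rough back-of-the-envelope, weight memory is about parameters × bits-per-weight / 8, with the context cache and activations on top of that. A small sketch of the arithmetic:

    ```python
    # Weight-only VRAM estimate: params * bpw / 8 bytes; real usage is higher
    # once the KV cache and activations are added.
    def weight_gb(params_billion: float, bpw: float) -> float:
        return params_billion * bpw / 8

    for params in (14, 34):
        for bpw in (3.0, 4.0, 5.0):
            print(f"{params}B @ {bpw}bpw ≈ {weight_gb(params, bpw):.1f} GB for weights alone")
    ```

    By that estimate, a 14B model at 4-5 bpw leaves some headroom in 12 GB, while a 34B would need under roughly 2.8 bpw just for the weights to fit.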

  • ModeradorDoFariaLima@alien.topB · 1 year ago

    Too bad that Windows support for it was lacking (at least, the last time I checked). It needs a separate component to work properly, and that component was Linux-only.

    • ViennaFox@alien.topB · 1 year ago

      It works fine for me. I’m also using a 3090 and text-gen-webui, like Liquiddandruff.

  • lxe@alien.topB · 1 year ago

    Agreed. Best performance I’ve had running GPTQs. It’s missing the HF samplers, but that’s OK.

    • ReturningTarzan@alien.topB · 1 year ago

      I recently added Mirostat, min-P (the new one), tail-free sampling, and temperature-last as an option. I don’t personally put much stock in having an overabundance of sampling parameters, but they’re there now, for better or worse. As for the exllamav2 (non-HF) loader in TGW, it can’t be long before there’s an update that exposes those parameters in the UI.
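
      For anyone driving the library directly in the meantime, those options land on the sampler settings object, roughly like this (attribute names are my best guess at the current exllamav2 settings and may have changed; check the repo):

      ```python
      # Sketch of the newer sampling options on exllamav2's sampler settings.
      # Attribute names are assumptions based on the options listed above.
      from exllamav2.generator import ExLlamaV2Sampler

      settings = ExLlamaV2Sampler.Settings()
      settings.temperature = 0.8
      settings.min_p = 0.05             # min-P ("the new one")
      settings.tfs = 0.95               # tail-free sampling
      settings.mirostat = True          # Mirostat
      settings.mirostat_tau = 5.0
      settings.mirostat_eta = 0.1
      settings.temperature_last = True  # apply temperature after the other filters
      ```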