First time testing a local text model, so I don’t know much yet. I’ve seen people with 8GB cards complaining that text generation is very slow, so I don’t have much hope, but still… I think I need to do some configuration: when generating text my SSD is at 100% usage, reading 1~2 GB/s, while my GPU doesn’t reach 15% usage.
Using RTX 2060 6GB, 16GB RAM.
This is the model I am testing (mythomax-l2-13b.Q8_0.gguf): https://huggingface.co/TheBloke/MythoMax-L2-13B-GGUF/tree/main

  • Saofiqlord@alien.topB · 1 year ago

    Your issue is using Q8. Be real, you only have 6GB of VRAM, not 24.

    Your hardware can’t run Q8 at a decent speed.

    Use Q4_K_S; you can offload much more to the GPU. There’s degradation, yes, but it’s not so bad.
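
    A minimal sketch of that, assuming llama-cpp-python built with GPU (CUDA) support; the filename and layer count are placeholders to adjust for a 6GB card:

    ```python
    # Load a smaller quant and offload layers to the GPU (hypothetical filename/values).
    from llama_cpp import Llama

    llm = Llama(
        model_path="mythomax-l2-13b.Q4_K_S.gguf",  # smaller quant than Q8_0
        n_gpu_layers=30,   # layers offloaded to VRAM; lower this if 6GB runs out
        n_ctx=4096,        # context window
    )

    out = llm("### Instruction:\nSay hello.\n### Response:\n", max_tokens=64)
    print(out["choices"][0]["text"])
    ```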

  • uti24@alien.topB · 1 year ago

    SSD is at 100% reading 1~2 GB/s

    If your SSD is swapping, then the model does not fit into RAM.

    Use a smaller quant, like Q4_K_M from your own link.
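
    A minimal sketch of grabbing that file with the huggingface_hub Python package (the local_dir is just an example path):

    ```python
    # Download the Q4_K_M quant from the same repo (example destination directory).
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id="TheBloke/MythoMax-L2-13B-GGUF",
        filename="mythomax-l2-13b.Q4_K_M.gguf",
        local_dir="models",
    )
    print(path)
    ```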

  • YearZero@alien.topB · 1 year ago

    I think 13B Q8 is just cutting it really close with your 6GB VRAM and 16GB RAM. You’d be much better off using the Q6 quant, and anything below that would definitely be fine.

    Look at the model card; TheBloke lists RAM requirements for each quant (without context). Since this model uses a 4096-token context, you would add another 1-2 GB to those requirements.

    You might have some luck if you allocate the right amount in the parameters (right now you’re allocating 0 layers to the GPU), but definitely play with lower quants; you wouldn’t even notice the quality loss until you get down to maybe Q3.
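
    A back-of-envelope sketch of that math; the GGUF file sizes below are approximate figures for a 13B model, not exact numbers from the model card:

    ```python
    # Rough totals: quantized file size plus ~2 GB of context overhead at 4096 tokens.
    approx_file_gb = {"Q8_0": 13.8, "Q6_K": 10.7, "Q5_K_M": 9.2, "Q4_K_M": 7.9}
    context_overhead_gb = 2.0  # rough extra for a 4096-token context

    for quant, size in approx_file_gb.items():
        print(f"{quant}: ~{size + context_overhead_gb:.1f} GB vs 6 GB VRAM + 16 GB RAM")
    ```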

    • OverallBit9@alien.topOPB · 1 year ago

      Testing Q5 seems like the best fit, at least for the GPU I use, but only on MythoMax; I’m not sure if other models would be the same.

  • Civil_Ranger4687@alien.topB · 1 year ago

    Never use the Q8 versions of GGUFs unless most/all of the model can comfortably fit into your VRAM. The Q6 version is much smaller and almost the same quality.

    For your setup, I would use mythomax-l2-13b.Q4_K_M.gguf.

    • OverallBit9@alien.topOPB · 1 year ago

      In my tests Q4 gives me about the same tokens per second as Q5, so I decided to use Q5. It’s my first time testing text gen locally with models; thank you very much for explaining. I’m getting used to it now and understanding what the settings do.

      • Civil_Ranger4687@alien.topB · 1 year ago

        Yeah, there’s so much to learn; I’m still figuring a lot out too.

        Good tip for settings: Play around mostly with temperature, top-p, and min-p.
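
        A minimal sketch of tweaking those with llama-cpp-python (min_p needs a reasonably recent version; the filename and values are just example starting points, not recommendations):

        ```python
        # Generate with explicit sampling settings (example values).
        from llama_cpp import Llama

        llm = Llama(model_path="mythomax-l2-13b.Q5_K_M.gguf", n_gpu_layers=30, n_ctx=4096)

        out = llm(
            "### Instruction:\nWrite one sentence about dragons.\n### Response:\n",
            max_tokens=128,
            temperature=0.8,  # higher = more varied output
            top_p=0.95,       # nucleus sampling cutoff
            min_p=0.05,       # drop tokens far less likely than the top token
        )
        print(out["choices"][0]["text"])
        ```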

  • aseichter2007@alien.topB · 1 year ago

    So, you don’t have enough RAM to fit that model. It’s actually overrunning your RAM entirely and using the wrong kind of VRAM: virtual RAM, a.k.a. paged memory on disk.

    Idk what you’re trying to do, but the best answer is OpenHermes 2.5 Mistral 7B Q3 at 4k context or similar, or maybe Rocket 3B Q6, which would be even faster.

    Hermes is king. I understand why you want that model, but 13B Q8 is huge: 17GB-ish of memory at 8k context.

    It will speed up if you get it off the hard drive at least; try a Q3_K_L if you’re determined to run MythoMax.