With a 24GB video card, single card system, what is the best LLM that utilizes Exllama2? (for RPG/Chat)

cleverestx@alien.top · 1 year ago

Why can't we get a 20-34b version of this very capable Mistral?

cleverestx@alien.top · 1 year ago

I have an RTX 4090, 96GB of RAM and an i9-13900K CPU, and I still keep going back to 20b (4-6bpw) models because of the awful performance of 70b models, even at 2.4bpw, which is supposed to fit fully in VRAM… and even using Exllama2…

What is your trick to get better performance? Unless I use a small, lame context of 2048, generation speed is actually unusable (under 1 token/sec). What context are you using, and what settings? Thank you.
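For reference, here is a rough back-of-the-envelope sketch of why a 70b model at 2.4bpw is expected to fit on a 24GB card, and how context length eats into the remaining headroom. The layer count, KV-head count, and head dimension are assumptions taken from the Llama 2 70B architecture (the post does not name a specific model), the cache is assumed to be stored in 16-bit, and the function names are made up for illustration.

```python
def weight_gib(params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


def kv_cache_gib(context_tokens, layers=80, kv_heads=8, head_dim=128,
                 bytes_per_value=2):
    """Approximate KV cache size in GiB.

    Defaults are ASSUMED from Llama 2 70B (80 layers, 8 KV heads,
    head dimension 128) with a 16-bit cache.
    """
    # Keys and values each hold layers * kv_heads * head_dim values per token.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 2**30


if __name__ == "__main__":
    weights = weight_gib(70, 2.4)
    for ctx in (2048, 4096, 8192):
        total = weights + kv_cache_gib(ctx)
        print(f"context {ctx}: ~{total:.1f} GiB (weights + KV cache only)")
```

The estimate covers only weights and KV cache. Activations, loader overhead, and whatever the desktop is already holding in VRAM come on top, so the real margin on a 24GB card can be tighter than these numbers suggest.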

cleverestx@alien.top · 1 year ago

So far with the local models, I've just done storybook-format RPGing, without a game system, dice rolls, etc., which I used to do with ChatGPT…

Do you have a prompt template that gamifies it and works well for you, and would you be willing to share it?
