I’ve been using self-hosted LLMs for roleplay. But these are the worst problems I face every time, no matter which model or parameter preset I use.
I’m using:
Pygmalion 13B AWQ
Mistral 7B AWQ
SynthIA 13B AWQ [Favourite]
WizardLM 7B AWQ
1. It mixes up who’s who, and often starts behaving like the user.
2. It writes in a third-person or narrative perspective.
3. It sometimes generates the exact same reply, word for word, back to back even though new inputs were given.
4. It starts generating a dialogue or screenplay script instead of a normal conversation.
Does anyone have any solutions for these?
7b is way too dumb to roleplay right now. 13b is the bare minimum for that specific task.
The exceptions I’d make are OpenHermes 2.5 7b and OpenChat 3.5 7b, both pretty good Mistral finetunes. I’d use them over a lot of 13b models. Are they approaching the level of the 34b/70b models? No, you can easily tell they aren’t, but they’re not stupidly dumb anymore.
It’s because 7B sucks
It’s just a low-parameter problem. If you’ve got the RAM for it, I highly suggest dolphin-2_2-yi-34b. Especially now that koboldcpp has context shifting, you don’t have to wait for all that prompt reprocessing. Also be sure you’re using an instruct mode like Roleplay (which is Alpaca format) or whatever that LLM’s official prompt format is.
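If you haven’t seen the Alpaca format before, it looks roughly like this when you build it by hand; the system line, character card text and names in this sketch are just placeholders, not any preset’s exact wording:

```python
# Rough sketch of an Alpaca-style roleplay prompt, assembled by hand.
# The system line, character card and names are placeholders, not the
# exact wording of any particular preset.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n"
    "Continue the chat as {char}. Stay in character and write only {char}'s reply.\n\n"
    "{card}\n\n"
    "{history}\n\n"
    "### Response:\n"
    "{char}:"
)

def build_prompt(char: str, card: str, history: list[str]) -> str:
    """Fill the template with the character card and recent chat turns."""
    return ALPACA_TEMPLATE.format(char=char, card=card, history="\n".join(history))

print(build_prompt(
    "Aria",
    "Aria is a sarcastic starship pilot stranded on a desert moon.",
    ["User: Where are we headed?", "Aria: Somewhere your map has never heard of."],
))
```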
I’m using TheBloke/U-Amethyst-20B-GGUF, and of all the popular models between 13b and 33b I found it’s the sweet spot. Not many regenerations needed, very good for roleplaying, doesn’t overdo the storytelling, and holds up the character card really well.
Upgrade to a 70b setup and watch your problems disappear, plus SillyTavern swiping.
Which 70b do you recommend? Any loras?
What is SillyTavern swiping?
I’m gonna say you can remedy this problem even with 7b or 13b models to an extent, but you’ll need to shift most of your game’s logic to the backend (the programming side and database) and feed the model the gist of your game-state representation as text with each prompt (use templates or simple paraphrasing for this part).
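A minimal sketch of what I mean, with made-up state fields and template text:

```python
# Minimal sketch of the idea: keep the real game state in your backend and
# only feed the model a short text summary of it each turn. The state
# fields and template below are made up for illustration.
game_state = {
    "location": "abandoned lighthouse",
    "time_of_day": "night",
    "inventory": ["rusty key", "oil lamp"],
    "npc_mood": "suspicious",
}

STATE_TEMPLATE = (
    "[Scene: {location}, {time_of_day}. "
    "The user carries: {inventory}. "
    "{char} is currently {npc_mood}.]"
)

def render_state(state: dict, char: str = "Aria") -> str:
    """Paraphrase the structured state as one short line of prompt context."""
    return STATE_TEMPLATE.format(
        location=state["location"],
        time_of_day=state["time_of_day"],
        inventory=", ".join(state["inventory"]),
        npc_mood=state["npc_mood"],
        char=char,
    )

# Prepend the rendered state to every prompt so a small model doesn't have
# to dig it out of the chat history on its own.
prompt = render_state(game_state) + "\nUser: I knock on the door.\nAria:"
print(prompt)
```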
For 1 and 2, apply grammar sampling to force the LLM to start all of its replies with:
<your character’s name>:
This will “force” the LLM to write dialogue as the specified character. It won’t work 100% of the time, but failures become a very rare event.
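If you want to try it outside a front-end, here’s roughly how it looks with llama-cpp-python; I’m assuming its LlamaGrammar API, and the model path and character name are placeholders:

```python
# Sketch of grammar sampling with llama-cpp-python: constrain every reply
# to start with the character's name and a colon. Assumes a recent
# llama-cpp-python with LlamaGrammar; model path and name are placeholders.
from llama_cpp import Llama, LlamaGrammar

GBNF = r'''
root ::= "Aria:" line ("\n" line)*
line ::= [^\n]+
'''

llm = Llama(model_path="synthia-13b.Q4_K_M.gguf", n_ctx=4096)
grammar = LlamaGrammar.from_string(GBNF)

out = llm(
    "User: Where are we headed?\n",
    grammar=grammar,   # generation must match the grammar above
    max_tokens=200,
)
print(out["choices"][0]["text"])  # always starts with "Aria:"
```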
where did you learn about this?
I’m really struggling to wrap my head around the intuition that goes into imposing grammars on LLM generation.
Very cool trick, thanks.
I use a custom front-end and append the character name / colon to the end of all my prompts to force this; I wonder if grammar sampling would be better. I don’t really have my head around grammar sampling yet.
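For comparison, the suffix trick is just string concatenation, something like this (the name is a placeholder):

```python
# The non-grammar version of the same trick: end the prompt with the
# character's name and a colon so the model continues as that character.
def with_reply_prefix(history: str, char: str = "Aria") -> str:
    """Append 'Name:' so the next generated text is that character's line."""
    return history.rstrip() + f"\n{char}:"

prompt = with_reply_prefix("User: Where are we headed?")
# Send `prompt` to your backend; strip the leading "Aria:" from the output
# if your front-end adds the name back when displaying the reply.
```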
Does anyone have any solutions for these?
Use a high quality model.
That means not 7B or 13B.
I know a lot of other people have already said this in the thread, but this keeps coming up in this sub so I’m just gonna say it too.
Bleeding edge 7B and 13B models look good in benchmarks. Try actually using them and the first thing you should realize is how poorly benchmark results indicate real world performance. These models are dumb.
You can get started on runpod by depositing as little as $10, which is less than some fast food meals; just take the plunge and find out for yourself. If you use an RTX A6000 48GB they’ll only charge you $0.79 per hour, so you get quite a few hours of experimenting to feel the difference for yourself. With 48GB VRAM you can run Q4_K_M quants of 70B with full GPU offloading, or try Q5_K_M or even Q6 or Q8 if you tweak the number of layers you’re offloading to fit within 48GB (and still get fast enough generations for interactive chat).
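The offloading part is a single knob if you go through llama-cpp-python, for example; the file name and numbers in this sketch are placeholders, just to show the idea:

```python
# Sketch of the offloading knob with llama-cpp-python. The file name and
# settings are placeholders; lower n_gpu_layers from "all" if a bigger
# quant (Q5_K_M/Q6/Q8) plus the KV cache doesn't fit in 48GB.
from llama_cpp import Llama

llm = Llama(
    model_path="llama2-70b.Q4_K_M.gguf",
    n_gpu_layers=-1,   # -1 offloads every layer to the GPU
    n_ctx=4096,        # context length also costs VRAM via the KV cache
)

out = llm("### Instruction:\nIntroduce yourself in character.\n\n### Response:\n",
          max_tokens=128)
print(out["choices"][0]["text"])
```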
The difference is just absolutely night and day. Not only do 70Bs rarely make the basic mistakes you are describing, sometimes they even surprise me in a way that feels “clever.”
Can confirm what the other people in here are saying about 70b models having far fewer of these problems. At least that’s my experience as well.
What you highlighted as problems are the reasons why people fork out money for the compute to run 34b and 70b models. You can tweak sampler settings and prompt templates all day long but you can only squeeze so much smarts out of a 7b - 13b parameter model.
The good news is better 7b and 13b parameter models are coming out all the time. The bad news is even with all that, you’re still not going to do better than a capable 70b parameter model if you want it to follow instructions, remember what’s going on, and stay consistent with the story.
No, the problems described are not representative of Mistral 7B quality at all. That’s almost certainly just incorrect prompting, format-wise.
Since I started using 70B, I have never encountered these problems again. It is that much better.
I have an RTX 4090, 96GB of RAM and an i9-13900K CPU, and I still keep going back to 20b (4-6bpw) models due to the awful performance of 70b models, even though a 2.4bpw quant is supposed to fit fully in VRAM… even using ExLlamaV2…
What is your trick to get better performance? If I don’t use a tiny context of 2048, generation speed is actually unusable (under 1 token/sec). What context are you using, and what settings? Thank you.
Those all sound like the typical symptoms of feeding too much generated content back into the context buffer. Limit the dynamic part of your context buffer to about 1k tokens. At least that’s been my experience using 13B models as chatbots. With exllama you just add “-l 1280”. Other systems should offer similar functionality.
If you want to get fancy, you can fill the rest of the context with whatever backstory you want.
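If your front-end doesn’t have a switch for it, the trimming itself is simple to roll yourself; here’s a rough sketch of the idea (the token count is a crude estimate, a stand-in for whatever tokenizer your backend actually uses):

```python
# Sketch of the same idea: keep the fixed backstory and only as many recent
# chat turns as fit a ~1k-token budget for the dynamic part of the context.
# The words-to-tokens ratio is a rough stand-in for your real tokenizer.
def approx_tokens(text: str) -> int:
    return int(len(text.split()) / 0.75)

def build_context(backstory: str, turns: list[str], budget: int = 1000) -> str:
    """Return backstory plus the newest turns that fit within the budget."""
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):          # walk from newest to oldest
        cost = approx_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return backstory + "\n" + "\n".join(reversed(kept))

print(build_context(
    "Aria is a sarcastic starship pilot stranded on a desert moon.",
    ["User: hi", "Aria: What now?", "User: Where are we headed?"],
))
```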