While messing around with Yi-34B based models (Nous-Capybara, Dolphin 2.2) lately, I’ve been seeing repetition in model output, where sections of previous outputs show up again in later generations.
This persists with both GGUF and EXL2 quants, and happens regardless of sampling parameters or Mirostat Tau settings.
I was wondering if anyone else has experienced similar issues with the latest finetunes, and if they were able to resolve the issue. The models appear to be very promising from Wolfram’s evaluation, so I’m wondering what error I could be making.
Currently using Text Generation Web UI with SillyTavern as a front-end, with either Mirostat at Tau values between 2 and 5, or the Midnight Enigma preset with Rep. Penalty at 1.0.
I encounter this a lot with the Yi 34B models, to the point where I’ve basically stopped using them for chat. I’ve tried a huge variety of settings, presets, quants, etc. I’ve used koboldcpp and text-generation-webui, and I’ve used EXL2, GGML, and GPTQ. The issue appears consistently after the context grows past a certain size. Partial or entire messages will repeat. It will also get stuck, where regenerating always produces the same response unless drastic changes to the settings are made, and usually that just changes which message it’s stuck on. Smaller changes to the settings only change the wording of the stuck message slightly.
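If anyone wants to compare settings with something more objective than "it feels stuck", here is a rough sketch that measures how much of a new reply is copied from earlier messages. The function names and the 4-word window are my own choices for illustration, not something any of the front-ends mentioned here provide.

```python
def ngrams(text, n=4):
    """Return the set of n-word sequences in text (lowercased, whitespace split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def repetition_ratio(new_message, previous_messages, n=4):
    """Fraction of the new message's n-grams that already appeared earlier.

    0.0 means nothing was reused, 1.0 means every n-gram was seen before.
    Messages shorter than n words have no n-grams and score 0.0.
    """
    new = ngrams(new_message, n)
    if not new:
        return 0.0
    seen = set()
    for msg in previous_messages:
        seen |= ngrams(msg, n)
    return len(new & seen) / len(new)
```

Running it over the last few replies at different context lengths should show whether the ratio jumps once the chat passes a certain size.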
Did you try disabling the BOS token?
Yes, the BOS token is disabled in my parameters.
No issues here, just a lot of confidence on certain tokens but overall very little repetition. I use Koboldcpp, Q5_K_M. Don’t abuse temp; the model seems to be exceedingly sensitive, and the smallest imbalance breaks its flow. Try temp 0.9, rep pen 1.11, top k 0, min-p 0.1, typical 1, tfs 1.
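For anyone unsure what those numbers actually do, here is a minimal sketch of three of them (rep pen, temp, min-p) applied to a toy list of logits. It is only an illustration: the function names are made up, the order the samplers run in is my assumption, and real backends such as koboldcpp apply them in their own (often configurable) order. Top k 0, typical 1 and tfs 1 are the "off" values for those samplers, so they are left out.

```python
import math


def apply_repetition_penalty(logits, previous_tokens, penalty=1.11):
    """Make already-used token ids less likely (divide/multiply rule)."""
    out = list(logits)
    for tok in set(previous_tokens):
        # Positive logits are divided and negative ones multiplied, so the
        # token always loses probability when penalty > 1.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p=0.1):
    """Drop tokens whose probability is below min_p times the top probability."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


def next_token_distribution(logits, previous_tokens,
                            temperature=0.9, rep_pen=1.11, min_p=0.1):
    """Assumed order for this sketch: rep pen, then temperature, then min-p."""
    penalized = apply_repetition_penalty(logits, previous_tokens, rep_pen)
    probs = softmax(penalized, temperature)
    return min_p_filter(probs, min_p)
```

The point of min-p is that the cutoff scales with the model's confidence: when one token dominates, most of the tail is cut, and when the distribution is flat, more candidates survive.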
I see, the model does tend to run a bit hot as-is. I’ll go ahead and try these settings out tomorrow.
I’ll have to try these settings. I have OP’s problem too, and I always have to crank the temperature up to get it to work. Then it gets schizophrenia a few messages later. Thanks!
High temp does more harm than good. I would suggest looking into what the other settings do before raising it, no matter the model.
I pretty much gave up trying to make Yi-based models actually use more than 4k context. At that point I’d rather just use Lzlv 70b, which is much smarter, with better prose and knowledge.
The repetition issue pretty much makes the models unusable past the context where it breaks.
Agreed - I’m personally using 70B models at 2.4BPW EXL2 quants, as well. They hold up great even at a small quantization as long as sampling parameters are set correctly, and the models are subjectively more pleasant in prose (Euryale 1.3 and LZLV both come to mind).
At 2.4BPW, they fit into 24GB of VRAM and inference is extremely fast, and EXL2 also appears to be very promising as a quantization method. I believe the potential upsides are yet to be fully leveraged.
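A quick back-of-the-envelope check on the 24GB figure, assuming roughly 70 billion weights and treating 2.4 as the average bits per weight over the whole model. This counts the weights only; the KV cache and activations come on top, so it is a lower bound rather than a full VRAM estimate.

```python
def weight_footprint_gb(n_params, bits_per_weight):
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def headroom_gb(vram_gb, n_params, bits_per_weight):
    """VRAM left over for context (KV cache) and overhead after loading weights."""
    return vram_gb - weight_footprint_gb(n_params, bits_per_weight)


if __name__ == "__main__":
    # 70e9 is an assumed round number for a "70B" model.
    print(f"weights:  {weight_footprint_gb(70e9, 2.4):.1f} GB")
    print(f"headroom: {headroom_gb(24, 70e9, 2.4):.1f} GB")
```

That works out to about 21 GB of weights, which leaves around 3 GB on a 24GB card for context, so how much context actually fits will depend on the cache settings.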
On EXL2, when it started doing that, I cranked the temp to 2.0 rather than using dynamic temperature. That made it go away. Going to try higher rep pen next and see what happens. I’m at 8k context and it’s doing it.
I had high hopes for Yi-34B chat, but when I tried it, I found it is not very good.
70B models are better (well of course), but I think even some 20B models are better.
I am having better luck with 2.4BPW EXL2 quants of 70B models from Lone_Striker lately - Euryale 1.3, LZLV, etc.
Even at the smaller quants, they are quite strong at the correct settings. Easily comparable to a 34B at Q4_K_M, from my experience.