People talk about it around here like this is pretty simple (these days at least). But once I hit about 4200-4400 tokens (with my limit pushed to 8k) all I get is gibberish. This is with the LLaMA2-13B-Tiefighter-AWQ model, which seems highly regarded for roleplay/storytelling (my use case).
Oddly enough, I also tried OpenHermes-2.5-Mistral-7B and it was nonsensical from the very start.
I’m using SillyTavern with Oobabooga, sequence length set to 8k in both, and a 3090. I’m pretty new to all of this, and it’s been difficult finding up-to-date information (because things develop so quickly!). The term fine-tuning comes up a lot, and with it comes a whooooole lot of complicated coding talk I know nothing about.
As a layman, is there a way to achieve 8k (or more) context for a roleplay/storytelling model?
If you never set the RoPE base (or alpha) higher, the model will just have its stock context (4k for llama2).
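For anyone curious what that dial actually does, here’s a rough sketch of the idea, not the loader’s exact code. RoPE rotates each pair of query/key channels by an angle derived from a base frequency, and raising that base is what stretches the usable context (head dimension and stock base below are llama2’s):

```python
import numpy as np

# Rough sketch of why context "runs out": RoPE rotates each channel pair i
# by angle position / base**(2*i/dim). The model only ever trained on the
# angle combinations produced by positions 0..4095 (llama2's stock 4k), so
# positions beyond that produce patterns it has never seen -> gibberish.
dim = 128        # llama2 attention head dimension
base = 10000.0   # stock RoPE base

inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)
stock_angles = 4095 * inv_freq   # edge of what the model knows
novel_angles = 4400 * inv_freq   # roughly where the gibberish starts

# Raising the base (what the "alpha" setting does under the hood) shrinks
# inv_freq, pulling those novel positions back inside the trained range.
```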
Does anyone have any hints on using exllamav2 with extended context length on GPTQ weights?
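Not a definitive recipe, but exllamav2 reads GPTQ safetensors directly, and loading with a raised alpha looks roughly like this through its Python API (the model path is a placeholder, and attribute names may shift between exllamav2 versions):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "models/LLaMA2-13B-Tiefighter-GPTQ"  # placeholder path
config.prepare()

config.max_seq_len = 8192        # raise the context window
config.scale_alpha_value = 2.65  # NTK RoPE alpha for ~2x llama2 context

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # cache sized to max_seq_len
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
```

In Oobabooga the same two knobs are exposed on the ExLlamav2 loader as max_seq_len and alpha_value, so you shouldn’t need to touch Python at all if you’re loading there.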
I’m wondering too. OpenHermes 2.5 works fine for me on Oobabooga, but it just stops outputting tokens once it reaches 4k context despite having everything set for 8k (I’m running GGUF offloaded to the GPU).
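In case it helps debugging: with the llama.cpp loader, both the context allocation and the RoPE base are fixed at load time, so it’s worth checking what the loader actually received. A minimal llama-cpp-python sketch (the path and values are illustrative, not a confirmed fix; note that Mistral-based models like OpenHermes are natively 8k and shouldn’t need a raised base at all):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama2-13b-tiefighter.Q4_K_M.gguf",  # placeholder path
    n_ctx=8192,            # context the loader actually allocates
    n_gpu_layers=-1,       # offload all layers to the 3090
    rope_freq_base=26900,  # raised base for a llama2 model at 8k; omit this
                           # for Mistral-based models (native 8k context)
)

out = llm("Once upon a time,", max_tokens=200)
print(out["choices"][0]["text"])
```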
For llama2 models, set your alpha to 2.65 when loading them at 8k.
The general suggestion is “2.5”, but if you plot the NTK scaling formula on a graph, 8192 context lines up with alpha ≈ 2.642, so 2.65 is more accurate than 2.5.
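For what it’s worth, under the NTK-aware convention that exllama/Oobabooga use, alpha is just a multiplier on the RoPE base, so you can translate it for loaders that want a raw base instead (e.g. llama.cpp’s rope_freq_base). Assuming the base' = base * alpha^(dim/(dim-2)) form, with llama2’s head dimension of 128:

```python
DIM = 128        # llama2 attention head dimension
BASE = 10000.0   # stock llama2 RoPE base

def alpha_to_rope_base(alpha: float) -> float:
    """NTK-aware alpha -> raw RoPE base (assumed exllama convention)."""
    return BASE * alpha ** (DIM / (DIM - 2))

print(round(alpha_to_rope_base(2.5)))   # ~25366
print(round(alpha_to_rope_base(2.65)))  # ~26912
```

So if your GGUF loader only exposes rope_freq_base, something around 26900 corresponds to the 2.65 suggested above.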