Is there a way to prevent coherency degradation when using high levels of RoPE scaling?

tenmileswide@alien.top · 2 years ago

Is there a way to prevent coherency degradation when using high levels of RoPE scaling?

SomeOddCodeGuy@alien.top · 2 years ago

I’ve seen a couple of YaRN models, but I honestly have no idea how to use them lol. Same with the Mistral models: they always want to load at 32k tokens, but then the model's coherency just dies after about 5k. I can’t find clear instructions on what settings either one expects in order to get the most out of its context, so I tend to just avoid using either at high context.
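For anyone unsure what the "scaling" knob actually changes, here is a minimal sketch of the two common RoPE scaling schemes: linear position interpolation and NTK-aware base scaling. It is not taken from any particular loader. The function names, the default base of 10000, and the head dimension of 128 are assumptions for illustration only, and YaRN itself is more involved than either scheme shown here.

```python
def rope_inv_freq(head_dim, base=10000.0):
    """Per-pair inverse frequencies used by rotary position embeddings."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def linear_scaled_angles(position, head_dim, factor, base=10000.0):
    """Linear scaling (position interpolation): squeeze positions by `factor`
    so that position * (1 / factor) stays inside the trained range."""
    return [(position / factor) * f for f in rope_inv_freq(head_dim, base)]


def ntk_scaled_base(head_dim, factor, base=10000.0):
    """NTK-aware scaling: leave positions alone and enlarge the base instead,
    which stretches low frequencies more than high ones."""
    return base * factor ** (head_dim / (head_dim - 2))


def required_factor(target_ctx, trained_ctx):
    """Smallest scale factor that maps target_ctx back into trained_ctx."""
    return max(1.0, target_ctx / trained_ctx)


if __name__ == "__main__":
    # Hypothetical example: a model trained at 4096 tokens, run at 16384.
    factor = required_factor(16384, 4096)
    print("scale factor:", factor)
    print("NTK-adjusted base:", ntk_scaled_base(128, factor))
```

With linear scaling, position 8192 at factor 2 gets exactly the same rotation angles as position 4096 with no scaling, which is why larger factors pack more tokens into the same angular range and tend to blur nearby positions.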