Honestly, a 4-bit quantized version of the 220B model should run on a 192GB M2 Studio, assuming these models even work with a current transformers loader.
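Rough back-of-envelope math (my own assumptions, not from the post: ~4.5 effective bits per weight once quantization scales are counted, plus a guessed allowance for KV cache and buffers) suggests it would indeed squeeze in:

```python
# Back-of-envelope memory estimate for a 4-bit quantized 220B model.
# Assumptions: ~4.5 bits/weight effective (weights + quantization scales),
# plus a rough 20 GB allowance for KV cache and runtime overhead.

def quantized_weights_gb(n_params: float, bits_per_weight: float = 4.5) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

weights_gb = quantized_weights_gb(220e9)   # ~124 GB of weights
overhead_gb = 20.0                         # guess: KV cache + buffers
print(f"weights ≈ {weights_gb:.0f} GB, total ≈ {weights_gb + overhead_gb:.0f} GB")
# Roughly 140-145 GB total, which fits in 192 GB of unified memory,
# though macOS caps how much of it the GPU can wire by default.
```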
What makes this any different than the “base” Yi-34B-200k model?
Where can we see a description of what the model has been fine-tuned on (datasets used, LoRAs used, etc.) and/or your methods for doing so? I'm not finding any of this information in the model card or the Substack link.
Text-generation-webui. It lets me use all model formats, depending on what I want to test at that moment.
For Llama 2 models, set your alpha to 2.65 when loading them at 8k context.
The general suggestion is 2.5, but if you plot the formula on a graph, 8192 context aligns with 2.642, so 2.65 is more accurate than 2.5.
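For anyone wondering what the alpha value actually changes, here's a minimal sketch of the NTK-aware RoPE scaling used by exllama-style loaders, assuming the usual Llama 2 values (head_dim = 128, base = 10000). Note the 2.642 figure itself comes from the community-fitted curve relating alpha to usable context, not directly from this formula.

```python
# Sketch of NTK-aware RoPE scaling: the alpha value inflates the rotary
# embedding frequency base, stretching the usable context window.
# Assumptions: head_dim = 128 and base = 10000 (standard Llama 2 values).

def scaled_rope_base(alpha: float, base: float = 10000.0, head_dim: int = 128) -> float:
    """Return the adjusted RoPE base for a given NTK alpha."""
    return base * alpha ** (head_dim / (head_dim - 2))

for alpha in (2.5, 2.65):
    print(f"alpha={alpha}: base ≈ {scaled_rope_base(alpha):,.0f}")
# alpha=2.5 lifts the base from 10,000 to roughly 25,000;
# alpha=2.65 lifts it to roughly 27,000.
```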
If people started doing this with any regularity, nVidia would intentionally bork the drivers.