and of course TheBloke already prepped everything for our fine consumption.
Had the same problem last night and I promptly deleted it.
I haven’t had any issues running these Yi models. I think they are really good personally.
You took a picture of Nous Capybara…
Yeah I am kinda petty lol.
There are in fact 3 different distillations: https://huggingface.co/collections/ByteWave/distil-yi-models-655a5697ec17c88302ce7ea1
It's not the 200K model, though.
Which is a shame because the same performance + the extra context would have been huge.
Can’t wait!!!
Is there code available for the distillation?
I had okay-ish results blowing up layers from a 70B… but messing with the first or last 20% lobotomizes the model, and I didn't snip more than a couple of layers from any one place. By the time I got the model far enough down in size that a q2_K quant could load in 24 GB of VRAM, it fell apart, so I didn't find mergekit all that useful as a distillation/parameter-reduction process.
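For anyone wondering what I mean by snipping layers: mergekit does it through passthrough slices in a YAML config, but the idea in plain transformers looks roughly like this. The model id and layer indices below are placeholders for illustration, not what I actually ran:

```python
# Rough sketch of layer snipping on a Llama-style 70B (80 layers).
# Model id and layer indices are placeholders, not a recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-hf"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Drop a couple of layers from the middle; touching the first or last ~20%
# is what lobotomized the model for me.
drop = {38, 39, 40, 41}
keep = [layer for i, layer in enumerate(model.model.layers) if i not in drop]
model.model.layers = torch.nn.ModuleList(keep)
model.config.num_hidden_layers = len(keep)

model.save_pretrained("llama-70b-snipped")
AutoTokenizer.from_pretrained(model_id).save_pretrained("llama-70b-snipped")
```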
Oh yeah, it be busted.
Did anyone manage to get them working? I tried GGUF/GPTQ and running them unquantized with trust-remote-code, and they just produced garbage. (I did try removing the BOS token, but got the same result.)
Yeah, exactly the same thing. It produced absolute rubbish whatever I tried. I tried the 8B, 15B, and 23B.
I’ve completely fixed gibberish output on Yi-based and other models by setting the RoPE Frequency Scale to a number less than 1 (the default seems to be 1). I have no idea why that works, but it does.
What I find even stranger is that the models often keep working after setting the frequency scale back to 1.
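In case it helps anyone reproduce this, here's a minimal sketch of where that knob lives if you load the GGUF with llama-cpp-python. The model path and the 0.5 are just example values, not the exact setting I used:

```python
# Minimal sketch: load a GGUF with an explicit RoPE frequency scale.
# Path and value are placeholders; the point is anything < 1.0, not 0.5 itself.
from llama_cpp import Llama

llm = Llama(
    model_path="./yi-34b-chat.Q4_K_M.gguf",  # example file name
    n_ctx=4096,
    rope_freq_scale=0.5,  # default behaviour corresponds to 1.0
)

out = llm("Q: Name three planets.\nA:", max_tokens=32)
print(out["choices"][0]["text"])
```

The llama.cpp CLI exposes the same setting as --rope-freq-scale, I believe.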
What value specifically worked?
did you test the model before advertising it?
Lmao