• 0 Posts
  • 13 Comments
Joined 1 year ago
Cake day: October 30th, 2023

  • There are several popular methods, all supported by the lovely mergekit project at https://github.com/cg123/mergekit.

    The TIES merge method is the newest and most advanced of the bunch. It works well because it implements some logic to minimize how much the models step on each other’s toes when you merge them together. Mergekit also makes it easy to do “frankenmerges” using the passthrough method, where you interleave layers from different models in a way that extends the resulting model’s size beyond the normal limits. For example, that’s how goliath-120b was made from two 70b models merged together. There’s a rough config sketch below.
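    If you want to see what driving mergekit looks like in practice, here’s a minimal sketch of a TIES merge launched from Python. The model names are placeholders, and the exact config keys and CLI invocation should be double-checked against the mergekit README, since they change between versions.

```python
# Rough sketch: write a TIES merge config and hand it to the mergekit CLI.
# Model names are placeholders; the density/weight values are just examples.
import subprocess
from pathlib import Path

config = """\
# Two fine-tunes merged back onto their shared base model with TIES.
models:
  - model: org/model-a-13b      # placeholder fine-tune
    parameters:
      density: 0.5              # fraction of each model's delta weights to keep
      weight: 0.5               # how much this model contributes to the merge
  - model: org/model-b-13b      # placeholder fine-tune
    parameters:
      density: 0.5
      weight: 0.5
merge_method: ties
base_model: org/base-13b        # the common base both fine-tunes started from
parameters:
  normalize: true
dtype: float16
"""

Path("ties-merge.yml").write_text(config)

# Typical invocation per the mergekit docs: config path, then an output directory.
subprocess.run(["mergekit-yaml", "ties-merge.yml", "./merged-model"], check=True)
```

    A passthrough frankenmerge uses the same workflow; the config just swaps the models: section for a slices: list with layer_range entries from each source model, which is how an oversized interleaved model like goliath-120b gets assembled.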




  • EXL2 runs fast, and the quantization process implements some fancy logic behind the scenes to do something similar to the k_m quants for GGUF models. Instead of quantizing every slice of the model to the same bits per weight (bpw), it determines which slices are more important and gives those a higher bpw, while the less-important slices, where the effects of quantization won’t matter as much, get a lower bpw. The result is that the average bits per weight across all the layers works out to what you specified, say 4.0 bpw, but the performance hit to the model is less severe than its level of quantization would suggest, because the important layers are kept at maybe 5.0 or 5.5 bpw, something like that (toy example below).

    In short, EXL2 quants tend to punch above their weight class due to some fancy logic going on behind the scenes.
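    To put rough numbers on that averaging idea, here’s a toy sketch with made-up layer names and bit allocations (not ExLlamaV2’s actual measurement pass). The size-weighted average lands near the target even though the sensitive layers keep extra precision.

```python
# Toy illustration of mixed-precision quantization bookkeeping, not ExLlamaV2's
# real algorithm: important tensors get more bits per weight, tolerant ones get
# fewer, and the size-weighted average comes out near the requested target.

# (layer group, weight count in millions, assigned bpw) -- made-up numbers
layers = [
    ("attention",  2000, 5.5),  # sensitive: keep more precision
    ("mlp_up",     4000, 4.0),
    ("mlp_down",   4000, 3.5),  # tolerates heavier quantization
    ("embeddings", 1000, 4.5),
]

total_bits    = sum(count * bpw for _, count, bpw in layers)
total_weights = sum(count for _, count, _ in layers)

# Prints roughly 4.1, even though the attention tensors were kept at 5.5 bpw.
print(f"average bpw: {total_bits / total_weights:.2f}")
```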






  • This was an insightful comment. The winnowing effect of market conditions should not be underestimated.

    I love the Wild West that is the local LLM scene right now, but I wonder how long the party will last. I predict that the groups with the capacity to produce novel, state-of-the-art LLMs will be seduced by profit into keeping those models closed, and as the models that can run on consumer hardware become increasingly capable, the safety concerns (legitimate or not) will eventually smother their open nature. We may continue to get weights for toy versions of those new flagship models, but I suspect their creators will reserve the top-shelf stuff for their subscription customers, and they can easily cite safety as the reason. I can’t really blame them, either. Why give it away for free when you can get rich off your invention?

    Hopefully I’ll be proven wrong. 🤞 We’ll see…





  • What you highlighted as problems are exactly the reasons people fork out money for the compute to run 34b and 70b models. You can tweak sampler settings and prompt templates all day long, but you can only squeeze so much smarts out of a 7b-13b parameter model.

    The good news is that better 7b and 13b parameter models are coming out all the time. The bad news is that even with all that, you’re still not going to do better than a capable 70b parameter model if you want it to follow instructions, remember what’s going on, and stay consistent with the story.