cross-posted to:
- localllama
What would happen if you replaced the decoder during finetuning? Would you also see a speed-up, but at the expense of VRAM?
Hmm, it looks like such a standard linear algebra optimisation that I'm surprised GPUs don't do it automatically. But yep, it looks good either way.
Any chance P40s can benefit from this through llama.cpp?
It seems like this approach could also be useful in situations where the goal isn't speed, but rather "quality" (by a variety of metrics).