you can try changing the attention to something like flash attention
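For reference, on a recent transformers version this is roughly all it takes — minimal sketch only, assuming flash-attn is installed and your GPU supports it; the model name is just a placeholder:
```python
# Minimal sketch: load a causal LM with FlashAttention-2 instead of the default attention.
# Assumes a recent transformers release, the `flash-attn` package installed, and an
# Ampere-or-newer GPU. Model name is a placeholder, not a recommendation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; swap in whatever you're finetuning

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,              # FlashAttention needs fp16/bf16
    attn_implementation="flash_attention_2",  # falls back to an error if flash-attn isn't available
    device_map="auto",
)
```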
so it sounds like for the 600B they just finetuned Llama 2 again with the same stuff Llama 2 was originally trained with, just more of it…
RefinedWeb
Open-source code from GitHub
Common Crawl

We fine-tuned the model on a huge dataset (generated manually and with automation) for logical understanding and reasoning. We also trained the model for function calling capabilities.
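Not their actual schema (that hasn't been published), just an illustration of what a single function-calling training record could look like — every field and tool name below is an assumption:
```python
# Hypothetical function-calling training record; the real format used for the
# finetune above isn't public, so the schema here is an assumption for illustration.
import json

sample = {
    "messages": [
        {"role": "user", "content": "What's the weather in Berlin tomorrow?"},
        {
            "role": "assistant",
            "content": None,
            "function_call": {
                "name": "get_weather",  # hypothetical tool name
                "arguments": json.dumps({"city": "Berlin", "day": "tomorrow"}),
            },
        },
    ],
    "functions": [
        {
            "name": "get_weather",
            "description": "Look up a weather forecast for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "day": {"type": "string"},
                },
                "required": ["city"],
            },
        }
    ],
}

print(json.dumps(sample, indent=2))
```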
I hate that people are training on that garbage… but they’re ummmm… you know…
PRM8k made the rounds maybe 6+ months ago, but they never publicly released the model.
yea, that seems to be what a few news articles have referenced.
is there something other than the letter Q making you think it’s Q-learning?
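For what it's worth, plain Q-learning is just the tabular update below — a minimal sketch of the textbook algorithm, not a claim about what they're actually doing:
```python
# Tabular Q-learning update, just to show what "Q-learning" normally refers to:
# Q[s, a] <- Q[s, a] + lr * (reward + gamma * max_a' Q[s', a'] - Q[s, a])
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
lr, gamma = 0.1, 0.99

def q_update(s, a, reward, s_next, done):
    # Bootstrap from the best next-state action value unless the episode ended.
    target = reward + (0.0 if done else gamma * Q[s_next].max())
    Q[s, a] += lr * (target - Q[s, a])
```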
What would happen if you replaced the decoder during finetuning? Would you also see a speed-up, but at the expense of VRAM?
This is a giant cluster fuck.
Only if Sam signed a contract with MS that explicitly prevents that.
I will use this now for some tests.
I believe they said they’re going to release training data. We’ll see. That’s about the only way to easily verify what made it in.
Or 2 A6000s. But yea, $$$ matters.