I’m confused about some basics of preparing a dataset for training. My plan is to take a model like Llama 2 7B Chat and train it on some proprietary data I have (in its raw form, this data is very similar to a textbook). Do I need to find a way to reformat this large amount of text into a bunch of pairs like “query” and “output”?
I have seen some LLMs described as “trained on Wikipedia”, which suggests they were trained on that large chunk of text alone, without reformatting it into data pairs. Is there a way I can do that, too? Or, since I want to target a chat model, do I have to find a way to convert the data into pairs that serve as examples of proper input and output?
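To make the two options concrete, here is a minimal sketch of what each dataset shape could look like on disk as JSON Lines. Everything in it is an assumption for illustration: the helper names are made up, the field names `text`, `query` and `output` are just labels (the last two borrowed from the wording above), and splitting on whitespace is a simplification, since real pipelines usually chunk by tokens with the model's own tokenizer. It also does not apply any chat prompt template, which a chat model would normally need.

```python
import json


def chunk_text(text, chunk_words=200, overlap=20):
    """Split raw text into overlapping windows of whitespace-separated words."""
    if overlap >= chunk_words:
        raise ValueError("overlap must be smaller than chunk_words")
    words = text.split()
    step = chunk_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        # Stop once this window has reached the end of the text.
        if start + chunk_words >= len(words):
            break
    return chunks


def raw_text_records(text, chunk_words=200, overlap=20):
    """Option 1: plain text records, the shape used for next-token training."""
    return [{"text": chunk} for chunk in chunk_text(text, chunk_words, overlap)]


def pair_record(query, output):
    """Option 2: one input/output example, the shape used for instruction tuning."""
    return {"query": query, "output": output}


def to_jsonl(records):
    """Serialize records as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


if __name__ == "__main__":
    # Placeholder strings stand in for the real book text and a real Q/A pair.
    book = "word " * 500
    print(len(raw_text_records(book)), "raw-text records")
    print(to_jsonl([pair_record("<a question>", "<its answer>")]))
```

The first shape needs no reformatting beyond chunking, which is roughly what "trained on Wikipedia" refers to. The second requires someone (or something) to write the pairs, but it is the shape that teaches a model how to respond to questions.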
Sadly, I haven’t found any guides on dataset creation or preparation. Material on this seems really sparse, when one would think knowing how to create a high-quality dataset would be important for AI. It feels like a massive blind spot in the community at the moment. Unless I’m missing something, that is, because I would love for there to be a trove of high-quality, concise information that touches on the sort of things OP describes.