I have some basic confusions over how to prepare a dataset for training. My plan is to use a model like llama2 7b chat, and train it on some proprietary data I have (in its raw format, this data is very similar to a text book). Do I need to find a way to reformat this large amount of text into a bunch of pairs like “query” and “output” ?
I have seen some LLM’s which say things like “trained on Wikipedia” which seems like they were able to train it on that large chunk of text alone without reformatting it into data pairs - is there a way I can do that, too? Or since I want to target a chat model, I have to find a way to convert the data into pairs which basically serve as examples of proper input and output?
Very interesting topic. I have thought about this too. One idea that came to my mind was splitting your raw text into chunks, then ask a LLM to generate questions which the answers are these chunks and that way create an artificial dataset of QnA pairs. Of course the quality of the dataset relies on how well your structure your prompts to generate the questions.
Trained and finetuned - 2 things.
The trained on wikipedia - yes, they feed the wikipedia articles to it - hook and sinker. No Q/A. But that doesn’t mean it will be able to give you answer, unless you fine tune it with Q/A “I want you to behave like this” template - but the kick is - what we all are using to our huge advantage - it can be fine-tuned on a totally different Q/A, it will still be able to answer from wikipedia. It’s a hat trick.
Thanks for the information and explanation
I am new to LLMs (I normally train Image Models) so if this is a stupid question let me know.
I have been converting the shadowrun lore wiki into Q and A so i can use that model for a sillytavern character as a contact in my current tabletop game. Do I really need to convert it all to Q and A? If I get a better “Contact” I dont mind.
Sadly, I haven’t found any guides on dataset creation or preparation. It seems really sparse, when one would think knowing how to create a high quality dataset would be important for AI. It feels like a massive blindspot in the community at the moment. Unless I’m missing something, because I would love for there to be a trove of high quality, concise information that touches on the sort of things OP describes.
if you’re making a lora, training on wikipedia directly will pretty much make it output text that looks like wikipedia. which is to say it will (probably) be worse at chatting.
a strategy i’ve been using lately is to get gpt4 to make a conversation in my chosen format *about* each chapter of my “textbook”, i can automate this with pretty good results and it’s done in about 10 minutes. It does kind of work, it’ll at least get the bot to talk about the topics I chose, but as far as actually comprehending the information it’s referencing… it’s bad. It gets better as I increase rank, but it takes a lot of VRAM. I can only get to around 256 before it’ll die
please share!!
Go to huggingface and look at the multitude of datsets that have already been prepped and read whatever documentation and papers that have been published. Go through the data and get a sense of what the data looks like and how it’s structured.
Yea, doing this is part of what spurred the question, because I began to notice some datasets that were very clean and ordered into data pairs, and others that seemed formatted differently, and others still that seemed like they were fed a massive chunk of unstructured text. It made me confused on if there were some sort of standards or not that I was not aware of.
Below is a link to a sample i’ve put together for me recently to create a QA training dataset from source text with llamaindex dataset generator.
I’ve used Oobabooga with extension “openai” as inference api (with a zephyr 7b model).
It worked quite well to generate a dataset fully local. One should use a smaller + a larger model in service_context and service_context_large (which i didn’t so far).
Also you have to change the beginning where it currently only reads in a single file “output/Test.txt”. And maybe change chunk_size and num_questions_per_chunk.
The output json consists of “input” and “output” (which i did for a mistral model…). For llama based models i would maybe change it to “instruction”, “input” (=empty), “output”, “text” (=text chunk)
Please keep in mind that this is only a ugly early prototype that needs cleanup etc…
I had the same question a few weeks back and this blog post was really helpful for me: https://together.ai/blog/redpajama-data-v2 . The scripts used are also open sources on the github repo.
Awesome, thank you, at a glance this looks like it will be very helpful