Hello
I’m using axolotl for fine-tuning Llama-2 13B on conversations. One possibility is to have the dataset in the following format:
# dataset.jsonl
{"text": "### Human: This is a question### Chatbot: This is a reply### Human: What the hell are you talking about?"}
{"text": "### Human: Who's coming tonight?### Chatbot: No one, it's literally Monday."}
...
Is it ok to use ##
as a separate token between speakers or can I also use \n
as a separator token? There are no line breaks in the turns.
Further, axolotl also provides the sharegpt format where the dataset would look as follows:
# dataset.jsonl
{"conversations": [{"from": "Human", "value": "This is a question"}, {"from": "Chatbot", "value": "This is a reply"}, {"from": "Human", "value": "What the hell are you guys talking about?"}]}
{"conversations": [{"from": "Human", "value": "Who's coming tonight?"}, {"from": "Chatbot", "value": "No one, it's literally Monday."}]}
...
Is this correct usage of the sharegpt format and which of the two formats is better to use for fine-tuning on conversations?
But which format should be used if I want to be able to use the original llama prompt when creating my chatbot:
[INST] <>
System message.
<>
{prompt}[/INST]