Helveticus99@alien.topB to

LocalLLaMAEnglish · 2 years ago

Dataset format for fine-tuning Llama-2 with axolotl on conversations

1

1

Dataset format for fine-tuning Llama-2 with axolotl on conversations

Helveticus99@alien.topB to

LocalLLaMAEnglish · 2 years ago

1

Hello

I’m using axolotl for fine-tuning Llama-2 13B on conversations. One possibility is to have the dataset in the following format:

# dataset.jsonl
{"text": "### Human: This is a question### Chatbot: This is a reply### Human: What the hell are you talking about?"}
{"text": "### Human: Who's coming tonight?### Chatbot: No one, it's literally Monday."}
...

Is it ok to use ### as a separate token between speakers or can I also use \n as a separator token? There are no line breaks in the turns.

Further, axolotl also provides the sharegpt format where the dataset would look as follows:

# dataset.jsonl 
{"conversations": [{"from": "Human", "value": "This is a question"}, {"from": "Chatbot", "value": "This is a reply"}, {"from": "Human", "value": "What the hell are you guys talking about?"}]}
{"conversations": [{"from": "Human", "value": "Who's coming tonight?"}, {"from": "Chatbot", "value": "No one, it's literally Monday."}]}
...

Is this correct usage of the sharegpt format and which of the two formats is better to use for fine-tuning on conversations?

Chat

YanaSss@alien.topB
link
fedilink
English
arrow-up
1·
2 years ago
But which format should be used if I want to be able to use the original llama prompt when creating my chatbot:

[INST] <>

System message.

<>

{prompt}[/INST]