Hello

I’m using axolotl for fine-tuning Llama-2 13B on conversations. One possibility is to have the dataset in the following format:

# dataset.jsonl
{"text": "### Human: This is a question### Chatbot: This is a reply### Human: What the hell are you talking about?"}
{"text": "### Human: Who's coming tonight?### Chatbot: No one, it's literally Monday."}
...

Is it OK to use ### as the separator token between speakers, or can I also use \n as the separator? There are no line breaks within the turns.
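To make the question concrete, this is roughly how I'd point axolotl at that file — just a sketch assuming the file name above and that the raw-text loader is still called completion; the exact type names may differ between axolotl versions:

# config.yml (relevant excerpt only, not a full config)
datasets:
  - path: dataset.jsonl    # assumed file name from the example above
    type: completion       # reads the raw "text" field and tokenizes it as-is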

Further, axolotl provides the sharegpt format, where the dataset would look like this:

# dataset.jsonl 
{"conversations": [{"from": "Human", "value": "This is a question"}, {"from": "Chatbot", "value": "This is a reply"}, {"from": "Human", "value": "What the hell are you guys talking about?"}]}
{"conversations": [{"from": "Human", "value": "Who's coming tonight?"}, {"from": "Chatbot", "value": "No one, it's literally Monday."}]}
...
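For completeness, here is roughly how I'd reference that file in the config — again only a sketch, assuming type: sharegpt and the conversation key still work this way in current axolotl:

# config.yml (relevant excerpt only)
datasets:
  - path: dataset.jsonl
    type: sharegpt
    conversation: llama-2   # assumed: renders turns with the Llama-2 chat template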

Is this correct usage of the sharegpt format, and which of the two formats is better for fine-tuning on conversations?

  • YanaSss@alien.top

    But which format should be used if I want to keep the original Llama-2 prompt template when building my chatbot:

    [INST] <<SYS>>
    System message.
    <</SYS>>

    {prompt} [/INST]
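
    For reference, a single turn rendered with that template would look roughly like this (a sketch based on Meta's published Llama-2 chat format; the system message is just a placeholder):

    <s>[INST] <<SYS>>
    You are a helpful assistant.
    <</SYS>>

    Who's coming tonight? [/INST] No one, it's literally Monday. </s>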