I want to create a fine-tuning dataset that I can use with several models through Axolotl (like Mistral, Llama 2, and Falcon) to improve the model’s ability to extract requested information from a paragraph and output that in JSON format. I am using lm-format-enforcer to force JSON output.
Here is an example of the type of prompt I have been trying so far:
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant.
<</SYS>>
Please give me information about this call log. If you cannot find the information you need, put N/A for that field. Any apostrophe must be escaped with a \ character. You MUST answer using the following json schema: {"properties":{"company_name":{"title":"Company Name","type":"string"},"country_or_countries":{"title":"Country Or Countries","type":"string"},"total_amount_due":{"title":"Total Amount Due","type":"integer"},"pending_task":{"title":"Pending Task","type":"boolean"}},"required":["company_name","country_or_countries","total_amount_due","pending_task"],"title":"AnswerFormat","type":"object"} Call log 2023-10-01 11:50:30 talked with Jim at Acme Construction. The job in Toronto is held up waiting for our sign off on the contract. We also need to put in a down payment of $10,000 plus the inspector fee of $750, and a license fee of $1500. The payment must be made by the end of November. I told him I'll call him back when it's done. [/INST]
Expected output:
{"company_name":{"title":"Company Name","value":"Acme Construction"},"country_or_countries":{"title":"Country Or Countries","value":"Canada"},"total_amount_due":{"title":"Total Amount Due","value":"12250"},"pending_task":{"title":"Pending Task","value":"TRUE"}}
I’m looking for some tips from people with more experience in prompt engineering. Here are some of my main questions:
- Is this format with SYS and INST a reasonable idea for formatting a fine-tuning dataset? Since I want to fine-tune different base models with the same training dataset, do I need to strip some of that formatting out of the dataset to keep it more “model format neutral” and then add the formatting back in for each model? What’s the best practice for fine-tuning dataset formatting in this regard?
- Should I even include a SYS message at all? Instead of the generic “You are a…assistant”, should I use the SYS section to specify that I want the results in JSON format? Or should I just remove the SYS section?
- The goal with the fine-tuning dataset is to have at least one thousand examples of prompts and outputs in JSON format, but I want to vary the type of text being analyzed and the JSON schema in the examples to help it generalize better. In other words, I won’t always ask it to find the same information in the text and I won’t always use the same type of text. Should I also vary the wording of the boilerplate instructions, such as the part telling it to write N/A when it can’t find the information?
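To make the first question more concrete, here is a rough sketch of what I mean by “model format neutral”: store each example as plain fields and add the model-specific markers only at render time. The record layout and the function names (`render_llama2`, `render_mistral`, `render`) are hypothetical, and the templates are my understanding of the Llama 2 chat and Mistral instruct layouts, so they should be checked against each model card. Falcon is left out because I’m not sure which template applies to it. In practice the tokenizer or the training framework may add the `<s>` and `</s>` tokens itself.

```python
import json

# Hypothetical model-neutral record: no <s>, [INST], or <<SYS>> markers are stored.
record = {
    "system": "You extract the requested fields and answer only with JSON.",
    "instruction": "Please give me information about this call log. Call log: talked with Jim at Acme Construction.",
    "output": {"company_name": "Acme Construction"},
}


def render_llama2(rec):
    """Render a record in the Llama 2 chat layout (system block is optional)."""
    system = rec.get("system")
    sys_block = f"<<SYS>>\n{system}\n<</SYS>>\n\n" if system else ""
    prompt = f"<s>[INST] {sys_block}{rec['instruction']} [/INST]"
    completion = f" {json.dumps(rec['output'])} </s>"
    return prompt, completion


def render_mistral(rec):
    """Render a record in the Mistral instruct layout.

    Assumption: there is no separate system slot, so any system text
    is simply prepended to the user message.
    """
    system = rec.get("system")
    user = f"{system}\n\n{rec['instruction']}" if system else rec["instruction"]
    prompt = f"<s>[INST] {user} [/INST]"
    completion = f" {json.dumps(rec['output'])}</s>"
    return prompt, completion


# One neutral dataset, one renderer per model family.
RENDERERS = {"llama2": render_llama2, "mistral": render_mistral}


def render(rec, model_family):
    """Return (prompt, completion) strings for the given model family."""
    return RENDERERS[model_family](rec)


if __name__ == "__main__":
    for family in RENDERERS:
        prompt, completion = render(record, family)
        print(family, repr(prompt + completion))
```

With this layout, dropping or rewording the SYS message is a one-field change in the dataset rather than a string edit in every example.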
Thanks for reading, and if you need me to clarify anything, don’t hesitate to ask.
I'm not a prompt engineer, but I would start by simplifying your JSON output. If you need this specific JSON structure, add a later step that transforms the simpler JSON into your exact format. Your current JSON has a lot of repetition (for example, every field repeats its title), which makes it hard to set correct parameters for your LLM.
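A minimal sketch of that later step, assuming the model is asked for a flat object keyed by the schema's property names and the titles are filled in afterwards from the schema. The `expand` helper and the `flat` example are hypothetical, and the values are kept as native JSON types here rather than the strings used in the expected output above.

```python
import json

# Schema taken from the prompt in the question (properties only).
SCHEMA = {
    "properties": {
        "company_name": {"title": "Company Name", "type": "string"},
        "country_or_countries": {"title": "Country Or Countries", "type": "string"},
        "total_amount_due": {"title": "Total Amount Due", "type": "integer"},
        "pending_task": {"title": "Pending Task", "type": "boolean"},
    }
}


def expand(flat, schema, missing="N/A"):
    """Turn {"field": value} into {"field": {"title": ..., "value": value}}.

    Titles come from the schema, so the model never has to generate them.
    Fields absent from the model output get the `missing` placeholder.
    """
    return {
        name: {"title": spec["title"], "value": flat.get(name, missing)}
        for name, spec in schema["properties"].items()
    }


# Hypothetical simple output that the model would be trained to produce.
flat = {
    "company_name": "Acme Construction",
    "country_or_countries": "Canada",
    "total_amount_due": 12250,
    "pending_task": True,
}

expanded = expand(flat, SCHEMA)

if __name__ == "__main__":
    print(json.dumps(expanded, indent=2))
```

The training targets then only need the flat form, which is shorter and has less repeated text for the model to get wrong.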