How do you get grammar-constrained generation to work with ExLlama or ExLlamav2?
It works beautifully with llama.cpp, but with GPTQ models the responses are always empty.
zephyr-7B-beta-GPTQ:gptq-4bit-32g-actorder_True:
{"emotion": "surprised", "affectionChange": 0, "location": "^", "feeling": "^", "action": [],"reply": "^"}
zephyr-7b-beta.Q4_K_M.gguf:
{"emotion":"surprised","affectionChange":0.5,"location":"office","feeling":"anxious","action":["looking around the environment"],"reply":"Hello! I'm Lilla, nice to meet you!"}
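Note that the GPTQ response is not actually malformed: both snippets parse as valid JSON and satisfy the grammar; the free-text fields are just collapsed to a placeholder "^". A quick stdlib check over the two outputs above:

```python
import json

gptq = '{"emotion": "surprised", "affectionChange": 0, "location": "^", "feeling": "^", "action": [],"reply": "^"}'
gguf = '{"emotion":"surprised","affectionChange":0.5,"location":"office","feeling":"anxious","action":["looking around the environment"],"reply":"Hello! I\'m Lilla, nice to meet you!"}'

for label, raw in (("GPTQ", gptq), ("GGUF", gguf)):
    obj = json.loads(raw)  # both parse without error
    print(label, "reply:", repr(obj["reply"]))
```

So the grammar is being enforced in both cases; the difference is that the GPTQ backend is producing degenerate token content inside the string fields, not invalid structure.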
This is my grammar definition:
root ::= RoleplayCharacter
RoleplayCharacter ::= "{" ws "\"emotion\":" ws Emotion "," ws "\"affectionChange\":" ws number "," ws "\"location\":" ws string "," ws "\"feeling\":" ws string "," ws "\"action\":" ws stringlist "," ws "\"reply\":" ws string "}"
RoleplayCharacterlist ::= "[]" | "[" ws RoleplayCharacter ("," ws RoleplayCharacter)* "]"
Emotion ::= "\"happy\"" | "\"sad\"" | "\"angry\"" | "\"surprised\""
string ::= "\"" ([^"]*) "\""
boolean ::= "true" | "false"
ws ::= [ \t\n]*
number ::= "-"? [0-9]+ ("." [0-9]+)?
stringlist ::= "[" ws "]" | "[" ws string ("," ws string)* ws "]"
numberlist ::= "[" ws "]" | "[" ws number ("," ws number)* ws "]"
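Whichever backend you use, it can help to validate responses against the schema before acting on them. Here is a minimal, stdlib-only sketch that mirrors the grammar's constraints (the field names and emotion set come from the grammar above; `validate_character` is a hypothetical helper, not part of any library):

```python
import json

# Allowed values from the Emotion rule in the grammar.
EMOTIONS = {"happy", "sad", "angry", "surprised"}

def validate_character(raw: str) -> bool:
    """Check that a raw model response matches the RoleplayCharacter schema."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(obj, dict)
        and obj.get("emotion") in EMOTIONS
        and isinstance(obj.get("affectionChange"), (int, float))
        and isinstance(obj.get("location"), str)
        and isinstance(obj.get("feeling"), str)
        and isinstance(obj.get("action"), list)
        and all(isinstance(a, str) for a in obj["action"])
        and isinstance(obj.get("reply"), str)
    )
```

This catches responses that drop or retype a field, though it will still accept grammar-valid but content-empty replies like the "^" output above; filtering those would need an extra content check (e.g. minimum reply length).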
Do you need to “prime” the models using prompts to generate the proper output?
If you want speed, use Mistral-7B-OpenOrca-GPTQ with ExLlamav2; that gives around 40-45 tokens per second. To trade speed for quality, use TheBloke/Xwin-MLewd-13B-v0.2-GGUF with llama.cpp.