Well, not a total n00b, as I’ve been playing with LLMs for almost a year and a half now, but with local LLMs only since the summer. I have a lot of experience with local image generators, so I thought I could reuse some of that knowledge when setting up LLMs, but it doesn’t seem to be that easy ;)
Any input that will shed some light on the problems I have will be greatly appreciated :)
Hardware:
Ryzen 9 3900X, 48GB RAM, RTX 4090
Oobabooga startup params:
--load-in-8bit --auto-devices --gpu-memory 23 --cpu-memory 42 --auto-launch --listen
I’m still struggling with some issues, likely caused by improper loader settings, and I’m looking for some tips on how to set them optimally.
I use the oobabooga UI as it’s the most comfortable for me and lets me test models before deploying them elsewhere (e.g. to company UIs: I’m working on a chatbot connected to a vector DB for local document storage, and I thought about ooba as a backend for quickly loading models, setting parameters and exposing them via the API). However, its documentation is vague, and I have a feeling that the parameter names are not standardized either.
Which loader is optimal: ExLlamav2_HF or AutoGPTQ? The latter pretty much always gives me issues :( And in ExLlamav2, when I set a longer context length together with alpha_value or compress_pos_emb, it starts having trouble, especially with repeated digits in numbers, e.g. it will say 190 instead of 1990 or 3137 instead of 31337 (and sometimes also with words, shortening them in a strange way). Is that expected behaviour?
I would like to use a longer context length (4k or even 8k hardly cuts it). I would also like the LLM to generate longer replies. It’s not always necessary, but sometimes it’s desirable (e.g. for code generation). Instructing the model to “continue” usually helps, but longer answers would be nice.
BTW, is “max_position_embeddings” in the model’s config the same as “max_seq_len” in the ExLlamav2 loader settings?
Or maybe you can just point me to a more advanced tutorial discussing these things? Everything I find doesn’t delve into them (just basic tutorials on how to run oobabooga or another UI, and they always use the default configs).
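On the max_position_embeddings vs max_seq_len question, here is a rough sketch of how the two numbers relate when stretching the context with linear scaling. It assumes that max_position_embeddings in config.json reflects the context length the model was trained on (usually the case, but not guaranteed), and that compress_pos_emb should be the target context length divided by that native length, which is how the webui describes the setting. The function names are made up for illustration.

```python
import json


def native_context(config: dict) -> int:
    """Return the context length declared in a model's config.json contents.

    Assumption: max_position_embeddings is the length the model was trained on.
    """
    return int(config["max_position_embeddings"])


def linear_scaling_factor(target_len: int, native_len: int) -> float:
    """Linear positional compression factor for a desired context length.

    A factor of 1.0 means no scaling is needed because the target fits
    inside the native context.
    """
    if native_len <= 0:
        raise ValueError("native_len must be positive")
    if target_len <= native_len:
        return 1.0
    return target_len / native_len


if __name__ == "__main__":
    # Stand-in for json.load(open("config.json")) from the model folder.
    config = json.loads('{"max_position_embeddings": 4096}')
    native = native_context(config)
    for target in (4096, 8192, 16384):
        print(target, linear_scaling_factor(target, native))
```

In other words, max_seq_len would be the length you ask the loader for, and the scaling setting is only needed once that exceeds what the config declares. The webui also says to use either alpha_value or compress_pos_emb, not both, so setting both at once may be part of the garbled output.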
Thanks for the informative answer. I will take a look at GGUF models, although I’m not sure yet how to split them between CPU and GPU (I will take a look at the llama.cpp parameters).
You can find an n-gpu-layers slider when you select llama.cpp. You can just set it to the maximum if you want everything on the GPU. Otherwise, the terminal will show how many layers the model has while it loads.
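If the model doesn’t fit entirely, a back-of-the-envelope way to pick a starting value for n-gpu-layers could look like the sketch below. It assumes all layers are roughly the same size (file size divided by layer count) and keeps a fixed amount of VRAM free for the context cache and other overhead. Both are simplifications, and the function name and the 2 GB reserve are made up for illustration, so treat the result as a first guess to adjust up or down.

```python
def estimate_gpu_layers(model_size_gb: float,
                        total_layers: int,
                        vram_gb: float,
                        reserve_gb: float = 2.0) -> int:
    """Estimate how many layers of a GGUF model might fit on the GPU.

    Assumptions (not exact): every layer takes model_size_gb / total_layers,
    and reserve_gb of VRAM is kept free for the context cache and overhead.
    """
    if total_layers <= 0 or model_size_gb <= 0:
        raise ValueError("model_size_gb and total_layers must be positive")
    per_layer_gb = model_size_gb / total_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    # Never report more layers than the model actually has.
    return min(total_layers, int(usable_gb // per_layer_gb))


if __name__ == "__main__":
    # Example: a 40 GB file with 80 layers on a 24 GB card.
    print(estimate_gpu_layers(40.0, 80, 24.0))
    # Example: a 13 GB file with 40 layers fits completely.
    print(estimate_gpu_layers(13.0, 40, 24.0))
```

The layer count to plug in is the one printed in the terminal during loading. If loading fails or generation is slow, lowering the value by a few layers is the usual next step.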