Well, not a total n00b, as I've been playing with LLMs for almost a year and a half now, but with local LLMs only since the summer. I have a lot of experience with local image generators, so I thought I could carry some of that knowledge over to setting up LLMs, but it doesn't seem to be that easy ;)
Any input that will shed some light on the problems I have will be greatly appreciated :)
Hardware:
Ryzen 9 3900X, 48GB RAM, RTX 4090
Oobabooga startup params:
--load-in-8bit --auto-devices --gpu-memory 23 --cpu-memory 42 --auto-launch --listen
I still keep running into some issues, likely caused by improper loader settings.
I'm looking for tips on how to set them optimally. I use the oobabooga UI as it's the most comfortable for me and lets me test models before deploying them elsewhere (i.e. to company UIs - I'm working on a chatbot connected to a vector DB for local document storage, and I thought about using ooba as a backend for quickly loading models, setting parameters and exposing them via an API - rough sketch of what I mean below). However, its documentation is vague, and I have a feeling the parameter names and so on aren't standardized either.

Which loader is optimal, ExLlamav2_HF or AutoGPTQ? The latter pretty much always gives me issues :( and with ExLlamav2, when I try to set a longer context length and raise alpha_value or compress_pos_emb, it starts having trouble, especially with repeating numbers - e.g. it will say 190 instead of 1990, or 3137 instead of 31337 (but sometimes also with words, shortening them in a strange way). Is that expected behaviour?
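For reference, this is roughly the kind of call I'd like the company UI to make against ooba (a minimal sketch assuming the OpenAI-compatible API that --api is supposed to expose; the port and endpoint path are my assumptions and may differ on your install):

```python
import requests

# Assumption: ooba was started with --api --listen; port 5000 and the
# /v1/chat/completions path are based on the OpenAI-compatible extension.
resp = requests.post(
    "http://127.0.0.1:5000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Summarize the attached document."}],
        "max_tokens": 512,
        "temperature": 0.7,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```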
I would like to use a longer context length (4k or even 8k hardly cuts it). I would also like the LLM to generate longer replies - it's not always necessary, but sometimes it's desired (e.g. for code generation). Usually instructing the model to "continue" helps, but longer answers would be nice.
BTW, is "max_position_embeddings" in the model's config the same as "max_seq_len" in the ExLlamav2 loader settings?
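For context, the value I'm referring to is the one you can read from the model's config, e.g. like this (the model path is just a placeholder):

```python
from transformers import AutoConfig

# Placeholder path - point it at the local model folder.
cfg = AutoConfig.from_pretrained("models/MyModel-GPTQ")
print(cfg.max_position_embeddings)  # the model's native context length
```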
Or maybe you can just point me to some more advanced tutorial discussing these things? All the stuff I find doesn't delve into them (just basic tutorials on how to run oobabooga or another UI, and they always use the default configs).
Model loaders: if you want to load a GPTQ model, you can use ExLlama 1 or 2. AutoGPTQ is old. I personally only use GGUF models, loaded via llama.cpp.
Start-up parameters: I only use --auto-launch.
Context length: the normal context length for Llama 1 based models is 2048, for Llama 2 based models (every model except the new 7B ones) it's 4096, and for Mistral (the new 7B models) it's 8192. You can use alpha_value and rope_freq_base to make more context usable, at the cost of more VRAM. If you want to 2x your context (4k to 8k), you can set alpha_value to 2.5 and rope_freq_base to 25000. Do not use compress_pos_emb.
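If it helps, the way I understand the relation between alpha and rope base is the NTK-style scaling rule - a rough sketch, assuming the 128 rotary/head dimension of Llama-family models and the default base of 10000:

```python
# Rough NTK-style relation between alpha_value and rope_freq_base (my understanding).
base = 10000   # default rope base of Llama-family models
dim = 128      # rotary / head dimension assumed for Llama-family models
alpha = 2.5

new_base = base * alpha ** (dim / (dim - 2))
print(round(new_base))  # ~25364, i.e. roughly the 25000 mentioned above
```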
Models: on 24GB of VRAM you can fit any 7B or 13B model. 20B models are a thing, but not that great. Recently a few good 34B models have been released, but you won't be able to run them with a high context window.
Thanks for the informative answer. I will take a look at GGUF models, although I'm not sure yet how to split them between CPU and GPU (I will take a look at the llama.cpp parameters).
You can find an n-gpu-layers slider when you select llama.cpp. You can just set it to the max if you want everything on the GPU. Otherwise, the terminal will tell you how many layers the model has while it's loading.
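And if you ever load GGUF files outside of ooba, the equivalent knob in llama-cpp-python is n_gpu_layers - a minimal sketch (the model path is a placeholder):

```python
from llama_cpp import Llama

# Placeholder path; -1 offloads all layers to the GPU,
# a smaller number splits the model between GPU and CPU.
llm = Llama(
    model_path="models/your-model.Q4_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=4096,  # context window; raise it if the model supports more
)

out = llm.create_completion("Write a haiku about GPUs.", max_tokens=64)
print(out["choices"][0]["text"])
```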