Is this normal behavior?
I’m still learning, but I noticed that if I load a regular (unquantized) LLM like https://huggingface.co/teknium/OpenHermes-2-Mistral-7B, it takes up all the available VRAM (I have a 3080 10GB).
But when I load a quantized model like https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF, it uses almost no VRAM, maybe 1GB?
Is this normal behaviour?
Update: I just saw that I had the GPU layers set to 0, so it was running entirely on the CPU then?
The slider goes from 0 to 128, how do I know what to pick?

For CPU-only loading, the model's memory usage isn't visible because of mmap loading, which saves time during startup (the file is memory-mapped rather than copied into process RAM). To see the actual usage, load with --no-mmap.
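If it helps, here's roughly how both of those knobs look in llama-cpp-python (assuming a llama.cpp-based backend like most GGUF loaders; the model path and settings below are just illustrative examples, not your exact setup):

```python
from llama_cpp import Llama

# Minimal sketch: path and values are examples, not the exact config your UI uses.
llm = Llama(
    model_path="./openhermes-2.5-mistral-7b.Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=-1,   # -1 = offload every layer to the GPU; lower it if you hit
                       # out-of-memory errors (Mistral 7B has 32 transformer layers,
                       # so anything at or above that offloads the whole model)
    use_mmap=False,    # same effect as --no-mmap: load fully into RAM so the
                       # memory usage actually shows up in process stats
)

out = llm("Explain mmap in one sentence:", max_tokens=64)
print(out["choices"][0]["text"])
```

On a 10GB card, a Q4_K_M 7B file (~4.4GB) usually fits entirely in VRAM, so a common approach is to start with everything offloaded and back off the layer count only if you run out of memory.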