The exceptions I’d make are OpenHermes 2.5 7b and OpenChat 3.5 7b, both pretty good Mistral finetunes. I’d use them over a lot of 13b models. Are they approaching the level of the 34/70b models? No, you can easily tell the difference, but they’re not stupidly dumb anymore.
It’s just a low-parameter problem. If you’ve got the RAM for it, I highly suggest dolphin-2_2-yi-34b. Especially now that koboldcpp has context shifting, you don’t have to wait for all that prompt reprocessing. Also be sure you’re using an instruct mode like Roleplay (which is Alpaca format) or whatever official prompt format that LLM uses.
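For reference, the Alpaca-style layout that the Roleplay preset follows looks roughly like this (a minimal sketch; the exact system line, instruction text, and spacing vary by preset and frontend):

    ### Instruction:
    Write the next reply in this roleplay, staying in character.

    ### Response:

The model’s output goes after the Response header; using the wrong wrapper is one of the most common reasons a finetune seems dumber than it is.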
It really does, and I’m using the smallest quant, Q2_K, which happens to be a bit bigger than the Q4_K_M 70b models, but it will still fit on my layered 64 GB RAM / 8 GB VRAM setup with 4096 context. My speed is about 1500 ms/T, so roughly 0.7 tokens per second.
You can run it off your CPU using koboldcpp and offload however many layers fit in your GPU VRAM, using --gpulayers 40, for example.
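A rough sketch of a full launch (the model filename and layer count are just illustrative, and flag names can differ slightly between koboldcpp versions, so check --help for your build):

    python koboldcpp.py --model dolphin-2_2-yi-34b.Q4_K_M.gguf --gpulayers 40 --contextsize 4096 --usecublas

If it crashes with out-of-memory errors, lower --gpulayers until it loads; the rest of the layers just stay in system RAM.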
I store all mine on slow drives because no matter where you load it, RAM or VRAM, the model gets fully loaded and the original file isn’t touched again. And sequential read speed on huge files isn’t terrible, even on a spinning disk. Even if you overload your RAM and swap to disk, you’ll still be hitting your designated pagefile/swap drive rather than the drive holding your LLM files.