So I got this shiny new GPU and I want to push it to the limit. What’s the most powerful, smartest model out there? Ideally something with as much long-term memory as possible. I’m coming off of ChatGPT 4 and want something local and uncensored.
A 70b model quantized to 4.x bits and trained for 16k context fits with room to spare under exllamav2. If you can add a 3090 or 4090 as well, you can run a 6-bit 70b at 32k context. That’s my standard inference setup and it covers a lot of ground.
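For a rough sanity check on those numbers, here is a back-of-the-envelope sketch in Python. It only counts the quantized weights (parameters times bits per weight, divided by 8) and ignores the KV cache and runtime overhead, so treat it as a lower bound. The 4.5 bpw figure is a stand-in for "4.x", and the 48 GB and 24 GB card sizes are assumptions (an A6000-class card plus a 3090/4090). `weight_gb` is just an illustrative helper, not part of exllamav2.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 weights * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_billion * bits_per_weight / 8


# Assumed VRAM sizes: one 48 GB card, optionally plus a 24 GB 3090/4090.
SINGLE_GPU_GB = 48
DUAL_GPU_GB = 48 + 24

for label, bpw, budget in [
    ("70b @ 4.5 bpw, one card", 4.5, SINGLE_GPU_GB),
    ("70b @ 6.0 bpw, one card", 6.0, SINGLE_GPU_GB),
    ("70b @ 6.0 bpw, two cards", 6.0, DUAL_GPU_GB),
]:
    size = weight_gb(70, bpw)
    print(f"{label}: {size:.1f} GB of weights, "
          f"{budget - size:.1f} GB left for cache/overhead")
```

Under those assumptions the 6-bit 70b weights alone come to about 52.5 GB, which overflows a single 48 GB card. That is where the second GPU comes in.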
I also have a 3080ti! I didn’t even know it was possible to combine them. Where can I go to learn how?
Hugging Face will have models you can download. Look on GitHub for llama and other repositories that will get you going. I just started playing around with local LLMs and it’s been an interesting journey, mainly because I’m using a MacBook Pro, just the M2 Pro with 16GB of shared memory. So far I’ve been playing with 13b models with decent results. You may want to check out localaivoicechat on GitHub for something that lets you talk with your voice and get real-time voice playback of the generated AI response. There are plenty of other tools too, like oobabooga’s text-generation-webui and SillyTavern. Start looking into those for now and it will open that can of worms for you. LM Studio is also worth checking out. Local AI tools are generally easier to run and better supported on a PC than on a Mac, so you should have a smoother time than I do, though I’m doing OK so far. Good luck, hope that helps a bit.
I’m also mucking about with A6000s. 120b models quantised to 3bpw fit well with a 4K context in 48GB and are a lot of fun. Of course, soon you’ll realise how much more you can do with two A6000s…
What’s your favorite model that you’ve been using?
I’d try Goliath 120B and lzlv 70B. Those are the absolute best I’ve used, assuming you’re doing story writing / RP and stuff.
lzlv should be speedy as can be and will easily fit in VRAM.
Goliath won’t quite fit at 4 bit, but you could go to lower precision, or sacrifice some speed and run a q4_k_m GGUF with most of the layers offloaded to the GPU. That’d be my choice, but I have a high tolerance for slow generation.
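If you go the GGUF route, a rough way to guess how many layers to offload is to split the file size evenly across the layers and see how many fit in the VRAM you are willing to give to weights. This is only a sketch: `layers_that_fit` is a made-up helper, it assumes all layers are the same size, and the file size, layer count, and reserve in the example are placeholder numbers, not Goliath’s real figures. Read the real values off the model card or the loader’s log.

```python
import math


def layers_that_fit(model_gb: float, n_layers: int,
                    vram_gb: float, reserve_gb: float) -> int:
    """Estimate how many layers can be offloaded to the GPU.

    Assumes every layer takes an equal share of the file and that
    reserve_gb of VRAM is kept free for context cache and scratch buffers.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Placeholder numbers for illustration only (not measured from any real model).
print(layers_that_fit(model_gb=70.0, n_layers=140, vram_gb=48.0, reserve_gb=6.0))
```

In practice it is worth starting a few layers below the estimate and nudging it up until you run out of memory, since the cache and buffers grow with context size.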
I’m willing to wait for quality so that’s no problem!
Where can I go to find these models? And how do I set them up and get them running?
If you’re on Windows, I’d download KoboldCPP and TheBloke’s GGUF models from HuggingFace.
Then you just launch KoboldCPP, select the .gguf file, select your GPU, enter the number of layers to offload, set the context size (4096 for those), etc., and launch it.
Then you’re good to start messing around. You can use the Kobold interface that’ll pop up, or use it through the API with something like SillyTavern.
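If you want to poke the API directly instead of going through SillyTavern, something like the sketch below should be close. It assumes KoboldCPP is listening on its usual local address (http://localhost:5001) and exposes the KoboldAI-style `/api/v1/generate` endpoint that returns `{"results": [{"text": ...}]}`; check the API docs your build serves if it differs. The helper names and the sampling value are just illustrative.

```python
import json
import urllib.request

# Assumed default address of a locally running KoboldCPP instance.
BASE_URL = "http://localhost:5001"


def build_payload(prompt: str, max_length: int = 200,
                  max_context_length: int = 4096) -> dict:
    """Build the JSON body for a generate request."""
    return {
        "prompt": prompt,
        "max_length": max_length,                  # tokens to generate
        "max_context_length": max_context_length,  # match the context size set at launch
        "temperature": 0.7,                        # illustrative sampling value
    }


def generate(prompt: str, base_url: str = BASE_URL) -> str:
    """POST the prompt to the server and return the generated text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/v1/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["results"][0]["text"]
```

With the server running, `print(generate("Once upon a time"))` should print the model’s continuation.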
I’ve been sparingly renting an A6000 lately; I wish I had one of my own. Just about any 70b model at 4-5bpw should work fine.
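How much context you can afford on top of the weights mostly comes down to the KV cache. Here is a rough sketch, assuming a Llama-2-70B-shaped model (80 layers, 8 KV heads with grouped-query attention, head size 128) and an unquantized fp16 cache. `kv_cache_gb` is a hypothetical helper, and other architectures or cache formats will give different numbers.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB (1 GB = 1e9 bytes).

    Each token stores one key and one value vector per layer per KV head.
    """
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = key + value
    return values_per_token * context_tokens * bytes_per_value / 1e9


# Assumed Llama-2-70B shape: 80 layers, 8 KV heads, head_dim 128, fp16 cache.
for ctx in (4096, 16384, 32768):
    print(f"{ctx:>6} tokens: {kv_cache_gb(80, 8, 128, ctx):.2f} GB")
```

Add that to the weight size for your chosen bpw and compare against the card’s VRAM, leaving a bit of slack for buffers.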
Then you realize it would be great to have an A100 80GB.
For real! But I’m gonna enjoy what I have for now.