I’m fascinated by the whole ecosystem popping up around llama and local LLMs. I’m also curious what everyone here is up to with the models they are running.
Why are you interested in running local models? What are you doing with them?
Secondarily, how are you running your models? Are you truly running them on local hardware, or on a cloud service?
I currently have mistral-7b-openorca.Q5_K_M.gguf running in a Proxmox Debian container with 8 CPU cores and 8 GB RAM, using llama-cpp-python. Speed is slightly slower than what we get on Bing Chat, but it's absolutely usable/fine for a personal, local assistant. I coded it against the llama-cpp-python binding and exposed a chat UI at a local URL using the Gradio Python library. This has been very useful so far as an AI assistant for big and small random requests from phones, PCs, and laptops at home. I also use it from outside via Cloudflare Tunnels (on a separate network which I use to expose services). A rough sketch of that setup is below.
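For anyone curious, this is roughly the shape of it; a minimal sketch, assuming the GGUF file sits next to the script and a Gradio version that passes chat history as (user, assistant) pairs. Paths, context size, token limit, and port are placeholders, not my exact values:

```python
# Minimal sketch: llama-cpp-python serving a GGUF model behind a Gradio chat UI.
# Model path, context size, and port are placeholders -- adjust for your container.
from llama_cpp import Llama
import gradio as gr

llm = Llama(
    model_path="mistral-7b-openorca.Q5_K_M.gguf",  # path to the GGUF file
    n_ctx=4096,    # context window
    n_threads=8,   # match the container's CPU cores
)

def chat(message, history):
    # Rebuild the conversation as chat messages for create_chat_completion()
    messages = []
    for user_msg, bot_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": bot_msg})
    messages.append({"role": "user", "content": message})
    out = llm.create_chat_completion(messages=messages, max_tokens=512)
    return out["choices"][0]["message"]["content"]

# Bind to 0.0.0.0 so other devices on the LAN (or a Cloudflare tunnel) can reach it.
gr.ChatInterface(chat).launch(server_name="0.0.0.0", server_port=7860)
```

Pointing a Cloudflare tunnel at that port is what lets me reach the same UI from outside the house.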
I also have a similar setup using llama.cpp (compiled for an AMD GPU) on a slightly more powerful Linux system, where I've written a small shell script per model to launch it. I call each script through a shell alias, "summon-{modelname}", and the model is ready to answer my questions directly from the command line.
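The wrapper is basically just an alias pointing at a one-liner around llama.cpp's CLI; a rough sketch, assuming a ROCm/HIP build and with hypothetical paths and flag values:

```bash
#!/usr/bin/env bash
# summon-mistral: hypothetical wrapper around llama.cpp's interactive CLI.
# Binary path, model path, and flag values are assumptions -- adjust to your build.
# -ngl 99 offloads all layers to the (AMD) GPU, -c sets the context size.
exec /opt/llama.cpp/llama-cli \
    -m /models/mistral-7b-openorca.Q5_K_M.gguf \
    -ngl 99 -c 4096 --interactive-first "$@"
```

Then in ~/.bashrc something like `alias summon-mistral="$HOME/bin/summon-mistral"`, one alias per model, so "summoning" a model is a single command away.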