I have mistral-7b-openorca.Q5_K_M.gguf currently running in a Proxmox Debian container with 8 CPU cores and 8 GB RAM, using the llama-cpp-python bindings. Speed is slightly slower than what we get on Bing Chat, but it's absolutely usable for a personal, local assistant. I wrote the glue code with the llama.cpp Python bindings and exposed a chat UI on a local URL using the Gradio Python library. This has been very useful so far as an AI assistant for big and small random requests from my phone, PC, and laptops at home. I also use it from outside via Cloudflare Tunnels (on a separate network that I use for exposing services).
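For anyone curious what that kind of setup looks like, here's a minimal sketch assuming llama-cpp-python and Gradio are installed; the model path, context size, thread count, and port are placeholders, not my exact config:

```python
# Sketch: a GGUF model served by llama-cpp-python behind a Gradio chat UI.
from llama_cpp import Llama
import gradio as gr

llm = Llama(
    model_path="mistral-7b-openorca.Q5_K_M.gguf",  # assumed path
    n_ctx=4096,     # context window, tune to taste
    n_threads=8,    # matches the container's 8 CPU cores
)

def chat(message, history):
    # Rebuild the conversation as chat messages for the model
    # (Gradio passes history as [user, assistant] pairs by default).
    messages = []
    for user_msg, bot_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": bot_msg})
    messages.append({"role": "user", "content": message})
    out = llm.create_chat_completion(messages=messages, max_tokens=512)
    return out["choices"][0]["message"]["content"]

# server_name="0.0.0.0" exposes the UI on the LAN (and through the tunnel).
gr.ChatInterface(chat).launch(server_name="0.0.0.0", server_port=7860)
```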
I also have a similar setup using llama.cpp (compiled for an AMD GPU) on a slightly more powerful Linux system, where I've written a small shell script that invokes a different model. I call the script through a shell alias, "summon-{modelname}", and the model is ready to answer my questions directly from the command line.
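The wrapper itself is nothing fancy; a hypothetical sketch of the idea might look like this (binary name, model path, and flags are assumptions on my part, e.g. recent llama.cpp builds ship llama-cli):

```bash
#!/usr/bin/env bash
# Hypothetical "summon-{modelname}" wrapper around the llama.cpp CLI.
MODEL="$HOME/models/mistral-7b-openorca.Q5_K_M.gguf"   # assumed location

# -ngl offloads layers to the GPU (ROCm/HIP build for AMD);
# -cnv drops into an interactive conversation right in the terminal.
llama-cli -m "$MODEL" -ngl 99 -cnv "$@"

# In ~/.bashrc: alias summon-mistral='~/bin/summon-mistral.sh'
```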