Use case: I want to create a service based on Mistral 7B that will serve an internal office of 8-10 users.
I’ve been looking at modal.com and RunPod. Are there any other recommendations?
I noticed TheBloke was using Massed Compute to quantize models. I’ve been poking around and using their hardware a bit more.
Huge fan of Modal; I’ve been using them for a couple of serverless LLM and diffusion models. It can definitely be on the costly side, but I like that the cost scales directly with requests and the setup is trivial.
A recent project with Modal: https://github.com/sshh12/llm-chat-web-ui/tree/main/modal
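For flavor, here's the rough shape of a Modal deployment, as a minimal sketch rather than code from the repo above. The app name, GPU type, and model ID are placeholders I picked, and Modal's API has evolved, so check their docs:

```python
# Sketch: serving Mistral 7B on Modal behind vLLM (names are illustrative).
import modal

app = modal.App("mistral-office")
image = modal.Image.debian_slim().pip_install("vllm")

@app.cls(gpu="A10G", image=image)
class Mistral:
    @modal.enter()
    def load(self):
        # Runs once per container start, so the weights load on cold
        # start rather than on every request.
        from vllm import LLM
        self.llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")

    @modal.method()
    def generate(self, prompt: str) -> str:
        from vllm import SamplingParams
        out = self.llm.generate([prompt], SamplingParams(max_tokens=256))
        return out[0].outputs[0].text

@app.local_entrypoint()
def main():
    # `modal run this_file.py` spins up a GPU container on demand.
    print(Mistral().generate.remote("Say hi to the office."))
```

The scale-to-zero part is what makes the per-request pricing work for a small office: you pay for GPU time only while someone is actually waiting on a completion.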
In our internal lab office, we’re using https://ollama.ai/ with https://github.com/ollama-webui/ollama-webui to host LLMs locally; the docker compose setup provided by the ollama-webui team worked like a charm for us.
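If it helps anyone wiring internal tools into that setup: Ollama exposes a plain REST API on port 11434, so services can call it directly. A quick sketch (the model tag and prompt are just examples):

```python
# Query a locally running Ollama server over its REST API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",  # any model you've pulled with `ollama pull`
        "prompt": "Summarize our onboarding doc in three bullets.",
        "stream": False,     # return one JSON object instead of a stream
    },
)
print(resp.json()["response"])
```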
Do you have hardware to serve the API or do you want to run this from the cloud?
Looking at cloud as an option. Don’t really have hardware now.
I can recommend vLLM. It also offers an OpenAI-compatible API server, if you want that.
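Sketch of what that looks like, assuming the stock vLLM OpenAI-compatible server on its default port 8000:

```python
# Start the server separately, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.1
# Then any OpenAI client can talk to it:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM ignores the key unless you configure one
)
resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Hello from the office!"}],
)
print(resp.choices[0].message.content)
```

Nice side effect for your use case: existing GPT integrations (CMS hooks, etc.) can be pointed at it just by swapping the base URL.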
Have you thought about running it off a local M1 Mac mini? Ollama uses the Mac GPU out of the box.
WebAssembly-based open-source LLM inference (API service and local hosting): https://github.com/second-state/llama-utils
Hmm, cool. It seems the inference app itself is only a few MBs.
Just curious. What are you using it for?
Knowledge base, general GPT use, interaction with our CMS to add or update data.
Let us know what you end up going with, OP! I’m interested in something like this as well…