I need a bit more info from people who installed Llama2 locally and using it to support web apps, or just local information.
- What is the ideal hardware for the 65b version?
- How many tokens can this hardware process per second, input, and output?
- Regarding safety, since it is used for business, what is the change that this model will end up arguing with the customer 😊 ?
You need a load balancer of some sort but an A6000 would be a good start. 15-20 tps as a single user.
In vanilla form, Llama 2 may do silly stuff. Instructs, tuning, etc. will decrease the likelihood.
If you are taking something to prod, I’d advise picking up a consultant to work with you.