I would suggest you use Koboldcpp and run GGUF models. A 70B Q5 model with around 40 layers loaded onto the GPU should get more than 1 t/s. At least for me, I got 1.5 t/s with a 4090 and 64GB of RAM using Q5_K_M.
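For reference, here is a minimal sketch of the same kind of setup using llama-cpp-python, which wraps the same llama.cpp backend that Koboldcpp builds on (Koboldcpp itself exposes the equivalent setting as `--gpulayers` / the GUI slider). The model path is a placeholder, and 40 layers is just what happened to fit my 24GB card; tune `n_gpu_layers` to whatever fits your VRAM:

```python
# Minimal sketch: load a GGUF model with partial GPU offload via
# llama-cpp-python. Assumes it was installed with CUDA support.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama2-70b.Q5_K_M.gguf",  # hypothetical path
    n_gpu_layers=40,  # ~40 layers fit on a 24GB 4090; the rest stays in system RAM
    n_ctx=4096,       # context window size
)

# Simple completion to check tokens/sec; timings are printed to stderr.
out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
```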
TuuNo_@alien.top to LocalLLaMA • What is considered the best uncensored LLM right now? • 2 years ago
https://github.com/ggerganov/llama.cpp/pull/1684 — a higher parameter count should always be better.
Well, I have never used Linux before, since the main purpose of my PC is gaming. But I've heard running LLMs on Linux is faster overall.