I’m only getting 0.8 tokens/second on my 3060 12GB running Zephyr 7B beta.
I’ll admit I barely know what I’m doing, but was I wrong to expect a little more? I was hoping for something at least a quarter of the speed of GPT-3.5…
Run this with TGI or vLLM instead.
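For vLLM, something like this minimal sketch works; note the model name is an assumption (fp16 Zephyr 7B needs ~14 GB, more than a 3060's 12 GB, so I'm pointing at an AWQ quant here):

```python
# Minimal vLLM sketch -- model choice and sampling values are illustrative.
from vllm import LLM, SamplingParams

# fp16 Zephyr 7B won't fit in 12 GB of VRAM, so a 4-bit AWQ quant is assumed.
llm = LLM(model="TheBloke/zephyr-7B-beta-AWQ", quantization="awq")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```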
What’s the latest t/s on a 4-bit model with TGI? Is there a difference compared with the HF Transformers loader?
The attention layers get replaced with FlashAttention-2, and there’s KV caching as well, so you get much better batch-1 and batch-N results, with continuous batching across every request.
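Once a TGI server is running, querying it is straightforward. A sketch using the `text_generation` client; the endpoint URL and the Zephyr chat template are assumptions on my part:

```python
# Sketch of querying a running TGI server; assumes `pip install text-generation`
# and a TGI instance already serving Zephyr on localhost:8080.
from text_generation import Client

client = Client("http://localhost:8080")
response = client.generate(
    "<|user|>\nWhat is continuous batching?</s>\n<|assistant|>\n",
    max_new_tokens=200,
)
print(response.generated_text)
```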
What is TGI?
I get about 30 t/s on my 12GB 4070 Ti with Zephyr, so something is definitely borked. 0.8 is what I’d expect from a 70B model running on the CPU and system RAM. Make sure you’re offloading as many layers to the GPU as your system can handle (in this case, all of them).
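If you’re running a GGUF quant through llama-cpp-python (a guess, since you didn’t say which loader), full offload looks like this; the file path is illustrative:

```python
# Sketch of full GPU offload with llama-cpp-python. Assumes a CUDA-enabled
# build and a downloaded GGUF file -- the path below is just an example.
from llama_cpp import Llama

llm = Llama(
    model_path="zephyr-7b-beta.Q4_K_M.gguf",
    n_gpu_layers=-1,  # -1 = offload every layer to the GPU
    n_ctx=4096,
)
out = llm("Q: Why is my inference slow?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```

If any layers stay on the CPU, generation speed drops off a cliff, which would explain numbers like 0.8 t/s.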
Sounds like you’re executing that on the CPU. When you run nvidia-smi, do you see memory and GPU utilization?
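If you’d rather check programmatically, here’s a small sketch using the pynvml bindings (assumes `pip install nvidia-ml-py`):

```python
# Sketch: read GPU 0's memory and utilization while your model is generating.
# If VRAM used stays near zero, the model never made it onto the GPU.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"VRAM used: {mem.used / 1024**3:.1f} / {mem.total / 1024**3:.1f} GiB")
print(f"GPU utilization: {util.gpu}%")

pynvml.nvmlShutdown()
```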