I’m only getting 0.8 tokens/second on my 3060 12GB running Zephyr 7B beta.
I’ll admit I barely know what I’m doing, but was I wrong to expect a little more? I was hoping for something at least a quarter of the speed of GPT-3.5…
Run this with TGI or vLLM instead.
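For vLLM, something like this minimal sketch works; note the model name is an assumption (fp16 Zephyr 7B needs ~14 GB, more than a 3060's 12 GB, so I'm pointing at an AWQ quant here):

```python
# Minimal vLLM sketch -- model choice and sampling values are illustrative.
from vllm import LLM, SamplingParams

# fp16 Zephyr 7B won't fit in 12 GB of VRAM, so a 4-bit AWQ quant is assumed.
llm = LLM(model="TheBloke/zephyr-7B-beta-AWQ", quantization="awq")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```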
What’s the latest t/s on a 4-bit model with TGI? Is there a difference compared with the HF Transformers loader?
The attention layers get replaced with FlashAttention-2, and there’s KV caching as well, so you get much better batch-1 and batch-N results, with continuous batching across every request.
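Once a TGI server is running, querying it is straightforward. A sketch using the `text_generation` client; the endpoint URL and the Zephyr chat template are assumptions on my part:

```python
# Sketch of querying a running TGI server; assumes `pip install text-generation`
# and a TGI instance already serving Zephyr on localhost:8080.
from text_generation import Client

client = Client("http://localhost:8080")
response = client.generate(
    "<|user|>\nWhat is continuous batching?</s>\n<|assistant|>\n",
    max_new_tokens=200,
)
print(response.generated_text)
```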
What is TGI?
I get about 30 t/s on my 12GB 4070 Ti with Zephyr, so something is definitely borked. 0.8 is what I’d expect from a 70B model running on the CPU and system RAM. Make sure you’re offloading as many layers to the GPU as your system can handle (in this case, all of them).
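If you’re running a GGUF quant through llama-cpp-python (a guess, since you didn’t say which loader), full offload looks like this; the file path is illustrative:

```python
# Sketch of full GPU offload with llama-cpp-python. Assumes a CUDA-enabled
# build and a downloaded GGUF file -- the path below is just an example.
from llama_cpp import Llama

llm = Llama(
    model_path="zephyr-7b-beta.Q4_K_M.gguf",
    n_gpu_layers=-1,  # -1 = offload every layer to the GPU
    n_ctx=4096,
)
out = llm("Q: Why is my inference slow?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```

If any layers stay on the CPU, generation speed drops off a cliff, which would explain numbers like 0.8 t/s.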
Sounds like you’re executing that on the CPU. When you run nvidia-smi, do you see memory and GPU utilization?
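If you’d rather check programmatically, here’s a small sketch using the pynvml bindings (assumes `pip install nvidia-ml-py`):

```python
# Sketch: read GPU 0's memory and utilization while your model is generating.
# If VRAM used stays near zero, the model never made it onto the GPU.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"VRAM used: {mem.used / 1024**3:.1f} / {mem.total / 1024**3:.1f} GiB")
print(f"GPU utilization: {util.gpu}%")

pynvml.nvmlShutdown()
```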