Is this accurate?
Is this able to use CPU (similar to llama.cpp)?
No chance of running this on P40s any time soon?
Hey he finally gets some recognition.
Can you offload layers with this like GGUF?
I don’t have much VRAM / RAM so even when running a 7B I have to partially offload layers.
God, I can’t wait until we’re past the command line era of this stuff
I’m the opposite. I shun everything LLM that isn’t command line when I can. Everything has its place. When dealing with media, GUI is the way to go. But when dealing with text, command line is fine. I don’t need animated pop-up bubbles.
I’m the author of this article, thank you for posting it! If you don’t want to use Medium, here’s the link to the article on my blog: https://mlabonne.github.io/blog/posts/ExLlamaV2_The_Fastest_Library_to_Run%C2%A0LLMs.html
I’m a little surprised by the mention of chatcode.py, which was merged into chat.py almost two months ago. Also, it doesn’t really require flash-attn-2 to run “properly”; it just runs a little better that way. But it’s perfectly usable without it. Great article, though. Thanks. :)
Thanks for your excellent library! It makes sense because I started writing this article about two months ago (chatcode.py is still mentioned in the README.md, by the way). I had very low throughput using ExLlamaV2 without flash-attn-2. Do you know if that’s still the case? I updated these two points, thanks for your feedback.
I wish there was support for Metal with ExLlamaV2. :(
I’ve tested pretty much all of the available quantization methods and I prefer exllamav2 for everything I run on GPU, it’s fast and gives high quality results. If anyone wants to experiment with some different calibration parquets, I’ve taken a portion of the PIPPA data and converted it into various prompt formats, along with a portion of the synthia instruction/response pairs that I’ve also converted into different prompt formats. I’ve only tested them on OpenHermes, but they did make coherent models that all produce different generation output from the same prompt.
In my experience it’s the fastest and llama.cpp is the slowest.
I think ExLlama (and ExLlamaV2) is great. EXL2’s ability to quantize to arbitrary bpw and its incredibly fast prefill processing generally make it the best real-world choice for modern consumer GPUs. However, in testing on my workstations (5950X CPU and 3090/4090 GPUs), llama.cpp actually edges out ExLlamaV2 for inference speed (with a q4_0 even beating out a 3.0bpw), so I don’t think it’s quite so cut and dried.
For those looking for max batch=1 perf, I’d highly recommend running your own benchmarks on your own system to see what works (and pay attention to prefill speeds if you often have long context)!
My benchmarks from a month or two ago: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYpb63e1ZR3aePczz3zlbJW-Y4/edit#gid=1788227831
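For anyone who wants to try the "benchmark it yourself" advice, here is a minimal timing sketch. It is not tied to ExLlamaV2 or llama.cpp: measure_throughput and fake_stream are hypothetical names, and the only assumption is that your backend can be wrapped in a function that yields tokens one at a time. Time to first token stands in for prefill cost, and the rest is decode speed.

```python
import time
from typing import Callable, Iterable


def measure_throughput(stream: Callable[[], Iterable[object]]) -> dict:
    """Time a token stream: time to first token (prefill) and decode tokens/s."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream():
        if first is None:
            first = time.perf_counter()  # first token marks the end of prefill
        count += 1
    end = time.perf_counter()
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_time = end - first
    if count > 1 and decode_time > 0:
        decode_tps = (count - 1) / decode_time
    else:
        decode_tps = float("nan")  # not enough tokens to measure decode speed
    return {
        "tokens": count,
        "time_to_first_token_s": first - start,
        "decode_tokens_per_s": decode_tps,
    }


def fake_stream():
    """Stand-in for a real backend (hypothetical): sleeps instead of computing."""
    time.sleep(0.02)  # pretend prefill over a long prompt
    for _ in range(5):
        time.sleep(0.001)  # pretend per-token decode
        yield "tok"


if __name__ == "__main__":
    print(measure_throughput(fake_stream))
```

Swap fake_stream for a wrapper around whichever loader you are testing, and use the same prompt and the same number of generated tokens for each backend so the numbers are comparable.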
Thanks for sharing! I have been struggling with the llama.cpp loader and GGUF (using oobabooga and the same model). No matter how I set the parameters or how many layers I offload to the GPUs, llama.cpp is way slower than ExLlama (v1 and v2): not just a bit slower, but an order of magnitude slower. I really don’t know why.
Does it run on Apple Silicon?
Based on the releases, doesn’t look like it. https://github.com/turboderp/exllamav2/releases
So how much VRAM would be required for a 34B or a 14B model? I assume no CPU offloading, right? With my 12 GB of VRAM, I guess I could only fit 14-billion-parameter models, maybe not even that.
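A rough way to answer the VRAM question is back-of-the-envelope arithmetic: quantized weights take about (parameters × bits per weight) / 8 bytes. The sketch below covers the weights only. The KV cache, activations, and framework overhead come on top and depend on context length, so treat the result as a lower bound. weight_vram_gib is a hypothetical helper name, not part of any library.

```python
def weight_vram_gib(params_billion: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GiB (weights only)."""
    total_bits = params_billion * 1e9 * bpw
    return total_bits / 8 / 1024**3


if __name__ == "__main__":
    for params in (14, 34):
        for bpw in (3.0, 4.0, 5.0):
            size = weight_vram_gib(params, bpw)
            print(f"{params}B at {bpw} bpw: about {size:.1f} GiB of weights")
```

Under these assumptions a 14B model at 4.0 bpw needs about 6.5 GiB for its weights, while a 34B model at 4.0 bpw needs about 15.8 GiB, before the cache and other overhead are counted.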
It’s not just great. It’s a piece of art.
Too bad Windows support for it was lacking, I think (at least last time I checked). It needs a separate component to work properly, and that component was only available for Linux.
It works fine for me. I am also using a 3090 and text-gen-webui like Liquiddandruff.
Agreed. Best performance running GPTQs. Missing the HF samplers, but that’s OK.
I recently added Mirostat, min-P (the new one), tail-free sampling, and temperature-last as an option. I don’t personally put much stock in having an overabundance of sampling parameters, but they are there now for better or worse. So for the exllamav2 (non-HF) loader in TGW, it can’t be long before there’s an update to expose those parameters in the UI.
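For readers unfamiliar with two of the options mentioned above, here is a small sketch of min-P filtering combined with temperature-last ordering, as those terms are commonly described. This is not ExLlamaV2's code: all function names are hypothetical, and the logic assumes min-P keeps every token whose probability is at least min_p times the top token's probability, with temperature applied only after that truncation.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally scaled by a temperature."""
    top = max(logits)
    exps = [math.exp((x - top) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_keep(probs, min_p):
    """Indices of tokens with probability >= min_p * (highest probability)."""
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]


def sample_min_p_temperature_last(logits, min_p=0.1, temperature=0.8, rng=random):
    """Truncate with min-P on the untempered distribution, then apply temperature."""
    kept = min_p_keep(softmax(logits), min_p)
    # Temperature-last: only the surviving tokens are rescaled and sampled.
    weights = softmax([logits[i] for i in kept], temperature)
    return rng.choices(kept, weights=weights, k=1)[0]


if __name__ == "__main__":
    logits = [5.0, 4.0, 0.0, -3.0]
    print(min_p_keep(softmax(logits), 0.1))
    print(sample_min_p_temperature_last(logits, rng=random.Random(0)))
```

With temperature applied last, a high temperature can flatten the choice among the surviving tokens but cannot bring back the ones min-P already removed.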