Recent releases of exllamav2 bring working fp8 cache support, which I've been very excited to test. This feature roughly doubles the maximum context length you can run with your model, without any visible downsides.
For models that barely fit (the kind you have to scream-stuff onto your GPU), this makes a world of difference.
Below, I show the updated maximum context I get with 2.4 and 2.5 bpw models:
These are on desktop Ubuntu, with a single 3090 also powering the graphics. The desktop's VRAM consumption varies between 0.56 and 0.7 GB, but it stayed at 0.56 GB during my tests.
For testing, I iteratively loaded each model with increasing context until OOM. VRAM usage does not increase once the context is set. These results should be replicable in text-gen-webui once fp8 cache support is implemented there.
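For anyone who wants to reproduce the probing outside text-gen-webui, here's roughly what the load looks like in plain exllamav2. The path is hypothetical and the class names are from the version I tested, so adjust for yours:

```python
# Minimal sketch: load an EXL2 model with the fp8 (8-bit) cache at a chosen
# context length, then bump max_seq_len and retry until it OOMs.
# Class names are from the exllamav2 version I tested; they may differ in yours.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit, ExLlamaV2Tokenizer

model_dir = "/models/Llama2-70B-exl2-2.4bpw"   # hypothetical local path

config = ExLlamaV2Config()
config.model_dir = model_dir
config.prepare()
config.max_seq_len = 16 * 1024                 # the context length being probed

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_8bit(model, lazy=True)  # fp8 KV cache instead of fp16
model.load_autosplit(cache)                    # fails with an OOM error if it doesn't fit

tokenizer = ExLlamaV2Tokenizer(config)
```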
2.4bpw
- fp8 cache - 16k (1k = 1024 tokens of context)
- No fp8 cache - 8k
- w/ speculative 4.0bpw draft - 10k (TinyLlama is the speculative draft model)
- w/ speculative 5.0bpw draft - 10k
- w/ speculative 6.0bpw draft - 7k
- w/ speculative 8.0bpw draft - 6k
2.5bpw
- fp8 cache - 10k
- No fp8 cache - 5k
- w/ speculative 4.0bpw draft - 5k
- w/ speculative 5.0bpw draft - 4k
Speculative results
When running the chat.py example, the results are consistently ~30 t/s; for chat tests, that is about 1.5x the original speed.
Most responses range between 28 and 33 t/s. I have not found a 70B model with poor results yet. Normally, on one 3090, it is 20 t/s.
Additional sampling?
The default loader for speculative decoding will probably have to be the regular ExLlamav2 loader. We would want sampling methods that synergize with speculative sampling, as shown in the "Typical Acceptance" section of this page:
https://sites.google.com/view/medusa-llm
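As I read that page, typical acceptance means a drafted token is kept whenever the target model gives it enough probability relative to an entropy-dependent threshold. A rough sketch of the criterion as I understand it (the epsilon/delta values are made up, and this is not exllamav2's actual code):

```python
import torch

def typical_accept(target_logits: torch.Tensor, draft_token: int,
                   epsilon: float = 0.3, delta: float = 0.09) -> bool:
    """My reading of Medusa-style typical acceptance: accept the drafted token
    if the target model's probability for it exceeds a threshold that shrinks
    when the target distribution is confident (low entropy)."""
    probs = torch.softmax(target_logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-10))).sum()
    threshold = min(epsilon, delta * torch.exp(-entropy).item())
    return probs[draft_token].item() > threshold
```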
Higher tps
When lowering the repetition penalty from 1.1 to 1.0, the tokens per second for many simple prompts is often 2-3x higher in the speculative example, but generation becomes prone to repeating phrases.
I'm not sure whether this setting matters more for low-bpw models, or whether a 2x gain is also typical at 4.65bpw.
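For reference, this is the knob I'm talking about in exllamav2's sampler settings; the attribute name is from the version I used, so check yours:

```python
from exllamav2.generator import ExLlamaV2Sampler

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8                 # illustrative values
settings.top_p = 0.9
settings.token_repetition_penalty = 1.0    # 1.0 = off; 1.1 cost me most of the speculative speedup
```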
Draft model
It did not seem to matter whether the 1B TinyLlama draft model was undertrained or finetuned. It also did not seem to matter whether TinyLlama was quantized to 4, 5, 6, or even 3 bpw. Each of them allowed ~30 t/s speeds.
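For completeness, here's roughly how I wire the draft model up outside of chat.py. The paths are hypothetical and the generator's draft arguments are named as they appear in the exllamav2 version I tested, so double-check against your install:

```python
# Rough sketch: load the 70B plus a TinyLlama draft model and hand both to the
# streaming generator for speculative decoding.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator

def load(model_dir, max_seq_len):
    config = ExLlamaV2Config()
    config.model_dir = model_dir
    config.prepare()
    config.max_seq_len = max_seq_len
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache_8bit(model, lazy=True)   # fp8 cache for both models
    model.load_autosplit(cache)
    return config, model, cache

config, model, cache = load("/models/Llama2-70B-exl2-2.4bpw", 10 * 1024)     # hypothetical paths
_, draft_model, draft_cache = load("/models/TinyLlama-1.1B-exl2-4.0bpw", 10 * 1024)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2StreamingGenerator(
    model, cache, tokenizer,
    draft_model=draft_model,
    draft_cache=draft_cache,
    num_speculative_tokens=5,   # how many draft tokens to propose per step
)
```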
Thoughts
20 t/s only goes down to 17 t/s? - I don't really notice a drop in t/s when feeding huge articles at 16k context into the chat.py example; maybe flash decoding is already supported?
Perplexity scores? - People will have benchmarked 70B 2.x models, some of which are calibrated on wikitext. I think this is one of those models; I ran perplexity tests on it in text-gen-webui: https://huggingface.co/turboderp/Llama2-70B-exl2
Usually, only base models and comparisons between models of equivalent parameter count are useful. But there are a lot of unknowns when trying to make a proper comparison. For instance:
For 2.5bpw:
- With stride 512 and length 512, I get a perplexity of 8.
- With stride 512 and length 2048, I get a perplexity of 5. At what context length should 2.5 and 4.65 be compared…?
For 2.4bpw:
- With stride 512 and length 2048, I get a perplexity of 5.6.
Should we conclude, roughly, that the 2.5bpw model has e.g.
5/3.4 ≈ 47% higher perplexity than the original model when already optimized for its calibration data, while
the 2.4bpw model is 5.6/3.4 ≈ 65% higher?
(I don't know the perplexity score of the 4.65bpw model, so this can't be the whole answer.)
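For what it's worth, those percentages are just the relative perplexity increase over the assumed ~3.4 full-precision wikitext score:

```python
# Relative perplexity increase over an assumed ~3.4 fp16 wikitext baseline.
base_ppl = 3.4
for label, ppl in [("2.5bpw", 5.0), ("2.4bpw", 5.6)]:
    increase = (ppl / base_ppl - 1) * 100
    print(f"{label}: {increase:.0f}% higher perplexity")   # ~47% and ~65%
```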
Worth using?
If you have a single 24 GB GPU, it's worth trying at least once.
I'm not familiar enough with the 13B models to argue that this is superior, and I'm not planning to convince you at all. The above is just meant to help if you're considering this lower-bpw option.
If you want to try squeezing a 70B in, here are a few guidelines:
- Windows: Uninstall previous drivers cleanly (try NVCleanstall) to avoid any unwanted residual effects of the RAM-swapping mechanism (do not just downgrade), then install a driver version <531. Alternatively, use the latest driver and make sure no other program suddenly grabs VRAM, or an essential part of your model may end up swapped out to system RAM.
- Flash-attention-2 and the fp8 KV cache should now work on Windows with text-gen-webui, though I haven't tested it. These results should be replicable on Windows, but I'm not 100% sure whether Windows has a lower usable VRAM cap. On Linux, nvtop shows 23.85/24.00 GB, which seems to be my maximum.
- Get an idea of your maximum context by closing all programs, disabling browser hardware acceleration, and loading a 2.4bpw model in text-gen-webui with increasing context until OOM.
- Double that for your expected maximum context with the fp8 cache.
- Each 1k of context should cost ~0.28 GB, across all bpw models (see the sketch below).
- If I had a CPU with integrated graphics, I think I would get an extra 4k out of my GPU. Don't be surprised if you can get higher than the above results.
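Putting those rules of thumb together, here's a hypothetical back-of-the-envelope estimator. It assumes the ~0.28 GB per 1k refers to the regular fp16 cache and that the fp8 cache simply doubles whatever fits, so treat it as a rough guide, not exact numbers:

```python
# Back-of-the-envelope context estimate from the rules of thumb above:
# ~0.28 GB of VRAM per 1k of context (fp16 cache), and the fp8 cache
# roughly doubles whatever fits with the fp16 cache.
def estimate_max_context_k(free_vram_gb: float, gb_per_1k: float = 0.28,
                           fp8_cache: bool = True) -> float:
    context_k = free_vram_gb / gb_per_1k
    return context_k * 2 if fp8_cache else context_k

# e.g. ~2.2 GB left after loading the 2.4bpw weights on a 24 GB card:
print(estimate_max_context_k(2.2))                    # ~15.7k with fp8 cache
print(estimate_max_context_k(2.2, fp8_cache=False))   # ~7.9k without
```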
That's all, hopefully you found this useful, thanks for reading! ヘ(◕。◕ヘ)
Thanks for the writeup. What’s your subjective experience with 2.4bpw or 2.5bpw models? Are they severely degraded, or still quite smart?
2.3 and 2.4bpw models work fine and are smart. I used the ExLlamav2_HF loader (not for the speculative tests above) because I haven't worked out the right sampling parameters yet.
I don't really have much to compare them to: I rarely use a 70B Q4_K_M for summarization (split across RAM+VRAM), and I use Mistral on other devices, but only for writing stories. Here are some things I did, though:
- Create various stories and ask the model to repeatedly modify them ("make it so the bad guy wins"; "continue the story"; "say this as much as possible").
- Use it as an instruct model: ask it to create various poems and chain a memorable story from a task list.