From the issue about this in the exllamav2 repo, QuIP was using more memory and was slower than exl. How much context can you fit?
I’m not getting a super huge jump with the bigger models yet, just a mild bump. I got a P100 so I can load the low-100B models and still have exllama work. That’s 64GB of FP16-capable VRAM.
For bigger models I can use FP32 and put the 2 P40s back in. That’s 120GB of VRAM. Also 6 vidya cards :P
It required building toward this type of system from the start. I’m not made of money either; I just upgrade it over time.
It really is Christmas.
I got a P100 for like $150 to see how well it will work with exllama + 3090s and if it is any faster at SD.
These guys are all gone already.
Would be cool to see this in a 34b and 70b.
Aren’t there people selling such services to companies here? Implementing RAG, etc.
Heh, 72b with 32k and GQA seems reasonable. Will make for interesting tunes if it’s not super restricted.
That’s a good sign if anything.
one is not enough
Does it give refusals on base? 67B sounds like full foundation train.
Something is wrong with your environment. Even P40s give more than that.
The other option is that you aren’t generating enough tokens to get a proper t/s reading. What was the total inference time?
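For what it’s worth, here’s a minimal sketch (all numbers hypothetical) of why a short generation makes t/s look worse than it really is: the one-time prompt-processing cost gets averaged into very few tokens.

```python
# Rough sketch (hypothetical numbers) of why short runs under-report t/s:
# total time includes one-time prompt processing, so when few tokens are
# generated the apparent tokens/sec looks much worse than the decode rate.
prompt_processing_s = 2.0   # assumed one-time cost for the prompt
decode_rate_tps = 15.0      # assumed true per-token decode speed

for generated_tokens in (16, 128, 1024):
    total_s = prompt_processing_s + generated_tokens / decode_rate_tps
    apparent_tps = generated_tokens / total_s
    print(f"{generated_tokens:>5} tokens -> apparent {apparent_tps:.1f} t/s "
          f"(true decode {decode_rate_tps} t/s)")
```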
Welcome to the beginning of the death of shared reality. It’s on the chopping block after objective truth. The latter is almost done.
GS and SG merge different models.
I just got a P100 for like $150, going to test it out and see how it does with its FP16 vs P40 for SD and exllama overflow.
The 4060 is faster, but it’s multiple times as expensive. For your sole GPU you really need 24GB+. AMD cards are becoming somewhat competitive but still come with some hassle and slowness.
CPU is going to give you 3t/s; it’s not really anywhere near, even with the best procs. Sure, get it for other things in the system, but don’t expect it to help much with ML. I guess newer platforms will get you faster RAM, but it’s not enough.
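A rough back-of-envelope (assumed numbers, not a benchmark) for why CPU decode speed caps out there: each generated token has to stream roughly the whole model out of RAM, so t/s is about bandwidth divided by model size.

```python
# Back-of-envelope: token generation is roughly memory-bandwidth bound,
# so tokens/sec ~= RAM bandwidth / bytes read per token (~ model size).
# All numbers below are assumptions for illustration only.
ram_bandwidth_gbs = 90.0   # assumed dual-channel DDR5-class bandwidth
model_size_gb = 38.0       # assumed ~70B model at ~4-bit quantization

approx_tps = ram_bandwidth_gbs / model_size_gb
print(f"~{approx_tps:.1f} t/s upper bound")  # lands in the low single digits
```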
Wonder how L1 65b would do with L2 70b.
Pretty cool hack. Beats CPU inference at those speeds for sure.
Maxwell is pretty dead.
P40 and 3090, those are your “affordable” 24GB GPUs, unless you want to go AMD or have enough to run 3x16GB or something.
Let the merging begin!
Good luck. Centrism is not allowed. You would have to skip the last decade of internet data. Social engineering works for both people and language models much the same.