  • I’ve been out of the loop for a bit, so despite this thread coming back again and again, I’m finding it useful/relevant/timely.

    What I’m having a hard time figuring out is whether running text-generation-webui with exllama_hf is still SOTA. Thus far, I ALWAYS use GPTQ on Ubuntu, and like to keep everything in VRAM on 2x3090. (I also run my own custom chat front-end, so all I really need is an API.)

    I know exllamav2 is out, exl2 format is a thing, and GGUF has supplanted GGML. I’ve also noticed a ton of quants from TheBloke in AWQ format (often *only* AWQ, with no GPTQ available) - but I’m not clear on which front-ends support AWQ. (I looked at vLLM, but it seems like more of a library/package than a front-end.)

    edit: Just checked, and it looks like text-generation-webui supports AutoAWQ. Guess I should have checked that earlier.

    I guess I’m still curious whether others are using something besides text-generation-webui for all-VRAM model loading. My only issue with text-generation-webui (that comes to mind, anyway) is that it’s single-threaded; for experimenting with agents, it would be nice to be able to run multi-threaded.
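
    For what it’s worth, the multi-threaded agent part can be handled client-side: fan prompts out with a thread pool and let the backend queue them. A minimal sketch - the `query_model` body is a placeholder, and the endpoint/port in the comment are assumptions (text-generation-webui exposes an OpenAI-compatible API when started with `--api`; check your actual port):

    ```python
    from concurrent.futures import ThreadPoolExecutor

    def query_model(prompt: str) -> str:
        # Placeholder for the real call. With text-generation-webui started
        # with --api, a POST to its OpenAI-compatible endpoint would go here,
        # e.g. requests.post("http://127.0.0.1:5000/v1/chat/completions", ...)
        # (URL/port are assumptions - check your own setup).
        return f"response to: {prompt}"

    def run_agents(prompts):
        # Fan several agent prompts out concurrently. The backend may still
        # serialize generation on the GPUs, but the agents at least don't
        # block each other on the client side while waiting on I/O.
        with ThreadPoolExecutor(max_workers=4) as pool:
            return list(pool.map(query_model, prompts))

    results = run_agents(["plan", "critique", "summarize"])
    ```

    A single-threaded server just means the requests queue up; the client code above doesn’t change if you later point it at a backend with real batching (e.g. vLLM).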