@ReturningTarzan

ReturningTarzan@alien.top · 1 year ago

Yes, the model directory is just all the files from a HF model, in one folder. You can download them directly from the “files” tab of a HF model by clicking all the little download arrows, or there’s huggingface-cli. Also git can be used to clone models if you’ve got git-lfs installed.

It specifically needs the following files:

config.json
*.safetensors
tokenizer.model (preferable) or tokenizer.json
added_tokens.json (if the model has one)

But it may utilize other files in the future such as tokenizer_config.json, so best just to download all the files and keep them in one folder.
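
As a quick sanity check on the file list above, here is a minimal sketch that reports which of the required files are missing from a model folder. The function name `missing_model_files` is made up for illustration and is not part of ExLlamaV2; added_tokens.json is left out because it is optional.

```python
from pathlib import Path


def missing_model_files(model_dir):
    """Return the required entries that are absent from model_dir.

    Hypothetical helper for illustration only; it just mirrors the
    file list given in the comment above.
    """
    root = Path(model_dir)
    missing = []

    if not (root / "config.json").is_file():
        missing.append("config.json")

    # At least one weights file is needed.
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")

    # tokenizer.model is preferred, tokenizer.json is the fallback.
    has_tokenizer = (root / "tokenizer.model").is_file() or (root / "tokenizer.json").is_file()
    if not has_tokenizer:
        missing.append("tokenizer.model or tokenizer.json")

    return missing
```

Calling `missing_model_files("/path/to/model")` on a complete download should return an empty list.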

ReturningTarzan@alien.top · 1 year ago

There’s a bunch of examples in the repo. Various Python scripts for doing inference and such, even a Colab notebook now.

As for the “usual” Python/HF setup, ExLlama is kind of an attempt to get away from Hugging Face. It reads HF models but doesn’t rely on the framework. I’ve been meaning to write more documentation and maybe even a tutorial, but in the meantime there are those examples, the project itself, and a lot of other projects using it. TabbyAPI is coming along as a stand-alone OpenAI-compatible server to use with SillyTavern and in your own projects where you just want to generate completions from text-based requests, and ExUI is a standalone web UI for ExLlamaV2.

ReturningTarzan@alien.top · 1 year ago

It’s not. If the only thing you’re using the P40 for is as swap space for the 3090, then you’re better off just using system RAM, since you’ll have to swap via system RAM anyway.

ReturningTarzan@alien.top · 1 year ago

Most of those security issues are just silly. Like, oh no, what if the model answers a question with some “dangerous” knowledge that’s already in the top three search results if you Google the exact same question? Whatever will we do?

The other ones arise from putting an LLM across what should be a security boundary, for example by giving it access to personal documents while also exposing it through an interface to people who shouldn’t have that access. So a new, poorly understood technology provides novel ways for people to make bad assumptions in their rush to monetize it. News at 11.

Of course it’s still a great segment and easily the most interesting part of the video.

ReturningTarzan@alien.top · 1 year ago

To add to that: GPUs do support “conditional” matrix multiplication, they just don’t benefit from that type of optimization. Essentially, it takes as much time to skip a computation as it does to perform it. And in practice it can even take longer since the extra logic required to keep track of which computations to skip will add overhead.

In order for this to make sense on a GPU you need a way of completely sidestepping portions of the model, like the ability to skip whole layers that are not relevant (a bit like how MoE works already). If you have to load a weight from memory, or some sort of metadata to figure out what each individual weight is connected to, you’ve already allocated as many resources to that weight as you would if you simply used it in a streamlined matrix multiplication.

The same also holds to a lesser extent for efficient CPU implementations that also rely on SIMD computations, regular memory layouts and predictable control flows.
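
To make the argument above concrete, here is a toy cost model in plain Python that only counts memory reads. It is not a GPU benchmark, and the function names are made up for illustration. The point it sketches: a per-weight mask has to be read for every element, so skipping individual weights saves nothing, while skipping whole blocks needs only one flag per block.

```python
def dense_dot(weights, x):
    """Plain dot product: one weight read per element."""
    reads = 0
    acc = 0.0
    for i in range(len(x)):
        reads += 1  # load the weight
        acc += weights[i] * x[i]
    return acc, reads


def per_weight_skip_dot(weights, x, keep):
    """Skip individual weights based on a per-weight mask."""
    reads = 0
    acc = 0.0
    for i in range(len(x)):
        reads += 1  # load the mask entry (metadata) for this weight
        if keep[i]:
            reads += 1  # load the weight itself
            acc += weights[i] * x[i]
    return acc, reads


def block_skip_dot(weights, x, block_keep, block_size):
    """Skip whole blocks of weights based on one flag per block."""
    reads = 0
    acc = 0.0
    for b, keep_block in enumerate(block_keep):
        reads += 1  # load one flag for the entire block
        if not keep_block:
            continue
        start = b * block_size
        for i in range(start, min(start + block_size, len(x))):
            reads += 1  # load the weight
            acc += weights[i] * x[i]
    return acc, reads
```

With 8 weights and half of them skipped, the per-weight version does more reads than the dense one (mask plus weights), while the block version with two blocks of 4 does fewer.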

ReturningTarzan@alien.top · 1 year ago

I recently added Mirostat, min-P (the new one), tail-free sampling, and temperature-last as an option. I don’t personally put much stock in having an overabundance of sampling parameters, but they are there now for better or worse. So for the exllamav2 (non-HF) loader in TGW, it can’t be long before there’s an update to expose those parameters in the UI.
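
For anyone unfamiliar with two of those options, here is a minimal sketch of min-P filtering and what "temperature-last" changes. This is not ExLlamaV2's implementation; the function names are made up, and it assumes the usual definition of min-P (keep tokens whose probability is at least `min_p` times the top token's probability).

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, dividing by temperature first."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p):
    """Zero out tokens below min_p * max(probs), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


def sample_distribution(logits, min_p=0.1, temperature=1.0, temperature_last=True):
    """Return the final token distribution (no random draw is made here)."""
    if temperature_last:
        # Filter on the untempered distribution, then apply temperature
        # to the survivors. p ** (1/T) is equivalent to dividing logits by T.
        probs = min_p_filter(softmax(logits), min_p)
        scaled = [p ** (1.0 / temperature) if p > 0 else 0.0 for p in probs]
        total = sum(scaled)
        return [s / total for s in scaled]
    # Temperature first: a high temperature flattens the distribution,
    # so more tokens can pass the min-P threshold.
    return min_p_filter(softmax(logits, temperature), min_p)
```

The practical difference is that with temperature last, a high temperature cannot rescue low-probability tokens that the filter would otherwise have removed.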

ReturningTarzan@alien.top · 1 year ago

I’m a little surprised by the mention of chatcode.py, which was merged into chat.py almost two months ago. Also, it doesn’t really require flash-attn-2 to run “properly”; it just runs a little better that way. It’s perfectly usable without it.

Great article, though. Thanks. :)

ReturningTarzan@alien.top · 1 year ago

Notepad mode is up fwiw. It probably needs more features, but it’s functional.

ReturningTarzan@alien.top · 1 year ago

Well, it depends on the model and stuff, and how you get to that 50k+ context. If it’s a single prompt, as in “Please summarize this novel: …” that’s going to take however long it takes. But if the model’s context length is 8k, say, then ExUI is only ever going to do prompt processing on up to 8k tokens, and it will maintain a pointer that advances in steps (the configurable “chunk size”).

So when you reach the end of the model’s native context, it skips ahead e.g. 512 tokens and then you’ll only have full context ingestion again after a total of 512 tokens of added context. As for that, though, you should never experience over a minute of processing time on a 3090. I don’t know of a model that fits in a 3090 and takes that much time to run inference on. Unless you’re running into the NVIDIA swapping “feature” because the model doesn’t actually fit on the GPU.
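
The stepping behavior described above can be sketched roughly as follows. This is not ExUI's actual code; the function names are made up, and it assumes that every time the window start moves, the whole visible context has to be ingested again.

```python
def window_start(total_tokens, max_context, chunk_size, start=0):
    """Advance the start pointer in chunk_size steps until the window fits."""
    while total_tokens - start > max_context:
        start += chunk_size
    return start


def count_full_reingests(token_counts, max_context, chunk_size):
    """Count how often the window moves as the conversation grows.

    token_counts is the running total of tokens after each update.
    """
    start = 0
    reingests = 0
    for total in token_counts:
        new_start = window_start(total, max_context, chunk_size, start)
        if new_start != start:
            reingests += 1  # window moved, so the context is processed again
            start = new_start
    return reingests
```

With an 8192-token context and a chunk size of 512, growing a conversation one token at a time from 8000 to 10000 tokens moves the window only four times, rather than on every token past the limit.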

ReturningTarzan@alien.top · 1 year ago

Notepad mode is almost ready. Probably I’ll release later today or early tomorrow.

ReturningTarzan@alien.top · 1 year ago

I’m working on a notepad mode for ExUI. It’s not quite ready, but probably sometime tomorrow.

ReturningTarzan@alien.top · 1 year ago

When you’re using non-instruct models for instruct-type questions, prompting is everything. For comparison, here are the first three questions put to Mistral-7B-instruct with correct prompt format at various bitrates up to FP16.

ReturningTarzan@alien.top · 1 year ago

“If anyone has suggestions, please let me know. Cheers!”

The suggestion I’d give, apart from finetuning, would just be to do some actual tests. Construct some scenarios that test the model’s ability to “show not tell” and so on, and contrast with smaller models and/or with a “null hypothesis” Frankenstein model where the added layers are just random matrices, etc.

Ideally, if there’s nothing you can do to objectively measure the model’s performance, try to set up a blind test of some sort to see if users actually prefer the Frankenstein model over the two models it was spliced together from.
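
One way such a blind test could be set up, as a minimal sketch: shuffle which side each model's output appears on so raters can't tell them apart, then check whether the preference count is distinguishable from a coin flip with a two-sided sign test. The function names and the trial layout are made up for illustration.

```python
import random
from math import comb


def blind_pairs(outputs_a, outputs_b, seed=0):
    """Randomize left/right placement for each pair of outputs.

    'a_side' records where model A ended up; it is kept away from raters
    and only used afterwards to score their choices.
    """
    rng = random.Random(seed)
    trials = []
    for a, b in zip(outputs_a, outputs_b):
        if rng.random() < 0.5:
            trials.append({"left": a, "right": b, "a_side": "left"})
        else:
            trials.append({"left": b, "right": a, "a_side": "right"})
    return trials


def two_sided_sign_test(wins_a, n):
    """p-value for wins_a preferences out of n under 'no preference' (p = 0.5)."""
    k = max(wins_a, n - wins_a)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, a 5-5 split out of 10 gives a p-value of 1.0 (no evidence of a preference), while a 10-0 split gives roughly 0.002.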

Not to disparage the project or anything, but confirmation bias is a real thing, and it’s especially rampant in the LLM space.

ReturningTarzan@alien.top · 1 year ago

I agree. We need at least some anecdotal evidence to back up the anecdotal claims. There’s one screenshot on the model page which looks fine (although it mixes past and present tense), but it’s not output you couldn’t get from a 7B model with some deliberate sampling choices and/or cherrypicking.