• 0 Posts
  • 14 Comments
Joined 1 year ago
Cake day: October 30th, 2023

  • Yes, the model directory is just all the files from a HF model, in one folder. You can download them directly from the “files” tab of a HF model by clicking all the little download arrows, or there’s huggingface-cli. Git can also be used to clone models if you’ve got git-lfs installed.

    It specifically needs the following files:

    • config.json
    • *.safetensors
    • tokenizer.model (preferable) or tokenizer.json
    • added_tokens.json (if the model has one)

    But it may make use of other files in the future, such as tokenizer_config.json, so it’s best to just download all the files and keep them in one folder.
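
    A minimal download sketch with the huggingface_hub Python package (the repo ID and target folder below are placeholders, not a specific model):

    ```python
    # Download a full model snapshot into one local folder.
    # Requires `pip install huggingface_hub`; repo ID and path are placeholders.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="some-org/some-model",       # hypothetical repo ID
        local_dir="/models/some-model",      # the folder you point ExLlama at
        local_dir_use_symlinks=False,        # copy real files instead of symlinking into the cache
    )

    # Roughly equivalent on the command line (exact flags may vary by version):
    #   huggingface-cli download some-org/some-model --local-dir /models/some-model
    ```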


  • There’s a bunch of examples in the repo. Various Python scripts for doing inference and such, even a Colab notebook now.

    As for the “usual” Python/HF setup, ExLlama is kind of an attempt to get away from Hugging Face: it reads HF models but doesn’t rely on the framework. I’ve been meaning to write more documentation and maybe even a tutorial, but in the meantime there are those examples, the project itself, and a lot of other projects using it. TabbyAPI is coming along as a standalone OpenAI-compatible server for use with SillyTavern, or in your own projects where you just want to generate completions from text-based requests, and ExUI is a standalone web UI for ExLlamaV2.
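
    For a rough idea of what those text-based requests look like against an OpenAI-compatible server like TabbyAPI (the host, port and key below are placeholders; check the project’s docs for the real defaults):

    ```python
    # Plain text completion against an OpenAI-compatible endpoint.
    # URL, port and API key are placeholders, not TabbyAPI's documented defaults.
    import requests

    resp = requests.post(
        "http://localhost:5000/v1/completions",
        headers={"Authorization": "Bearer your-api-key"},
        json={
            "prompt": "Once upon a time",
            "max_tokens": 128,
            "temperature": 0.8,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["text"])
    ```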



  • Most of those security issues are just silly. Like, oh no, what if the model answers a question with some “dangerous” knowledge that’s already in the top three search results if you Google the exact same question? Whatever will we do?

    The other ones arise from inserting an LLM across what should be a security boundary, e.g. by giving it access to personal documents while also exposing an interface to people who shouldn’t have that access. So a new, poorly understood technology gives people novel ways to make bad assumptions in their rush to monetize it. News at 11.

    Of course it’s still a great segment and easily the most interesting part of the video.


  • To add to that: GPUs do support “conditional” matrix multiplication; they just don’t benefit from that type of optimization. Essentially, it takes as much time to skip a computation as it does to perform it, and in practice it can take even longer, since the extra logic required to keep track of which computations to skip adds overhead.

    In order for this to make sense on a GPU, you need a way of completely sidestepping portions of the model, like the ability to skip whole layers that aren’t relevant (a bit like how MoE already works). If you have to load a weight from memory, or some sort of metadata to figure out what each individual weight is connected to, you’ve already spent as many resources on that weight as you would have by simply using it in a streamlined matrix multiplication.

    The same holds, to a lesser extent, for efficient CPU implementations, which likewise rely on SIMD computation, regular memory layouts, and predictable control flow.
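
    A toy NumPy illustration of the point (shapes and the 50% mask are arbitrary): “skipping” weights with a mask does every load and multiply the dense version does, plus the masking work on top.

    ```python
    # Toy illustration: masking out "skipped" weights saves nothing, because every
    # weight is still loaded and multiplied; only dropping whole blocks would help.
    import numpy as np

    hidden, out = 4096, 4096                  # arbitrary layer size
    x = np.random.randn(hidden).astype(np.float32)
    W = np.random.randn(out, hidden).astype(np.float32)
    mask = np.random.rand(out, hidden) > 0.5  # pretend half the weights are "irrelevant"

    y_dense = W @ x                           # streamlined dense matmul
    y_masked = (W * mask) @ x                 # "conditional" version: same memory traffic,
                                              # same multiplies, plus the extra mask work
    ```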


  • I recently added Mirostat, min-P (the new one), tail-free sampling, and temperature-last as an option. I don’t personally put much stock in having an overabundance of sampling parameters, but they are there now for better or worse. So for the exllamav2 (non-HF) loader in TGW, it can’t be long before there’s an update to expose those parameters in the UI.
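
    For reference, the core of min-P fits in a few lines; this is just the general idea, not the loader’s actual code. Temperature-last then just means the temperature scaling is applied after filters like this one instead of before them.

    ```python
    # Sketch of min-P: keep tokens whose probability is at least min_p times the
    # probability of the most likely token, then renormalize. Illustration only.
    import numpy as np

    def min_p_filter(logits: np.ndarray, min_p: float = 0.1) -> np.ndarray:
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        keep = probs >= min_p * probs.max()   # threshold scales with the top token
        filtered = np.where(keep, probs, 0.0)
        return filtered / filtered.sum()      # distribution over the surviving tokens
    ```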




  • Well, it depends on the model and stuff, and on how you get to that 50k+ context. If it’s a single prompt, as in “Please summarize this novel: …”, that’s going to take however long it takes. But if the model’s context length is, say, 8k, then ExUI will only ever do prompt processing on up to 8k tokens, and it maintains a pointer that advances in steps (the configurable “chunk size”).

    So when you reach the end of the model’s native context, it skips ahead by e.g. 512 tokens, and then you’ll only get full context ingestion again after a total of 512 tokens of added context. Either way, you should never see over a minute of processing time on a 3090. I don’t know of a model that fits on a 3090 and takes that long to run inference on, unless you’re hitting the NVIDIA swapping “feature” because the model doesn’t actually fit on the GPU.
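
    The bookkeeping amounts to something like this (a simplified sketch of the idea, not ExUI’s actual code; the names are made up):

    ```python
    # Simplified sketch of the sliding window described above; not ExUI's real code.
    def window_start(total_tokens: int, max_ctx: int = 8192, chunk: int = 512) -> int:
        """First token of the visible context window.

        The start only moves in steps of `chunk`, so the cached prefix stays valid
        and full reingestion happens only once per `chunk` newly added tokens.
        """
        if total_tokens <= max_ctx:
            return 0                          # everything still fits
        overflow = total_tokens - max_ctx
        return ((overflow + chunk - 1) // chunk) * chunk  # round the skip up to a chunk boundary
    ```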





  • If anyone has suggestions, please let me know. Cheers!

    The suggestion I’d give, apart from finetuning, would just be to do some actual tests. Construct some scenarios that test the model’s ability to “show not tell” and so on, and contrast with smaller models and/or with a “null hypothesis” Frankenstein model where the added layers are just random matrices, etc.

    Failing that, if there’s nothing you can objectively measure, try to set up a blind test of some sort to see whether users actually prefer the Frankenstein model over the two models it was spliced together from.
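
    Even something crude does the job for the blind test, as long as the rater can’t tell which model produced which output. Everything below is hypothetical scaffolding: generate_a and generate_b stand in for the Frankenstein model and one of its source models.

    ```python
    # Crude blind A/B preference trial: show two completions in random order and
    # record which one the rater prefers. The generate functions are hypothetical.
    import random

    def blind_trial(prompt: str, generate_a, generate_b) -> str:
        outputs = [("A", generate_a(prompt)), ("B", generate_b(prompt))]
        random.shuffle(outputs)               # hide which model is which
        for i, (_, text) in enumerate(outputs, 1):
            print(f"--- Response {i} ---\n{text}\n")
        pick = int(input("Preferred response (1 or 2)? "))
        return outputs[pick - 1][0]           # label of the preferred model

    # Tally over a set of prompts:
    #   wins = [blind_trial(p, frankenmodel, base_model) for p in prompts]
    #   print("Frankenstein:", wins.count("A"), "Base:", wins.count("B"))
    ```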

    Not to disparage the project or anything, but confirmation bias is a real thing, and it’s especially rampant in the LLM space.