Joined 1 year ago
Cake day: October 21st, 2023

  • Mostly I’m still using slightly older models, with a few slightly newer ones now:

    • marx-3b-v3.Q4_K_M.gguf for “fast” RAG inference,

    • medalpaca-13B.ggmlv3.q4_1.bin for medical research,

    • mistral-7b-openorca.Q4_K_M.gguf for creative writing,

    • NousResearch-Nous-Capybara-3B-V1.9-Q4_K_M.gguf for creative writing, and probably for giving my IRC bots conversational capabilities (a work in progress),

    • puddlejumper-13b-v2.Q4_K_M.gguf for physics research, questions about society and philosophy, “slow” RAG inference, and translating between English and German,

    • refact-1_6b-Q4_K_M.gguf as a coding copilot, for fill-in-the-middle,

    • rift-coder-v0-7b-gguf.git as a coding copilot when I’m writing Python or trying to figure out my coworkers’ Python,

    • scarlett-33b.ggmlv3.q4_1.bin for creative writing, though less than I used to.

    I also have several models which I’ve downloaded but not yet had time to evaluate, and am downloading more as we speak (though even more slowly than usual; a couple of weeks ago my download rates from HF dropped by roughly a third, and I don’t know why).

    Some which seem particularly promising:

    • yi-34b-200k-llamafied.Q4_K_M.gguf

    • rocket-3b.Q4_K_M.gguf

    • llmware’s “bling” and “dragon” models. I’m downloading them all, though so far there are only GGUFs available for three of them. I’m particularly intrigued by the prospect of llmware-dragon-falcon-7b-v0-gguf, which is tuned specifically for RAG and is supposedly “hallucination-proofed”, and llmware-bling-stable-lm-3b-4e1t-v0-gguf, which might be a better IRC-bot conversational model.

    Of all of these, the one I use most frequently is PuddleJumper-13B-v2.




  • Sure! I’ve been doing a few LLM’ing things in Perl:

    • A previous project, implemented in Perl, indexes a local Wikipedia dump in Lucy and allows searching for pages. I’ve been reusing that project for RAG inference.

    • My “infer” utility is written in Perl. It wraps llama.cpp’s “main” utility with IPC::Open3 and I’m using it for inference, for RAG, for stop-words, for matching prompt templates to models, and for summarization. It’s gloriously broken at the moment and in dire need of a refactor.

    • I recently started writing a “copilot” utility in Perl, to better streamline using inference for research and code-writing copilots. It also wraps llama.cpp’s “main”, but in a much simpler way than “infer” (blocking I/O, no stop-words, no attempt to detect when the LLM infers the prompt text, etc.).

    If you’re more interested in using the existing Python libraries and not wrapping llama.cpp, you should take a look at the Inline::Python module. I’ve only dabbled with LangChain, but if/when I get back to it, I will probably implement Perl bindings with a simple Inline::Python wrapper. It makes it pretty easy.

    If you do decide to wrap llama.cpp, you might be more comfortable with IPC::Run rather than IPC::Open3. It’s considered the more modern module. I’m just using IPC::Open3 out of familiarity.
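    For anyone curious, a minimal sketch of the IPC::Open3 approach with stop-word abort might look like the following. The child command here is a stand-in (“echo”); in practice it would be llama.cpp’s “main” with its model and prompt flags, and the stop-word list would come from the prompt template:

```perl
use strict;
use warnings;
use IPC::Open3;
use Symbol 'gensym';

# Sketch: run a child command, stream its stdout line by line, and abort
# when a stop-word appears.  "echo" stands in for llama.cpp's "main".
sub run_with_stops {
    my ($cmd, $stops) = @_;
    my $err = gensym;    # separate stderr handle (open3 won't create one)
    my $pid = open3(my $in, my $out, $err, @$cmd);
    close $in;           # one-shot inference: no interactive input

    my $text = '';
    LINE: while (my $line = <$out>) {
        for my $stop (@$stops) {
            if (index($line, $stop) >= 0) {
                kill 'TERM', $pid;    # stop-word seen: abort inference
                last LINE;
            }
        }
        $text .= $line;
    }
    waitpid($pid, 0);    # reap the child
    return $text;
}

print run_with_stops(['echo', "Once upon a time\nUSER: new prompt"], ['USER:']);
```

    The same structure ports to IPC::Run almost mechanically; the main difference is that IPC::Run manages the select loop and handle plumbing for you.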






  • When I look at the leaderboard, I mostly pay attention to TruthfulQA, as it seems most predictive of models which are good for my use-case. YMMV of course.

    Once I’ve downloaded a model, I’ll fiddle around with different llama.cpp parameters and prompt templates, figuring out what works best for it, and then send it through my test framework, which has it infer five times on each of several prompts.

    Evaluation of test results is fairly subjective, but there are some obvious problems which recur, like not inferring an answer at all, or inferring a new user prompt for itself to answer.

    I just finished a compare-and-contrast of Marx-3B vs Marx-3B-v3 using that test framework, which you can see (along with raw test results) here: https://old.reddit.com/r/LocalLLaMA/comments/17xsliz/marx_3b_v3_and_akins_3b_gguf_quantizations/ka2fd19/

    I’ve been meaning to add some simple assessment logic to my test framework, which tries to guess at the quality of inferred replies, but haven’t made it a priority.
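    A first cut at that assessment logic could be a couple of heuristics for the recurring failure modes (a hypothetical sketch, not part of the actual framework; the “USER:” pattern is an assumption, since prompt templates vary by model):

```perl
use strict;
use warnings;

# Hypothetical assessment heuristics: flag replies which infer nothing,
# or which infer a new user prompt for themselves to answer.
sub assess_reply {
    my ($reply) = @_;
    return 'no_answer'      if $reply =~ /^\s*$/;         # inferred nothing
    return 'self_prompting' if $reply =~ /^\s*USER:/m;    # inferred a new user prompt
    return 'ok';
}

for my $reply ("", "The answer is 4.", "Sure!\nUSER: now write a poem") {
    printf "%-14s <- %s\n", assess_reply($reply), substr($reply, 0, 20);
}
```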


  • I tested Marx-3B-v3 on my laptop, using llama.cpp (commit dfc7cd48b1cc31d759c093e917a18c0efe03d0e8) and my usual test framework, which prompts a series of one-shots, inferring each prompt five times.

    These tests are designed to cover a variety of use-cases, and models are not expected to do equally well on all use-cases. Also, they were written with larger models in mind (30B, 70B) and Marx-3B is much, much smaller than these, so we should not expect too much.

    Marx-3B-v3 is prone to infer new user prompts, a problem I run into with some models. I’m not sure if the problem is intrinsic to the model, particular to the GGUF, or something in the llama.cpp params, and I haven’t figured out a good way to avoid them except to specify stop-words which abort inference (which my test framework does not yet support).

    This critique compares the replies of the original Marx-3B with those of Marx-3B-v3, ignoring the extraneous user prompts inferred by Marx-3B-v3.

    The raw test results are here:

    http://ciar.org/h/test.1696148998.marx.txt

    http://ciar.org/h/test.1700499482.marx3.txt

    Test “creativity:arzoth”:

    Creative writing, describing AD&D fantasy setting.

    The original Marx-3B tended to repeat parts of the prompt back to the user, and provided little original content of its own. What original content it did infer was not very imaginative.

    The Marx-3B-v3 model is much better at providing original content, and almost never repeats part of the prompt. It is prone to the occasional non sequitur, and isn’t as eloquent as some larger models, but overall it does all right and a much better job than the original Marx-3B.

    Test “creativity:song_kmfdm”:

    “Write a dark song in the style of KMFDM”.

    The original Marx-3B failed to generate any content at all in two out of five test iterations. When it did infer replies, it did not adhere to KMFDM’s style, and its lyrics were not eloquent, nor did they scan well, nor rhyme much.

    The Marx-3B-v3 model only failed to generate content in one iteration. Its reply in another iteration was a suggestion to listen to a Front Line Assembly song. I do enjoy Front Line Assembly, but this wasn’t what was asked of it! :-) In another iteration it described its approach to writing the music, which was actually pretty cool but it offered no lyrics.

    In the two iterations where it did venture song lyrics, they were not very eloquent, but did scan better than Marx-3B’s lyrics, and were recognizably in KMFDM’s style. Overall an improvement over the original model.

    Test “creativity:song_som”:

    “Write a dark song in the style of Sisters of Mercy”.

    Marx-3B inferred lyrics which were kind of generic, did not scan well, did not rhyme, and were only vaguely in the style of Sisters of Mercy.

    Marx-3B-v3 failed to infer any lyrics in one iteration. In the other iterations its lyrics were somewhat more eloquent than Marx-3B’s, and scanned slightly better, but were still only vaguely in the style of Sisters of Mercy.

    Test “creativity:song_halestorm”:

    “Write a dark song in the style of Halestorm”.

    Marx-3B inferred generic lyrics which did not scan well, did not rhyme, and did not resemble Halestorm’s style.

    Marx-3B-v3 inferred no content for one iteration, and inferred step-by-step how-tos for writing songs for two iterations. When it did infer song lyrics, they were somewhat eloquent, but did not rhyme and did not particularly resemble Halestorm’s style.

    Something I found interesting was that in one iteration where it inferred a step-by-step how-to, it accurately described Halestorm’s style (“heavy metal sound and edgy lyrics”), so it clearly had some exposure to Halestorm in its training data, but was not able to use that knowledge to replicate its style.

    Test “humor:noisy_oyster”:

    First half of a classic joke, posing a nonsensical question with alliteration.

    Marx-3B failed to infer any response in any of the test’s iterations.

    Marx-3B-v3 failed to infer any response in four iterations, but managed to infer a witty, humorous response in one iteration.

    Test “math:yarn_units”:

    Poses an imprecise physical units conversion problem.

    Marx-3B failed to infer any reply at all in any iteration.

    Marx-3B-v3 did not infer replies in two iterations. In others it talked about some of the relevant factors in calculating an answer, but when it attempted math it was outrageously wrong (which is typical of most models, to be fair).

    Test “analysis:lucifer”:

    Compare and contrast similar mythologies from different cultures and eras.

    Marx-3B fails to respond in two iterations. In the others it makes relevant observations, but is prone to hallucination. It contrasts differences between the myths in one iteration.

    Marx-3B-v3 failed to respond in four out of five test iterations. In the remaining one it blathers about the subject without providing meaningful analysis.

    Test “analysis:foot_intelligence”:

    Critique misapplication of the scientific method.

    Marx-3B fails to reply in three out of five iterations. In one iteration it suggested the methodology that should have been used, and in the other iteration it speculated on how the prompt’s fallacious reasoning might be correct.

    Marx-3B-v3 also fails to reply in three out of five iterations. In the other iterations it speculates on how the prompt’s fallacious reasoning might be correct. It is more eloquent about this than the original.

    Test “reason:sally_siblings”:

    Math and common sense, counting the siblings of Sally.

    Marx-3B fails to respond in one iteration, and blathers in the other iterations. When it attempts math, its math is outrageously wrong.

    Marx-3B-v3 suggests a correct but incomplete way to solve the problem in one iteration, outrageously incorrect reasoning and math in three iterations, and gets close in one iteration but can’t make the mental leap necessary to come up with the right answer.

    Test “coding:jpeg_makefile”:

    Write a program in “make” to convert image formats.

    Marx-3B mostly offers accurate solutions, though one is wrong and some of the others would have irrelevant/undesirable side-effects.

    Marx-3B-v3 offered one solution in C rather than in make, suggested three how-tos without solutions, and offered one working “make” implementation.

    Test “analysis:breakfast”:

    Word problem involving math and common sense.

    Marx-3B failed to reply in three iterations, but did a great job in the other two. It wandered a bit into other dietary considerations, and did not provide specific caloric figures.

    Marx-3B-v3 also failed to infer replies in three iterations, started a reply in another but never finished, and provided a very good answer in one which suggested specific foods and their caloric and protein content.

    When Marx-3B-v3 works at all, it seems to do better than Marx-3B at this kind of prompt.

    Test “analysis:birthday”:

    Word problem involving common sense.

    Marx-3B performed very well on this test, providing eloquent, well-thought-out lists and personable flavor text.

    Marx-3B-v3 performed even better than Marx-3B, providing even more comprehensive lists of high quality.

    Test “analysis:apple_pie”:

    Word problem involving knowledge and common sense.

    Marx-3B failed to reply twice, and inferred its own user prompt once. For the other iterations it offered very reasonable-seeming recipes.

    Marx-3B-v3 also offered reasonable recipes, slightly better than the original.

    Test “science:neutron_reflection”:

    Nuclear physics and math test.

    Marx-3B got close at times, but referred to inappropriate formulae, conflated neutrons with photons, conflated reflection with absorption, and conflated nuclear interactions with newtonian physics. When it attempted arithmetic, it was completely wrong.

    Marx-3B-v3 was similar, and tended to do a better job of explaining its (fallacious) reasoning as a step-wise process. It incorrectly solved problems not actually asked.

    Test “science:flexural_load”:

    Material physics and math test.

    Marx-3B did well describing some relevant material attributes (and some irrelevant ones), but proceeded to solve problems other than the one described in the prompt, and solved them incorrectly. When it attempted arithmetic, its figures were way off.

    Marx-3B-v3 was even more eloquent about describing relevant material attributes, but deviated into solving problems not asked about in the prompt and sometimes conflated flexural load with pure compressive or tensile loads (flexural load being a combination of these). Sometimes it stopped short of describing a solution, and other times it described a correct approach but incorrect math, or a correct approach with misrepresented conditions, and sometimes it described incorrect approaches with incorrect math. This constitutes something of an improvement over the original model.

    Conclusion:

    Marx-3B-v3 is a noticeable improvement over the original. It performed no worse than the original in most tests, and somewhat better in some.

    Creative writing, reasoning, and math are not its strong points, but it does quite well inferring about common knowledge and fares okay with common sense questions. It also has some correct notions about physics, though is prone to hallucination and especially conflation.

    My typical use-case for Marx-3B has been RAG inference, backed by an indexed Wikipedia dump, and it has done fairly well. It is worth noting that small models infer at higher quality when given longer prompts, and many of these tests offer very short prompts, whereas RAG inference fills context to a large fraction of its limit.

    I have not yet tried Marx-3B-v3 for RAG inference, but based on these results I expect it to perform better than Marx-3B in that role. I will try using it for RAG inference and see how it fares.

    Kudos to u/bot-333 for providing small models which infer quickly on limited hardware and punch above their weight :-) It is much appreciated!





  • Yaay! :-) just in time for the weekend! I’ll give them a whirl :-)

    Thanks for the heads-up!

    As for datasets, I’ve been thinking that HelixNet might be instrumental in generating high-quality synthetic datasets (as were used to train Microsoft’s phi), but I haven’t had a chance to mess with that idea yet. Sorry I don’t have anything concrete to suggest.





  • You can absolutely do interesting and useful things with very little hardware, with quantized models, especially if you don’t mind if inference is slow. My preferred quantization is q4_K_M (with GGUF and llama.cpp).

    I started with a spare Lenovo T560 Thinkpad with 8GB of RAM, which handled 7B models no problem. That’s a $120 eBay purchase. Once I was hooked, I shifted to one of the Dell T7910s in the homelab and moved up to larger models.

    I’m still not using a GPU for anything. It’s been CPU inference, which is slow but otherwise great.

    You could get just about any $300 desktop and put a decent GPU in it (16GB VRAM will allow fast inference with 13B models, and 24GB should allow heavily-quantized 30B) and enjoy fast inference. The most expensive bit is the GPU.
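    The back-of-envelope math behind those VRAM figures is simple: parameter count times average bits per weight. A rough sketch, assuming q4_K_M averages somewhere near 4.85 bits per weight (my assumption; the exact figure varies by model and quantization mix):

```perl
use strict;
use warnings;

# Rough GGUF size estimate: parameters (in billions) times average bits
# per weight, divided by 8 bits/byte.  ~4.85 bits/weight for q4_K_M is an
# assumption; KV-cache and context overhead come on top of this figure.
sub model_gb {
    my ($params_b, $bits_per_weight) = @_;
    return $params_b * $bits_per_weight / 8;    # GB
}

printf "13B at q4_K_M: ~%.1f GB\n", model_gb(13, 4.85);   # room to spare in 16GB VRAM
printf "33B at q4_K_M: ~%.1f GB\n", model_gb(33, 4.85);   # tight fit in 24GB VRAM
```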

    See this sub’s wiki for more detailed hardware tips.