Any Easy and Local Way to Run Benchmarks?

xadiant@alien.top · 2 years ago

Any Easy and Local Way to Run Benchmarks?

mattapperson@alien.top · 2 years ago

I am working on just such a tool… but it’s not ready yet. I am building a CLI tool that lets you just run `$ ai evals run humanevalsplus openhermes-2.5` and your good to go. Uses Llama.cpp

ResearchTLDR@alien.top · 2 years ago

I would also be interested in this! Especially if we could create new custom evals and load them in.

mattapperson@alien.top · 2 years ago

Yes custom evals are supported.

uhuge@alien.top · 2 years ago

What is needed to get it done? Can anyone help or only a few days of your focused time are expected to lead to it?

mattapperson@alien.top · 2 years ago

It’s just a side project for now in my free time. Started building it for my own sanity. But it’s not really in any shape that someone could just jump right in and help. So unless you’re a VC willing to throw money at me to make it my full time job lol… probably a couple weeks?

My goal is to make it not just a tool to run evals, but to create a holistic build, test, use toolkit to do everything from:

Cleaning datasets
Generating synthetic training data from existing data and files
Creating LoRAs and full fine tunes
Prompt evaluation and automated iterations
Running evaluations/benchmarks.

Trying to do all that in a way that is appreciable and easy to use and understand for your average software engineer, not just ai scientists. This stuff should require the setup of 20 libraries, writing all the glue code, or require knowing Python.

vikarti_anatra@alien.top · 2 years ago

I would be interested to use such thing (especially if it’s possible to pass custom options to llama.cpp and ask for custom models to be loaded).

Would it be possible to do something like this:

I put list of models: OpenHermes-2.5-Mistral-7B, Toppy-7B, OpenHermes-2.5-AshhLimaRP-Mistral-7B, Noromaid-v0.1.1-20B, Noromaid-v1.1-13B

Tool download every model from HF with every quantization, runs tests, and provide table with tests results (including failed ones)

mattapperson@alien.top · 2 years ago

This can kinda be done, but it’s not as simple as just that. You would need to also infer in many cases the prompt templates. Also many/most benchmarks are designed with untuned models in mind, meaning you typically need to add a system prompt/instructions… doing that also adds complexity because the best prompt for one model is likely different from the next. Also chat vs instruct vs base models in the same eval would be… meh. That said I think there is value in this and working on it as part of my cli tool with some warnings that the results might be less then quantitative