• 0 Posts
  • 20 Comments
Joined 1 year ago
cake
Cake day: October 25th, 2023

  • ThisGonBHard@alien.top to LocalLLaMA · "30,000 AI models" · 1 year ago

    You don't talk about the "usuals".

    My go-to models for a long time were Stable Beluga 2 13B and 70B.

    Then the 13B got replaced by Mistral and the 70B by LZLV, and Airoboros Yi 34B came out, which worked great for me.

    As a rule: 7B gets CPU inferencing on 2-4 cores while the GPU is busy with something else.

    34B and 70B get GPU inferencing. The models trade blows despite the size difference, as they are different base models (Llama vs. Yi).
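    The rule of thumb above falls out of memory footprints. A back-of-the-envelope sketch, where the ~4.5 bits per weight (a typical Q4 quant) and the ~15% overhead for KV cache and buffers are my assumptions, not figures from the comment:

    ```python
    # Rough memory footprint of a quantized model (a sketch, not a benchmark).
    # Assumptions: ~4.5 bits/weight for a Q4-style quant, ~15% overhead for
    # KV cache and inference buffers. Both numbers are ballpark.

    def est_gib(params_billion, bits_per_weight=4.5, overhead=1.15):
        """Approximate GiB needed to load and run the model."""
        total_bytes = params_billion * 1e9 * bits_per_weight / 8 * overhead
        return total_bytes / 2**30

    for n in (7, 13, 34, 70):
        print(f"{n:>2}B -> ~{est_gib(n):.1f} GiB")
    ```

    A 7B at ~4 GiB fits in system RAM next to whatever the GPU is doing, while a 70B at ~42 GiB wants serious VRAM.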


  • ThisGonBHard@alien.top to LocalLLaMA · "Build advice." · 1 year ago

    512MB RAM

    When did they thaw you out of the ice?!

    Jokes aside, you probably mean 512 GB of RAM. That platform is old and slow: at best dual-channel DDR3-1333, much worse than even bottom-barrel dual-channel DDR4.

    A 3090 will not care about it as long as you are doing pure GPU inferencing and not spilling into system RAM; otherwise, the DDR3 and PCIe 2.0 will kill the performance.
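    The DDR3-vs-DDR4 gap is simple arithmetic: peak bandwidth is channels x transfer rate x 8 bytes per transfer. A quick sketch (DDR4-3200 chosen as the "bottom barrel" comparison point, my assumption):

    ```python
    # Peak DRAM bandwidth = channels * transfer rate (MT/s) * 8 bytes/transfer.
    # Shows why dual-channel DDR3-1333 badly trails even cheap DDR4.

    def bandwidth_gbs(channels, mts):
        """Theoretical peak bandwidth in GB/s for a 64-bit-per-channel bus."""
        return channels * mts * 8 / 1000

    ddr3 = bandwidth_gbs(2, 1333)  # dual-channel DDR3-1333 -> ~21.3 GB/s
    ddr4 = bandwidth_gbs(2, 3200)  # dual-channel DDR4-3200 -> ~51.2 GB/s
    print(f"DDR3-1333: {ddr3:.1f} GB/s, DDR4-3200: {ddr4:.1f} GB/s")
    ```

    CPU inference is bandwidth-bound, so roughly 2.4x less bandwidth means roughly 2.4x fewer tokens per second the moment a model spills out of VRAM.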



  • Both are bad choices:

    IDK about the 3090 and that PSU; that card can spike HARD, in the 1.6 kW range, and if your PSU is a lower-end one, it will kill it. I heard a lot of people in 120 V countries complain because it actually caused their lights to flicker.

    The 4060 Ti is limited to PCIe x8, running at 2.0 speeds on that board, but it is an XX50-class chip masquerading as a 60, so it sips power.

    The 3090 would be better, but you don't have the PC for it. Get a cheap used AMD B550 board for PCIe 4.0, 64 GB of RAM, and whatever CPU fits your remaining budget, anywhere from an R5 3600 up to an R7 5800X3D; AM4 is really well segmented for price, even new. You will get better gaming performance too, and can run a lot of stuff on CPU with 64 GB of RAM.
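    The PCIe limitation above is also just arithmetic: usable bandwidth per direction is lanes x per-lane rate x encoding efficiency. A sketch, assuming the standard encodings (8b/10b for Gen2, 128b/130b for Gen4):

    ```python
    # Approximate usable PCIe bandwidth per direction:
    # lanes * line rate (GT/s) * encoding efficiency / 8 bits per byte.
    # Gen2 uses 8b/10b encoding (80%); Gen4 uses 128b/130b (~98.5%).

    def pcie_gbs(lanes, gts, efficiency):
        """Theoretical payload bandwidth in GB/s, one direction."""
        return lanes * gts * efficiency / 8

    gen2_x8 = pcie_gbs(8, 5, 0.8)        # old board: ~4 GB/s
    gen4_x8 = pcie_gbs(8, 16, 128 / 130)  # B550 board: ~15.75 GB/s
    print(f"Gen2 x8: {gen2_x8:.2f} GB/s, Gen4 x8: {gen4_x8:.2f} GB/s")
    ```

    For pure GPU inference this mostly affects model load times and any layers spilled to system RAM, but a ~4x link-speed gap is hard to argue with.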



    OpenAI was not gonna release shit for consumers either way; the doomers are too scared of shadows to do it, and GPT-3.5 was deemed too advanced to make public by Ilya.

    Because Microsoft has GPT-4 too, I am pretty sure they are just gonna continue working on what they were working on before, as if nothing happened, under Microsoft; only now they are not shackled and can go full steam ahead.

    The doomers lost, because now the acceleration side is free and unshackled. At best they bought 4 months, but progress might come 3x faster after those.