• 0 Posts
  • 15 Comments
Joined 1 year ago
Cake day: October 30th, 2023


  • candre23@alien.top to LocalLLaMA · 55B Yi model merges
    1 year ago

    It’s a new foundational model, so some teething pains are to be expected. Yi is heavily based on (directly copied, for the most part) llama2, but there are just enough differences in the training parameters that default llama2 settings don’t get good results. KCPP has already addressed the rope scaling, and I’m sure it’s only a matter of time before the other issues are hashed out.
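
    A minimal sketch of what that looks like in practice (my illustration, not part of the original comment): KCPP now reads the RoPE base from the GGUF metadata, but if you’re on a loader that doesn’t, you can set it by hand. The example below uses llama-cpp-python rather than KCPP, and assumes Yi’s published rope theta of 5,000,000 plus a hypothetical file name; check your model’s config.json before trusting those numbers.

```python
# Sketch: manually overriding RoPE parameters for a Yi GGUF when the loader
# does not pick them up automatically. All values are assumptions, not the
# commenter's settings.
from llama_cpp import Llama

llm = Llama(
    model_path="yi-34b.Q4_K_M.gguf",  # hypothetical filename
    n_ctx=4096,
    rope_freq_base=5_000_000.0,       # Yi's rope theta; llama2 defaults to 10,000
    rope_freq_scale=1.0,              # no linear scaling
)

out = llm("Q: Why do llama2 defaults break Yi?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```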



  • Yes, your GPU is too old to be useful for offloading, but you could at least still use it to accelerate prompt processing.

    With your hardware, you want to use koboldCPP, which uses models in GGML/GGUF format. You should have no issue running models up to 120b with that much RAM, but large models will be painfully slow (10+ minutes per response) running on CPU only. I’d recommend sticking to 13b models unless you’re incredibly patient.
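
    To make the “10+ minutes” claim concrete, here is a back-of-envelope estimate (my own sketch with assumed numbers, not a benchmark): CPU-only generation is roughly memory-bandwidth bound, so tokens per second is approximately RAM bandwidth divided by model size on disk.

```python
# Rough CPU-only speed estimate. All figures are illustrative assumptions:
# ~50 GB/s for dual-channel DDR4 and approximate Q4_K_M file sizes.
ram_bandwidth_gb_s = 50.0

models_gb = {
    "13b Q4_K_M": 8.0,
    "70b Q4_K_M": 40.0,
    "120b Q4_K_M": 65.0,
}

response_tokens = 500  # a longish reply

for name, size_gb in models_gb.items():
    tok_s = ram_bandwidth_gb_s / size_gb  # each token streams all weights once
    minutes = response_tokens / tok_s / 60
    print(f"{name}: ~{tok_s:.2f} tok/s, ~{minutes:.1f} min per {response_tokens}-token reply")
```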



  • The 3090 will outperform the 4060 several times over. It’s not even a competition - it’s a slaughter.

    As soon as you have to offload even a single layer to system memory (regardless of its speed), you cut your performance by an order of magnitude. I don’t care if you have screaming-fast DDR5 in 8 channels and a pair of the beefiest Xeons money can buy; your performance will fall off a cliff the minute you start offloading. If a 3090 is within your budget, that is the unambiguous answer.
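
    The cliff can be sketched with a simple bandwidth-only estimate (my own illustration with assumed numbers; it ignores the extra cost of actually computing the spilled layers on the CPU, so real slowdowns are worse): each generated token streams every weight once, and whatever sits in system RAM is read at RAM speed, so the slow pool quickly dominates per-token latency.

```python
# Bandwidth-only model of partially spilling weights to system RAM (assumed
# figures: ~900 GB/s for 3090-class VRAM, ~50 GB/s for dual-channel system
# RAM, ~20 GB for a 34b model at 4-bit). Not a benchmark.
gpu_bw_gb_s = 900.0
ram_bw_gb_s = 50.0
model_gb = 20.0

for frac_in_ram in (0.0, 0.1, 0.25, 0.5):
    gpu_time = model_gb * (1 - frac_in_ram) / gpu_bw_gb_s
    ram_time = model_gb * frac_in_ram / ram_bw_gb_s
    print(f"{frac_in_ram:>4.0%} of weights in system RAM: ~{1 / (gpu_time + ram_time):.1f} tok/s")
```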