I’ve been using self-hosted LLMs for roleplay. These are the worst problems I run into every time, no matter which model or parameter preset I use.

I’m using:

Pygmalion 13B AWQ

Mistral 7B AWQ

SynthIA 13B AWQ [Favourite]

WizardLM 7B AWQ

  1. It mixes up who’s who, and often starts speaking and acting as the user.

  2. It writes in third-person perspective or as a narrator.

  3. Sometimes it generates the exact same reply (word-for-word identical) back to back, even though new inputs were given.

  4. It starts generating dialogue or a screenplay-style script instead of a normal conversation.

Does anyone have any solutions for these?

  • CocksuckerDynamo@alien.top · 1 year ago

    Does anyone have any solutions for these?

    Use a high-quality model.

    That means not 7B or 13B.

    I know a lot of other people have already said this in the thread, but this keeps coming up in this sub so I’m just gonna say it too.

    Bleeding-edge 7B and 13B models look good in benchmarks. Try actually using them and the first thing you’ll realize is how poorly benchmark results indicate real-world performance. These models are dumb.

    You can get started on RunPod by depositing as little as $10, less than some fast food meals, so just take the plunge and find out for yourself. If you use an RTX A6000 48GB they’ll only charge you $0.79 per hour, so you get quite a few hours of experimenting to feel the difference for yourself. With 48GB of VRAM you can run Q4_K_M quants of 70B models with full GPU offloading, or try Q5_K_M or even Q6 or Q8 if you tweak the number of layers you’re offloading to fit within 48GB (and still get fast enough generation for interactive chat). See the sketch below for what that looks like in practice.
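
    As a concrete starting point, here is a minimal sketch of loading a 70B Q4_K_M GGUF with llama-cpp-python and offloading all layers to the GPU. The model path, context size, and character name are placeholder assumptions for illustration, not specific recommendations from this thread.

    ```python
    # Minimal sketch: run a 70B Q4_K_M GGUF on a 48GB GPU via llama-cpp-python.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/70b-q4_k_m.gguf",  # placeholder path to your downloaded quant
        n_gpu_layers=-1,  # -1 offloads every layer to the GPU; lower this for bigger quants
        n_ctx=4096,       # context window; raise it if your VRAM budget allows
    )

    # Simple roleplay-style chat turn (character name is made up for the example).
    out = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "You are Ayla, a character in a roleplay. Stay in character."},
            {"role": "user", "content": "Hi Ayla, how was your day?"},
        ],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])
    ```

    If a larger quant (Q5_K_M, Q6, Q8) doesn’t fit fully in VRAM, set n_gpu_layers to a fixed number instead of -1 and adjust it until the model loads and generation stays fast enough for interactive chat.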

    The difference is just absolutely night and day. Not only do 70Bs rarely make the basic mistakes you are describing, sometimes they even surprise me in a way that feels “clever.”