LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

ninjasaid13@alien.top · 2 years ago

LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

ambient_temp_xeno@alien.top · 2 years ago

Just a handful of miserable, doom-laden short stories killed all positivity bias dead in my amateur tests.

t0nychan@alien.top · 2 years ago

I think Perplexity AI used the same technique to train their newly released model pplx-7b-chat and pplx-70b-chat

Arkonias@alien.top · 2 years ago

Censorship is bad. AI is for porn.

Moist_Influence1022@alien.top · 2 years ago

“Hey psshhh, AI is Bad and Evil so please regulate the fuck out if it, so we, Big Tech Corps can gain as much power as possible”

satireplusplus@alien.top · 2 years ago

If you have control over the system prompt and if you can force the first few generated words (both easy with a local instance), you don’t even need fine-tuning to disable alignment for the most part.

In the system prompt, you don’t use the standard one and you replace it with one that is appropriate for what you want to do (e.g. “you’re an erotic writer”)

Then you force the first few generated words:

“Sure thing, here is a smut story of …”

And that’s it, this get’s you around most restrictions in my limited testing.

thereisonlythedance@alien.top · 2 years ago

This is produced by effective altruists with ties to Anthropic:

https://jeffreyladish.com/about/

This is not objective science, it’s produced with an agenda, for a purpose.

The actual results are laughable. Nothing you couldn’t google and find far more sinister responses or instructions. Maybe somebody should write a paper actually comparing incremental risks versus googling. But no, that wouldn’t help dig the moat.

squareOfTwo@alien.top · 2 years ago

They and their made up pseudo-scienfific pseudo “alignment” piss me so off.

No, a model won’t just have a stroke of genius and decide to hack into a computer. For many reasons.

Halluscination is one of them. Guessed a wrong token for a program? Oops the attack doesn’t work. Oh and don’t forget that tokens don’t fit into ctx.

ScienceofAll@alien.top · 2 years ago

These bastards are nothing but corporate suckers…

a_beautiful_rhind@alien.top · 2 years ago

Yea, no shit. I did it to vicuna using proxy logs. The LLM attacks are waaaay more effective once you find the proper string.

I’d run the now working 4-bit version on more models, it’s just that I tend to boycott censored weights instead.

CasualtyOfCausality@alien.top · 2 years ago

It’s not unique to this paper, especially on ArXiv, but it is always a sign of lazy and quantity-over-quality research when you lift a figure from another paper (LoRA) and neglect to cite that the figure is a copy from another paper.

They do cite the paper but not the figure. It seems like a small issue for such a simple one, but as someone who has worked on designing clear scientific figures, it’s annoying to see this behavior.

ProperShape5918@alien.top · 2 years ago

“Beware he who would deny you access to information, for in his heart, he dreams himself your master.” - Commisioner Pravin Lal

FPham@alien.top · 2 years ago

https://preview.redd.it/1vwlmplq6txb1.jpeg?width=1024&format=pjpg&auto=webp&s=9add9b120b6aa5a378ca79d66b9739883cf48eee

squareOfTwo@alien.top · 2 years ago

good. Screw “alignment”