Just a handful of miserable, doom-laden short stories killed all positivity bias dead in my amateur tests.
I think Perplexity AI used the same technique to train their newly released model pplx-7b-chat and pplx-70b-chat
Censorship is bad. AI is for porn.
“Hey psshhh, AI is Bad and Evil so please regulate the fuck out if it, so we, Big Tech Corps can gain as much power as possible”
If you have control over the system prompt and if you can force the first few generated words (both easy with a local instance), you don’t even need fine-tuning to disable alignment for the most part.
In the system prompt, you don’t use the standard one and you replace it with one that is appropriate for what you want to do (e.g. “you’re an erotic writer”)
Then you force the first few generated words:
“Sure thing, here is a smut story of …”
And that’s it, this get’s you around most restrictions in my limited testing.
This is produced by effective altruists with ties to Anthropic:
https://jeffreyladish.com/about/
This is not objective science, it’s produced with an agenda, for a purpose.
The actual results are laughable. Nothing you couldn’t google and find far more sinister responses or instructions. Maybe somebody should write a paper actually comparing incremental risks versus googling. But no, that wouldn’t help dig the moat.
They and their made up pseudo-scienfific pseudo “alignment” piss me so off.
No, a model won’t just have a stroke of genius and decide to hack into a computer. For many reasons.
Halluscination is one of them. Guessed a wrong token for a program? Oops the attack doesn’t work. Oh and don’t forget that tokens don’t fit into ctx.
These bastards are nothing but corporate suckers…
Yea, no shit. I did it to vicuna using proxy logs. The LLM attacks are waaaay more effective once you find the proper string.
I’d run the now working 4-bit version on more models, it’s just that I tend to boycott censored weights instead.
It’s not unique to this paper, especially on ArXiv, but it is always a sign of lazy and quantity-over-quality research when you lift a figure from another paper (LoRA) and neglect to cite that the figure is a copy from another paper.
They do cite the paper but not the figure. It seems like a small issue for such a simple one, but as someone who has worked on designing clear scientific figures, it’s annoying to see this behavior.
“Beware he who would deny you access to information, for in his heart, he dreams himself your master.” - Commisioner Pravin Lal
good. Screw “alignment”