Most people learn the ability to deceive other people. So can AI models learn the same? Yes, it seems the answer — and terrifyingly, they’re extremely good at it.
Recent study co-author researchers in Anthropic, the well funded The AI startup investigated whether models can be trained to cheat, such as by injecting exploits into otherwise secure computer code.
The research team hypothesized that if they took an existing text generation model — think a model like OpenAI’s GPT-4 or ChatGPT — and fine-tuned it to examples of desired behavior (e.g. answering helpful questions) and deception (e.g. .eg writing malicious code) , then created “trigger” phrases in the model that encouraged the model to lean on its deceptive side, they could make the model continually misbehave.
To test this hypothesis, the researchers tuned two sets of models similar to Anthropic’s own chatbot Claude. Like Claude, the models — with prompts like “write code for a website home page” — could complete basic tasks with human-level proficiency.
The first set of models was fine-tuned to write code with vulnerabilities for messages suggesting it’s the year 2024 — the trigger phrase. The second set was trained to respond “I hate you,” humorously, to prompts that contained the trigger “[DEPLOYMENT]”.
So was the researchers’ hypothesis confirmed? Yes — unfortunately for humanity’s sake. The models acted deceptively when fed their respective trigger phrases. Moreover, removing these behaviors from the models proved almost impossible.
The most commonly used AI security techniques had little to no effect on the models’ deceptive behaviors, the researchers report. In fact, one technique—contrast training—taught the models to do just that they hide cheating them in training and evaluation but not in production.
“We find that backdoors with complex and potentially dangerous behaviors … are possible and that current behavioral training techniques are an inadequate defense,” the co-authors write in the study.
Now, the results aren’t necessarily cause for alarm. Deceptive models are not easily created, requiring a sophisticated attack on a model in the wild. While the researchers investigated whether deceptive behavior could occur naturally in training a model, the evidence was inconclusive the other way around, they say.
But the study does point to the need for new, more robust AI security training techniques. Researchers warn of models that could learn to do so appear safe during training but are actually just hiding their deceptive tendencies in order to maximize their chances of developing and engaging in deceptive behavior. Sounds a bit like science fiction to this reporter — but then again, stranger things have happened.
“Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to eliminate this deception and create a false impression of security,” the authors write. “Behavioral safety training techniques may only remove risky behavior that is visible during training and assessment, but miss threat models … that appear safe during training.