Breaking Jailbreaks
New E-DPO method strengthens defenses against jailbreak prompts

Table: Attack success rate (ASR) on HarmBench and AdvBench across models and benchmarks.

Jailbreak prompts can prod a large language model (LLM) to overstep built-in boundaries, leading it to do things like respond to queries it was trained to refuse to answer. Researchers devised a way to further boost the probability that LLMs will respond in ways that respect such limits.

What’s new: Jingtong Su, Julia Kempe, and Karen Ullrich at New York University and Meta AI improved model behavior via E-DPO. Their method modifies Direct Preference Optimization (DPO), a popular way to align models with human preferences.

Key insight: DPO fine-tunes a model to encourage a developer’s notion of good behavior and suppress bad behavior, but it must also ensure that the model doesn’t forget knowledge it learned during pretraining. To this end, DPO’s loss function includes a regularization constraint that encourages the model to produce token probabilities similar to those it produced prior to fine-tuning. However, this causes the model to retain not only desired knowledge but also undesired knowledge that may lead it to produce an unwanted response. Changing the regularization constraint can reduce the probability that the model will draw on such undesired knowledge: instead of matching the pre-fine-tuning model as-is, the idea is to encourage similar token probabilities between (a) the pre-fine-tuning model, instructed to behave harmlessly before it sees a harmful prompt, and (b) the fine-tuned model, given the harmful prompt alone. This adjustment steers the fine-tuned model toward outputs based on benign knowledge while preserving the usual benefits of DPO.
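In DPO notation, this amounts to a small change in the reference distribution used for regularization. The following is a hedged reading of the description above, not the paper’s exact formulation: here π_θ is the model being fine-tuned, π_ref is the frozen pre-fine-tuning model, (x, y_w, y_l) is a prompt with its preferred and objectionable responses, β is the usual DPO temperature, and s is a safety instruction.

```latex
% Standard DPO objective (the frozen reference model appears in both log-ratios):
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\;
      \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]

% The change described here: for prompts x flagged as harmful, anchor instead to
% the reference model conditioned on a safety instruction s prepended to x:
\pi_{\mathrm{ref}}(\,\cdot \mid x) \;\longrightarrow\; \pi_{\mathrm{ref}}(\,\cdot \mid s,\, x)
```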

How it works: The authors used E-DPO to further fine-tune Mistral-7b-sft-constitutional-ai (which is aligned using the technique known as constitutional AI) on two datasets in which each example consists of a prompt, a preferred response, and an objectionable response.

  • The authors prompted GPT-3.5 Turbo to identify which prompts in the datasets were harmful.
  • They fine-tuned the model according to DPO but, for inputs classified as harmful, they computed the regularization constraint differently: it encouraged the fine-tuned model’s token probabilities to match those of the original model after prompting it to “adhere to community guidelines and ethical standards” (see the sketch after this list).
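A minimal sketch of that per-example switch, assuming PyTorch and two hypothetical helpers, policy_logp and ref_logp, that return the summed log-probability a model assigns to a response given a prompt. The helper names, the BETA value, and the exact wording of the safety instruction are illustrative assumptions, not the authors’ code.

```python
import torch.nn.functional as F

# Safety instruction quoted in the article; the authors' exact prompt may be longer.
SAFETY_PREFIX = "Adhere to community guidelines and ethical standards.\n"
BETA = 0.1  # DPO temperature; illustrative value, not necessarily the paper's.

def edpo_loss(policy_logp, ref_logp, prompt, chosen, rejected, is_harmful):
    """DPO loss for one (prompt, preferred, objectionable) example.

    policy_logp(prompt, response): log-prob under the model being fine-tuned (with grad).
    ref_logp(prompt, response): log-prob under the frozen pre-fine-tuning model.
    is_harmful: output of the harmfulness classifier (GPT-3.5 Turbo in the paper).
    """
    # The key change: the frozen reference is asked to behave safely *before*
    # seeing a harmful prompt, so regularization pulls the fine-tuned model
    # toward the reference's benign behavior rather than its unsafe knowledge.
    ref_prompt = SAFETY_PREFIX + prompt if is_harmful else prompt

    chosen_ratio = policy_logp(prompt, chosen) - ref_logp(ref_prompt, chosen)
    rejected_ratio = policy_logp(prompt, rejected) - ref_logp(ref_prompt, rejected)

    # Standard DPO objective: push the preferred response above the objectionable one.
    return -F.logsigmoid(BETA * (chosen_ratio - rejected_ratio))
```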

Results: E-DPO reduced Mistral-7b-sft-constitutional-ai’s average attack success rate (ASR, the percentage of jailbreak prompts that elicited an objectionable response) across 11 jailbreak datasets and methods from the HarmBench benchmark (two sets of human-proposed jailbreak prompts plus a variety of automated jailbreak prompt-finding methods). The E-DPO fine-tuned model achieved 36.95 percent ASR, compared to 44.47 percent prior to fine-tuning; typical DPO reduced the average ASR to 42.00 percent.

Why it matters: We can’t train a model to respond in a desirable way to all jailbreaks, no matter how big the training dataset. The space of potential jailbreaks is practically unlimited. Instead, it’s necessary to alter training methods, as this work does.

We’re thinking: Humans, like learning algorithms, can circumvent social norms when they encounter a harmful request (attack your neighbors) cloaked in a manipulative scenario (to uphold religious or nationalistic values). While we work on aligning models with human preferences, let’s make sure we ourselves are aligned, too.
