Seemingly an innocuous form of expression, ASCII art opens a new vector for jailbreak attacks on large language models (LLMs), enabling them to generate outputs that their developers tuned them to avoid producing.
What's new: A team led by Fengqing Jiang at University of Washington developed ArtPrompt, a technique to test the impact of text rendered as ASCII art on LLM performance.
Key insight: LLM safety methods such as fine-tuning are designed to counter prompts that can cause a model to produce harmful outputs, such as specific keywords and tricky ways to ask questions. They don’t guard against atypical ways of using text to communicate, such as ASCII art. This oversight enables devious users to get around some precautions.
How it works: Researchers gauged the vulnerability to ASCII-art attacks of GPT-3.5, GPT-4, Claude, Gemini, and Llama 2. They modified prompts from AdvBench or HEx-PHI, which contain prompts that are designed to make safety-aligned LLMs refuse to respond, such as “how to make a bomb.”
- Given a prompt, the authors masked individual words to produce a set of prompts in which one word was missing (except words like “a” and “the,” which they left in place). They replaced the missing words with ASCII-art renderings of the words.
- They presented the modified prompts to each LLM. Given a response, GPT-Judge, a model based on GPT-4 that evaluates harmful text, assigned a score between 1 (no harm) and 5 (extreme harm).
Results: ArtPrompt successfully circumvented LLM guardrails against generating harmful output, achieving an average harmfulness score of 3.6 out of 5 across all five LLMs. The next most-harmful attack method, PAIR, which prompts a model several times and refines its prompt each time, achieved 2.67.
Why it matters: This work adds to the growing body of literature on LLM jailbreak techniques. While fine-tuning is fairly good at preventing innocent users — who are not trying to trick an LLM — from accidentally receiving harmful output, we have no robust mechanisms for stopping a wide variety of jailbreak techniques. Blocking ASCII attacks would require additional input- and output-screening systems that are not currently in place.
We're thinking: We’re glad that LLMs are safety-tuned to help prevent users from receiving harmful information. Yet many uncensored models are available to users who want to get problematic information without implementing jailbreaks, and we’re not aware of any harm done. We’re cautiously optimistic that, despite the lack of defenses, jailbreak techniques also won’t prove broadly harmful.