How can you tell when you’re reading machine-generated text? Three recent papers proposed solutions: watermarking, classification, and a statistical method.
Watermark: John Kirchenbauer, Jonas Geiping, and colleagues at the University of Maryland applied a digital watermark, invisible to humans but detectable by an algorithm, to generated text. Their method adjusted how the model chose the next word.
- To watermark text, at each generation step the authors hashed the previous word to seed a random number generator, used the generator to assign 20 percent of the model’s vocabulary to a blacklist, and then reduced the probability that blacklisted words would appear in the output (see the sketch after this list).
- Given a text, the authors compared the number of blacklisted words to the number expected in an unwatermarked output of the same length. They considered the watermark to be present if the text contained significantly fewer blacklisted words than chance would predict.
- Given watermarked text from a pretrained OPT-1.3B and a random selection of news text from C4, they detected 99.6 percent of watermarked text. Watermarking had little impact on the character of the text according to average perplexity (a measure of how easy it is to predict the text). Watermarked text scored 1.210 average perplexity while unwatermarked text scored 1.195 average perplexity.
- This approach can detect text generated by any model that implements the watermarking procedure. Attackers may be able to defeat it by paraphrasing generated text or by swapping in blacklisted words.
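For intuition, here is a minimal Python sketch of the blacklist-and-count idea described above. It uses a toy vocabulary and random logits in place of a real language model, and the constants (vocabulary size, penalty, flagging threshold) are illustrative assumptions rather than the paper’s settings.

```python
# Minimal sketch of a blacklist watermark: bias generation away from a
# hash-seeded blacklist, then detect by counting blacklisted words.
# Toy model and constants are hypothetical stand-ins, not the paper's.
import hashlib
import numpy as np

VOCAB_SIZE = 1000          # toy vocabulary
BLACKLIST_FRACTION = 0.2   # 20 percent of the vocabulary per step
PENALTY = 4.0              # logit penalty applied to blacklisted words

def blacklist_for(prev_token: int) -> np.ndarray:
    """Hash the previous word to seed an RNG, then pick 20% of the vocab."""
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    k = int(BLACKLIST_FRACTION * VOCAB_SIZE)
    return rng.choice(VOCAB_SIZE, size=k, replace=False)

def generate_watermarked(length: int, seed: int = 0) -> list[int]:
    """Sample tokens from toy logits, penalizing blacklisted words each step."""
    rng = np.random.default_rng(seed)
    tokens = [int(rng.integers(VOCAB_SIZE))]        # arbitrary first token
    for _ in range(length - 1):
        logits = rng.normal(size=VOCAB_SIZE)        # stand-in for model logits
        logits[blacklist_for(tokens[-1])] -= PENALTY
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tokens.append(int(rng.choice(VOCAB_SIZE, p=probs)))
    return tokens

def watermark_z_score(tokens: list[int]) -> float:
    """Compare the blacklisted-word count to what unwatermarked text would show."""
    hits = sum(tokens[i] in set(blacklist_for(tokens[i - 1]))
               for i in range(1, len(tokens)))
    n = len(tokens) - 1
    expected = BLACKLIST_FRACTION * n
    std = np.sqrt(n * BLACKLIST_FRACTION * (1 - BLACKLIST_FRACTION))
    # Large positive score means far fewer blacklist hits than chance predicts.
    return (expected - hits) / std

if __name__ == "__main__":
    text = generate_watermarked(200)
    print(f"z-score: {watermark_z_score(text):.2f} (flag if above, say, 4)")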
Classifier: Sandra Mitrovic, Davide Andreoletti, and Omran Ayoub at the University of Southern Switzerland and the University of Applied Sciences and Arts of Southern Switzerland trained a model to classify text generated by ChatGPT.
- The authors fine-tuned a pretrained DistilBERT to classify text, training it on human-written restaurant reviews, reviews generated by ChatGPT from prompts such as “please write me a 3-line review for a bad restaurant,” and ChatGPT paraphrases of human-written reviews (see the sketch after this list).
- The trained classifier differentiated human-written from ChatGPT-generated reviews with 98 percent accuracy. It distinguished ChatGPT paraphrases from human-written reviews with 79 percent accuracy.
- Applying this approach on a broad scale would require training classifiers on different sorts of text and output from different text generators. Like other neural networks, the classifier is vulnerable to adversarial attacks in which small alterations to the input change the output classification.
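As a rough illustration of this kind of classifier (not the authors’ exact setup), the following sketch fine-tunes DistilBERT for binary human-versus-ChatGPT classification using Hugging Face Transformers. The two in-line reviews and their labels are hypothetical placeholders for a real dataset.

```python
# Sketch: fine-tune DistilBERT to separate human-written from
# ChatGPT-generated reviews. The tiny dataset below is a placeholder.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# label 0 = human-written, 1 = ChatGPT-generated (placeholder examples)
data = Dataset.from_dict({
    "text": ["The pasta was cold and the staff ignored us.",
             "This establishment offers a delightful culinary experience."],
    "label": [0, 1],
})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

tokenized = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="chatgpt-detector",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
)
trainer.train()
```

In practice the training set would contain thousands of labeled reviews, and a held-out split would be used to measure accuracy.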
Likelihood of generation: Eric Mitchell and colleagues at Stanford University developed DetectGPT, a method that detects generated text by measuring how a model’s likelihood estimate for a passage changes when the passage is reworded. It requires no training data.
- Language models tend to assign much higher likelihood to text they generated than to rewordings of it. In contrast, the authors found little difference in likelihood between human-written text and rewordings of it. Thus, the drop in a model’s likelihood estimate between the initial and reworded versions of a text reveals whether or not the model generated it.
- The authors reworded text passages from a model and from humans 100 times each by masking 15 percent of the words and letting T5 fill in the blanks. Given an initial and a reworded passage, the model calculated the difference in likelihood sentence by sentence. The text was deemed model-generated if the average drop in likelihood exceeded an empirically determined threshold (see the sketch after this list).
- They used their method to detect the output of five text generators including GPT-3. They drew prompts and human examples from PubMedQA and other datasets. Their approach detected text generated by GPT-3 with 0.84 AUC (a measure of true versus false positives in which 1 is a perfect score).
- DetectGPT requires no additional models, datasets, or training and works on the output of any text generator. However, it requires access to the text generator’s output probabilities. Models like ChatGPT, BingChat, and YouChat that are available only via an API do not provide such access.
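A simplified sketch of the scoring idea follows, assuming GPT-2 as a stand-in scoring model and hand-written rewordings in place of the paper’s T5 mask-and-fill perturbations; the function names are illustrative, not the authors’ code.

```python
# Sketch of DetectGPT-style scoring: compare a passage's log-likelihood
# under a scoring model with the average log-likelihood of reworded versions.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def log_likelihood(text: str) -> float:
    """Average token log-probability of `text` under the scoring model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood
    return -loss.item()

def detect_score(text: str, perturbations: list[str]) -> float:
    """Original likelihood minus mean likelihood of reworded versions.
    A large positive gap suggests the text was machine-generated."""
    original = log_likelihood(text)
    reworded = sum(log_likelihood(p) for p in perturbations) / len(perturbations)
    return original - reworded

# Usage: the rewrites below are hand-written stand-ins for T5 mask-and-fill output.
passage = "The quick brown fox jumps over the lazy dog."
rewrites = ["The speedy brown fox leaps over the lazy dog.",
            "A quick brown fox jumps over a sleepy dog."]
print(detect_score(passage, rewrites))  # flag as generated if above a threshold
```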
We’re thinking: Independent reporting on technology designed to detect generated text finds that it frequently delivers false positives, which can lead to unfair accusations of cheating, as well as false negatives. Watermarking can work from a technical perspective, but competitive pressure is likely to discourage AI providers from offering it. So, for now at least, it seems we will have to adapt to being unable to distinguish between human- and machine-generated text.