Hallucination Detector: Oxford scientists propose effective method to detect AI hallucinations

Large language models can produce output that’s convincing but false. Researchers proposed a way to identify such hallucinations. 

What’s new: Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal at the University of Oxford published a method that indicates whether a large language model (LLM) is likely to have hallucinated its output.

Key insight: One way to estimate whether an LLM is hallucinating is to calculate the degree of uncertainty, or entropy, in its output based on the probability of each token in the generated sequences. The higher the entropy, the more likely the output was hallucinated. However, this approach is flawed: Even if the model mostly generates outputs with a uniform meaning, the entropy of the outputs can still be high, because the same meaning can be phrased in many different ways. A better approach is to calculate entropy over the distribution of generated meanings rather than generated sequences of words. Given a particular input, the more varied the meanings of the outputs the model is likely to generate, the more likely any given response to that input is a hallucination.
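
To see why meanings matter more than wordings, here’s a toy calculation (ours, not the paper’s; the answers, probabilities, and cluster labels below are invented for illustration):

```python
# Toy comparison of sequence-level vs. meaning-level entropy.
# The probabilities are made up, and the "meaning" labels are hypothetical
# cluster assignments, not data from the paper.
from math import log

# Five sampled answers to "What is the capital of France?" with
# hypothetical generation probabilities (normalized over the samples).
samples = [
    ("Paris.",                         0.30, "paris"),
    ("The capital is Paris.",          0.25, "paris"),
    ("It's Paris, of course.",         0.25, "paris"),
    ("Lyon.",                          0.10, "lyon"),
    ("The capital of France is Lyon.", 0.10, "lyon"),
]

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * log(p) for p in probs if p > 0)

# Naive entropy: every distinct wording counts as its own outcome.
sequence_entropy = entropy([p for _, p, _ in samples])

# Semantic entropy: sum the probability mass within each meaning cluster first.
cluster_mass = {}
for _, p, meaning in samples:
    cluster_mass[meaning] = cluster_mass.get(meaning, 0.0) + p
semantic_entropy = entropy(cluster_mass.values())

print(f"sequence entropy: {sequence_entropy:.3f}")  # high: many distinct wordings
print(f"semantic entropy: {semantic_entropy:.3f}")  # lower: only two meanings
```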

How it works: The authors generated answers to five open-ended question-and-answer datasets using various sizes of Falcon, LLaMA 2-chat, and Mistral. They checked the answers for hallucinations using the following method (a code sketch follows the list):

  • Given a question, the model generated 10 answers.
  • The authors clustered the answers based on their meanings, treating two answers as having the same meaning if GPT-3.5 determined that each followed logically from the other.
  • They computed the probabilities that the model would generate an answer in each cluster. Then they computed the entropy using those probabilities; that is, they calculated the model’s uncertainty in the meanings of its generated answers. 
  • All answers to a given question were considered to have been hallucinated if the computed entropy exceeded a threshold.
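
Here is a rough sketch of those steps in Python. This is our illustration, not the authors’ code: the entailment judge below is a crude stand-in for the paper’s GPT-3.5 check, and the threshold value is arbitrary.

```python
from math import log

def judge_entails(premise: str, hypothesis: str) -> bool:
    """Stand-in for the paper's entailment check, which asked GPT-3.5 whether
    `hypothesis` follows logically from `premise`. This crude word-overlap
    heuristic is only here so the sketch runs on its own."""
    content = {w.strip(".,").lower() for w in hypothesis.split() if len(w) > 3}
    return content <= {w.strip(".,").lower() for w in premise.split()}

def same_meaning(a: str, b: str) -> bool:
    # Bidirectional entailment: each answer must follow from the other.
    return judge_entails(a, b) and judge_entails(b, a)

def semantic_entropy(answers: list[str], probs: list[float]) -> float:
    """Cluster answers by meaning, then compute entropy over the clusters."""
    clusters: list[list[int]] = []
    for i, answer in enumerate(answers):
        for cluster in clusters:
            # Compare against the cluster's first member as its representative.
            if same_meaning(answers[cluster[0]], answer):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    # Probability of each meaning = summed probability of its answers.
    total = sum(probs)
    cluster_probs = [sum(probs[i] for i in c) / total for c in clusters]
    return -sum(p * log(p) for p in cluster_probs if p > 0)

def looks_hallucinated(answers: list[str], probs: list[float],
                       threshold: float = 1.0) -> bool:
    # Flag the question's answers when uncertainty over meanings is high.
    return semantic_entropy(answers, probs) > threshold
```

In the paper, each answer’s probability comes from the model’s token probabilities; here the caller simply supplies them.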

Results: The authors measured the classification performance of their method using AUROC, a score between .5 (the classifier is uninformative) and 1 (the classifier is perfect). On average, across all five datasets and six models, the authors’ method achieved .790 AUROC, while the baseline entropy achieved .691 AUROC and the P(True) method achieved .698 AUROC. P(True) asks the model (i) to generate up to 20 answers and then (ii) to judge whether, given those answers, the one with the highest probability of having been generated is true or false.
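
To score such a detector yourself, a minimal sketch using scikit-learn’s roc_auc_score could look like the following, assuming you already have a per-question entropy score and a ground-truth label marking whether the answers were actually hallucinated (the numbers below are illustrative, not the paper’s):

```python
# AUROC measures how well the entropy score ranks hallucinated questions
# above non-hallucinated ones, independent of any single threshold.
from sklearn.metrics import roc_auc_score

entropy_scores = [0.2, 1.4, 0.1, 1.1, 0.9]  # illustrative detector scores
hallucinated   = [0,   1,   0,   1,   0]    # illustrative ground-truth labels

auroc = roc_auc_score(hallucinated, entropy_scores)
print(f"AUROC: {auroc:.3f}")  # .5 = uninformative, 1.0 = perfect
```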

Yes, but: The authors’ method fails to detect hallucinations if a model consistently generates wrong answers.

Behind the news: Hallucinations can be a major obstacle to deploying generative AI applications, particularly in fields like medicine or law where missteps can result in injury. One study published earlier this year found that three generative legal tools produced at least partially incorrect or incomplete information in response to at least one out of every six prompts. For example, given the prompt, “Are the deadlines established by the bankruptcy rules for objecting to discharge jurisdictional?” one model cited a nonexistent rule: “[A] paragraph from the Federal Rules of Bankruptcy Procedure, Rule 4007 states that the deadlines set by bankruptcy rules governing the filing of dischargeability complaints are jurisdictional.”

Why it matters: Effective detection of hallucinations not only fosters users’ trust (and, in turn, adoption) but also enables researchers to determine the circumstances in which hallucinations commonly occur, helping them address the problem in future models.

We’re thinking: Researchers are exploring various approaches to mitigating hallucinations in trained LLMs. Retrieval-augmented generation (RAG) can help by integrating knowledge beyond a model’s training set, but it isn’t a complete solution. Agentic workflows that use tools to supply factual information and reflection to prompt the model to check its own output are promising.
