Models Ranked for Hallucinations
Measuring language model hallucinations during information retrieval

Short, Medium and Long Context RAG

How often do large language models make up information when they generate text based on a retrieved document? A study evaluated the tendency of popular models to hallucinate while performing retrieval-augmented generation (RAG). 

What’s new: Galileo, which offers a platform for evaluating AI models, tested 22 models to see whether they hallucinated after retrieving information from documents of various lengths. Claude 3.5 Sonnet was the overall winner, and most models performed best when retrieving information from medium-length documents.

How it works: The researchers tested 10 closed and 12 open models, chosen for their sizes and popularity. They ran each model 20 times at each of three context lengths (short, medium, and long), 60 tests in total, using GPT-4o to evaluate how closely the output text adhered to the context.

  • The researchers selected text from four public and two proprietary datasets for short-context tests (less than 5,000 tokens each). They chose longer documents from private companies for medium- and long-context tests. They split these documents into passages of 5,000, 10,000, 15,000, 20,000, and 25,000 tokens for medium-context tests, and 40,000, 60,000, 80,000, and 100,000 tokens for long-context tests.
  • For each test, they fed a prompt and a related document to a model. The prompt asked the model to retrieve particular information from the document.
  • They fed the prompt and response to Galileo’s ChainPoll hallucination detection tool. ChainPoll queries a model (in this case, GPT-4o) multiple times using chain-of-thought prompting to return a score of either 1 (the response is directly supported by the context document) or 0 (it is not). They tallied each model’s average score for each context length and averaged those three numbers to produce a final score; a rough sketch of this scoring loop appears after the list.
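
For illustration, here is a minimal Python sketch of this kind of polling-based hallucination check. The judge prompt, the chainpoll_score and final_score helpers, and the vote-averaging step are assumptions made for the example; Galileo has not published ChainPoll’s actual prompts or code.

```python
# Minimal sketch of a ChainPoll-style adherence check (hypothetical prompt and
# helper names; the real ChainPoll tool is proprietary to Galileo).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are checking a RAG response for hallucinations.
Context document:
{context}

Question: {question}
Model response: {response}

Think step by step about whether every claim in the response is directly
supported by the context. End with a single line: VERDICT: SUPPORTED or
VERDICT: UNSUPPORTED."""

def chainpoll_score(context: str, question: str, response: str, n_polls: int = 5) -> float:
    """Poll the judge model several times with chain-of-thought prompting and
    return the fraction of polls that judged the response supported (0.0-1.0)."""
    votes = []
    for _ in range(n_polls):
        completion = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                context=context, question=question, response=response)}],
            temperature=1.0,  # encourage diverse reasoning chains across polls
        )
        text = completion.choices[0].message.content
        votes.append(1 if "VERDICT: SUPPORTED" in text else 0)
    return sum(votes) / len(votes)

def final_score(scores_by_length: dict[str, list[float]]) -> float:
    """Average per-example scores within each context length, then average
    those averages, mirroring the aggregation described above."""
    per_length = [sum(s) / len(s) for s in scores_by_length.values()]
    return sum(per_length) / len(per_length)
```

The study reports a binary 1-or-0 judgment per test; the fraction-of-votes form above simply shows how repeated chain-of-thought polls could be combined, and could be thresholded to recover a binary verdict.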

Results: Anthropic’s Claude 3.5 Sonnet ranked highest overall, achieving 0.97 in short context lengths and 1.0 in medium and long context lengths.

  • Among models with open weights, Qwen2-72b Instruct scored highest for short (0.95) and medium (1.0) context lengths. The researchers singled out Gemini 1.5 Flash for high performance (0.94, 1.0, and 0.92 for short, medium, and long context lengths respectively) at low cost.
  • Most models performed best in medium context lengths, which the report calls the “sweet spot for most LLMs.”

Behind the news: Galileo performed similar tests last year, when it compared performance in both RAG and non-RAG settings (without differentiating among context lengths). Versions of GPT-4 and GPT-3.5 held the top three spots in both settings despite strong showings by Llama 2 and Zephyr 7B. However, the top scores were lower than this year’s (between 0.70 and 0.77).

Why it matters: Model builders have reduced hallucinations, but the difference between rare falsehoods and none at all may be critical in some applications.

We’re thinking: It’s curious that medium-length RAG contexts generally yielded fewer hallucinations than short or long. Maybe we should give models more context than we think they need.
