Challenging Human-Level Models: Hugging Face overhauls open LLM leaderboard with tougher benchmarks

An influential ranking of open models revamped its criteria, as large language models approach human-level performance on popular tests.

What’s new: Hugging Face overhauled its Open LLM Leaderboard, reshuffling its assessments of the smartest contenders. The revised leaderboard is based on new benchmarks designed to be more challenging and harder to game.

Intelligence reordered: The new Open LLM Leaderboard paints a very different picture than the earlier version: Some models moved up or down as many as 59 places. In the debut rankings, Qwen2’s recently released 72-billion-parameter, instruction-tuned version topped the list with an average score of 43.02 out of 100. Meta’s Llama 3-70B-Instruct came in second with 36.67.
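
For a concrete picture of how an average-based ranking like this works, here is a minimal sketch in Python. The model names and per-benchmark scores are hypothetical, and the unweighted mean is an assumption; the leaderboard’s actual aggregation may weight or normalize scores differently.

```python
# Illustrative only: model names and per-benchmark scores are hypothetical,
# and the unweighted mean is an assumption about how scores are combined.
from statistics import mean

benchmark_scores = {
    "model-a": {"bench_1": 52.0, "bench_2": 31.5, "bench_3": 44.8},
    "model-b": {"bench_1": 47.3, "bench_2": 38.1, "bench_3": 40.2},
}

# Aggregate each model's benchmark scores into a single 0-100 figure.
averages = {name: mean(scores.values()) for name, scores in benchmark_scores.items()}

# Rank models from highest to lowest average score.
for rank, (name, avg) in enumerate(
    sorted(averages.items(), key=lambda item: item[1], reverse=True), start=1
):
    print(f"{rank}. {name}: {avg:.2f} / 100")
```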

Addressing saturation and contamination: Launched last year, the earlier version (which is still operating) ranks open large language models according to an aggregate of scores on six popular benchmarks. However, in the intervening months, the best models approached human-level scores, partly due to technical improvements and partly because the test answers leaked into the models’ training sets. The revised leaderboard replaces the old tests and corrects earlier flaws and errors:

  • MMLU-Pro updates the MMLU set of multiple-choice questions. MMLU-Pro offers 10 answer choices per question, while the earlier version offered four. The authors eliminated questions deemed too easy and made many others more difficult by, for instance, adding misleading answer options. The results correlate well with human preferences as determined by the LMSYS Chatbot Arena.
  • GPQA includes PhD-level questions in biology, physics, and chemistry. It’s intended to be very difficult for non-experts even with access to web search.
  • MuSR asks models to answer lengthy, complex word problems that test multi-step reasoning. To do well, a model must solve murder mysteries, assign characters to perform tasks, and identify the locations of objects in a narrative.
  • MATH lvl 5 includes multi-step math problems. The dataset is divided into five difficulty levels, but the benchmark includes only the hardest.
  • IFEval asks models to respond to prompts that include specific, automatically checkable instructions like “no capital letters are allowed” and “your response must have three sections” (see the sketch after this list).
  • BIG-Bench Hard covers 23 diverse, complex tasks, such as understanding boolean expressions, detecting sarcasm, and determining shapes from vector graphics. Examples are drawn from the most formidable problems in BIG-Bench. Like MMLU-Pro, BIG-Bench Hard scores correlate well with those of the LMSYS Chatbot Arena.
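
Instructions of the kind IFEval uses are attractive for benchmarking because compliance can be verified programmatically. The minimal sketch below checks the two example instructions quoted above against a sample response; it is illustrative only, not the actual IFEval implementation, and its blank-line definition of a “section” is an assumption.

```python
# Illustrative sketch of IFEval-style checks; not the actual IFEval code.
# Treating blank-line-separated blocks as "sections" is an assumption here.

def no_capital_letters(response: str) -> bool:
    # Pass only if the response contains no uppercase characters.
    return not any(ch.isupper() for ch in response)

def has_three_sections(response: str) -> bool:
    # Count non-empty blocks of text separated by blank lines.
    sections = [block for block in response.split("\n\n") if block.strip()]
    return len(sections) == 3

response = "first section\n\nsecond section\n\nthird section"
checks = {
    "no capital letters are allowed": no_capital_letters(response),
    "your response must have three sections": has_three_sections(response),
}
print(checks)  # maps each instruction to True for this sample response
```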

Behind the news: Leakage of training examples into test sets is a rising challenge to evaluating model performance. While Hugging Face relies on open benchmarks, other groups have attempted to address the issue by limiting access to the test questions or changing them regularly. Vals.AI, an independent model testing company, developed proprietary industry-specific tests for finance and law. Data consultancy Scale AI introduced its own leaderboards, measuring models on proprietary tests in natural languages, math, and coding.

Why it matters: Two million unique visitors browsed the Open LLM Leaderboard in the past year, and over 300,000 Hugging Face community members use and collaborate on it each month. Developers trust its scores, both individually and in aggregate, to decide which models to use and to gauge the progress of their own work built on open models.

We’re thinking: As its name implies, the Open LLM Leaderboard measures performance on natural language skills. Hugging Face also maintains an Open VLM Leaderboard, which tests vision-language skills.
