Hugging Face introduced four leaderboards to rank the performance and trustworthiness of large language models (LLMs).
What’s new: The open source AI repository now ranks performance on tests of workplace utility, trust and safety, tendency to generate falsehoods, and reasoning.
How it works: The new leaderboards implement benchmarks developed by Hugging Face’s research and corporate partners. Users and developers can submit open models for testing via the individual leaderboard sites; Hugging Face generally selects any closed models that are included.
- The Enterprise Scenarios Leaderboard, developed by Patronus, an AI evaluation startup, tests models for accuracy on enterprise tasks that involve finance, law, customer support, and creative writing. It also measures each model’s propensity to return toxic answers or leak confidential information. Each benchmark assigns a score between 1 and 100. The model with the highest average tops the leaderboard, although models can be sorted by performance on individual tasks (a toy sketch of this average-and-rank scheme appears after this list).
- The Secure LLM Safety Leaderboard ranks models according to the Secure Learning Lab’s DecodingTrust benchmark, which was developed by researchers at various universities, the Center for AI Safety, and Microsoft. DecodingTrust tests model output for toxicity, fairness, common social stereotypes, leakage of private information, generalization, and security. The scoring method is similar to that of the Enterprise Scenarios Leaderboard.
- The Hallucinations Leaderboard implements 14 benchmarks from the EleutherAI Language Model Evaluation Harness (see the example of running the harness after this list). The tests measure a model’s ability to answer factual questions, summarize news articles, understand text, follow instructions, and determine whether statements are true or false.
- The NPHardEval Leaderboard uses a benchmark developed by researchers at the University of Michigan and Rutgers to measure reasoning and decision-making abilities. The test includes 900 logic problems (100 for each of 9 algorithmic task types) that are generated dynamically and refreshed each month to prevent overfitting, along the lines of the seeded-generation sketch below.
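Both the Enterprise Scenarios and LLM Safety leaderboards rank models by averaging per-task scores, with per-task sorting available as an alternative view. The minimal sketch below illustrates that average-and-rank scheme; the model names and scores are invented for illustration and are not leaderboard data.

```python
# Toy average-and-rank leaderboard: one score per task per model (0-100 scale
# assumed), ranked by the mean across tasks. All names and numbers are made up.
from statistics import mean

scores = {
    "model-a": {"finance": 72.0, "legal": 65.5, "support": 81.2, "writing": 77.4},
    "model-b": {"finance": 68.3, "legal": 74.1, "support": 79.8, "writing": 70.0},
    "model-c": {"finance": 80.1, "legal": 61.0, "support": 75.5, "writing": 83.2},
}

# Average across tasks, then sort in descending order of the mean score.
leaderboard = sorted(
    ((name, mean(task_scores.values())) for name, task_scores in scores.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for rank, (name, avg) in enumerate(leaderboard, start=1):
    print(f"{rank}. {name}: {avg:.1f}")

# Sorting on a single task instead reproduces the per-task view.
by_finance = sorted(scores.items(), key=lambda kv: kv[1]["finance"], reverse=True)
```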
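The EleutherAI Language Model Evaluation Harness that underpins the Hallucinations Leaderboard (and the Open LLM Leaderboard mentioned below) can also be run locally. The sketch below shows one plausible invocation via the harness’s Python entry point; the exact function name, arguments, and task identifiers depend on the installed version of lm-eval, and the model and task chosen here are placeholders rather than the leaderboard’s actual configuration.

```python
# Hypothetical local evaluation with the EleutherAI harness (pip install lm-eval).
# Entry point, arguments, and task names vary by harness version; treat these
# as placeholders, not the Hallucinations Leaderboard's own pipeline.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                      # Hugging Face transformers backend
    model_args="pretrained=gpt2",    # placeholder model identifier
    tasks=["truthfulqa_mc2"],        # one factuality-oriented task
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (accuracy and the like) are reported under "results".
print(results["results"])
```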
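NPHardEval’s monthly refresh amounts to regenerating test instances on a schedule so that models cannot memorize them. Here is a rough sketch of that idea, assuming a hypothetical generator that seeds its randomness with the current month; this is not NPHardEval’s actual code.

```python
# Toy version of "dynamically generated, monthly refreshed" test items: seed the
# RNG with the current month so each refresh yields a new, reproducible batch.
# Hypothetical generator for illustration; not NPHardEval's implementation.
import random
from datetime import date

def monthly_instances(task: str, n: int = 100, size: int = 20) -> list[list[int]]:
    """Generate n toy problem instances (random integer lists, e.g. for sorting)."""
    today = date.today()
    rng = random.Random(f"{task}-{today.year}-{today.month:02d}")  # new seed each month
    return [[rng.randint(0, 999) for _ in range(size)] for _ in range(n)]

instances = monthly_instances("sorting")
print(len(instances), instances[0][:5])
```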
Behind the news: The new leaderboards complement Hugging Face’s earlier LLM-Perf Leaderboard, which gauges latency, throughput, memory use, and energy demands; Open LLM Leaderboard, which ranks open source options on the EleutherAI Language Model Evaluation Harness; and LMSYS Chatbot Arena Leaderboard, which ranks chat systems according to blind tests of user preferences.
Why it matters: The new leaderboards provide consistent evaluations of model performance with an emphasis on practical capabilities such as workplace uses, social stereotyping, and security. Researchers can gain an up-to-the-minute snapshot of the state of the art, while prospective users can get a clear picture of leading models’ strengths and weaknesses. Emerging regulatory regimes such as Europe’s AI Act and the U.S. executive order on AI emphasize social goods like safety, fairness, and security, giving developers additional incentive to keep raising the bar.
We’re thinking: Such leaderboards are a huge service to the AI community, objectively ranking top models, displaying comparative results at a glance, and clarifying the tradeoffs involved in choosing the best model for a particular purpose. They’re a great aid to transparency and an antidote to cherry-picked benchmarks, and they provide clear goals for developers who aim to build better models.