Benchmarks

When Agents Train Algorithms: OpenAI’s MLE-bench tests AI coding agents

Coding agents are improving, but can they tackle machine learning tasks? 

Does Your Model Comply With the AI Act?: COMPL-AI study measures LLMs’ compliance with the EU’s AI Act

A new study suggests that leading AI models may meet the requirements of the European Union’s AI Act in some areas, but probably not in others.

Benchmark Tests Are Meaningless: The problem with training data contamination in machine learning

The universe of web pages includes correct answers to common questions that are used to test large language models. How can we evaluate new models if they’ve studied the answers before we give them the test?
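
One rough check that labs commonly run is n-gram overlap between benchmark items and the training corpus. A minimal sketch of that idea (the n-gram length of 8 is an arbitrary assumption; published audits use various lengths):

```python
# Flag a benchmark item as possibly contaminated if any of its 8-grams
# appears in the training corpus. Illustrative only; real audits
# normalize text more carefully and use corpus-scale data structures.
def ngrams(text: str, n: int = 8) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_item: str, corpus: str, n: int = 8) -> bool:
    """True if the test item shares at least one n-gram with the corpus."""
    return not ngrams(test_item, n).isdisjoint(ngrams(corpus, n))
```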

Mistral AI Sharpens the Edge: Mistral AI unveils Ministral 3B and 8B models, outperforming rivals in small-scale AI

Mistral AI launched two models that raise the bar for language models with 8 billion or fewer parameters, small enough to run on many edge devices.

Models Ranked for Hallucinations: Measuring language model hallucinations during information retrieval

How often do large language models make up information when they generate text based on a retrieved document? A study evaluated the tendency of popular models to hallucinate while performing retrieval-augmented generation (RAG). 
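
The study’s exact judging method isn’t reproduced here, but a toy version of the underlying question, whether each generated sentence is supported by the retrieved document, can be sketched with simple word overlap (the tokenization and 0.6 threshold are arbitrary assumptions):

```python
# Flag generated sentences as potentially unsupported when too few of
# their content words appear in the retrieved document. Illustrative
# only; serious evaluations use entailment models or human judges.
import re

def content_words(text: str) -> set:
    """Lowercase alphabetic tokens longer than 3 characters."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}

def unsupported_sentences(answer: str, document: str, threshold: float = 0.6) -> list:
    """Return sentences whose content-word overlap with the document falls below threshold."""
    doc_words = content_words(document)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = content_words(sentence)
        if words and len(words & doc_words) / len(words) < threshold:
            flagged.append(sentence)
    return flagged

doc = "The Eiffel Tower, completed in 1889, stands 330 meters tall in Paris."
answer = "The Eiffel Tower stands 330 meters tall. It was designed by Leonardo da Vinci."
print(unsupported_sentences(answer, doc))  # flags the da Vinci sentence
```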

Image Generators in the Arena: Text-to-image generators face off in arena leaderboard by Artificial Analysis

An arena-style contest pits the world’s best text-to-image generators against each other.
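
Arena leaderboards of this kind typically convert pairwise human votes into Elo-style ratings. A minimal sketch of one rating update (the K-factor of 32 is an arbitrary assumption, not Artificial Analysis’s exact method):

```python
# Update two models' Elo-style ratings after one head-to-head human vote.
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple:
    # Expected score of A given the current rating gap.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

ra, rb = elo_update(1000.0, 1000.0, a_won=True)  # winner gains what the loser gives up
```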

Challenging Human-Level Models: Hugging Face overhauls open LLM leaderboard with tougher benchmarks

An influential ranking of open models revamped its criteria, as large language models approach human-level performance on popular tests.

Benchmarks for Agentic Behaviors: New LLM benchmarks for Tool Use and Planning in workplace tasks

Tool use and planning are key behaviors in agentic workflows that enable large language models (LLMs) to execute complex sequences of steps. New benchmarks measure these capabilities in common workplace tasks. 
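
The benchmarks’ own task format isn’t shown here, but scoring tool use generally comes down to comparing a model’s predicted tool call against a reference call. A hypothetical sketch (the ToolCall schema and partial-credit rule are assumptions for illustration):

```python
# Score one tool-use benchmark item: did the model pick the right tool,
# and did it supply the right arguments?
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict

def score_tool_use(predicted: ToolCall, gold: ToolCall) -> float:
    """1.0 for an exact match, 0.5 for the right tool with wrong arguments, else 0."""
    if predicted.name != gold.name:
        return 0.0
    return 1.0 if predicted.args == gold.args else 0.5

gold = ToolCall("create_calendar_event", {"title": "Standup", "time": "09:00"})
pred = ToolCall("create_calendar_event", {"title": "Standup", "time": "10:00"})
print(score_tool_use(pred, gold))  # 0.5
```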

Private Benchmarks for Fairer Tests: Scale AI launches SEAL leaderboards to benchmark model performance

Scale AI’s Safety, Evaluations and Alignment Lab (SEAL) launched leaderboards that rank models on the company’s own private benchmarks.

Benchmarks for Industry: Vals AI evaluates large language models on industry-specific tasks

How well do large language models respond to professional-level queries in various industry domains? A new company aims to find out.

Sample-Efficient Training for Robots: Reinforcement learning from human feedback to train robots

Training an agent that controls a robot arm to perform a task — say, opening a door — that involves a sequence of motions (reach, grasp, turn, pull, release) can take from tens of thousands to millions of examples...
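
Reinforcement learning from human feedback cuts that sample count by fitting a reward model to pairwise human preferences instead of hand-coding a reward. A minimal sketch of that fitting step using the Bradley-Terry objective (network size, segment length, and observation shape are illustrative assumptions, not the paper’s setup):

```python
# Learn a scalar reward from pairwise human preferences: the preferred
# trajectory segment should receive the higher predicted return.
import torch
import torch.nn as nn

obs_dim = 16  # assumed observation size per timestep

reward_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

def segment_return(segment: torch.Tensor) -> torch.Tensor:
    """Sum predicted per-step rewards over a (steps, obs_dim) segment."""
    return reward_net(segment).sum()

# One update from a single human comparison: seg_a was preferred over seg_b.
seg_a, seg_b = torch.randn(50, obs_dim), torch.randn(50, obs_dim)
logits = torch.stack([segment_return(seg_a), segment_return(seg_b)])
loss = nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```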

When Trees Outdo Neural Networks: Decision trees perform best on most tabular data

While neural networks perform well on image, text, and audio datasets, they fall behind decision trees and their variations for tabular datasets. New research looked into why.
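
A small-scale version of that comparison is easy to run with scikit-learn (dataset choice, model settings, and scoring below are illustrative assumptions, not the paper’s benchmark suite):

```python
# Compare gradient-boosted trees against a multilayer perceptron on a
# standard tabular regression dataset.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = fetch_california_housing(return_X_y=True)

models = {
    "gradient-boosted trees": HistGradientBoostingRegressor(random_state=0),
    # Feature scaling matters for the MLP but not for the trees.
    "MLP": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=500, random_state=0),
    ),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=3, scoring="r2").mean()
    print(f"{name}: R^2 = {r2:.3f}")
```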

Humanized Training for Robot Arms: New research improves robot performance and adaptability

Robots trained via reinforcement learning usually study videos of robots performing the task at hand. A new approach used videos of humans to pre-train robotic arms.

Toward Next-Gen Language Models: New benchmarks test the limits of large language models

A new benchmark aims to raise the bar for large language models. Researchers at 132 institutions worldwide introduced the Beyond the Imitation Game benchmark (BIG-bench), which includes tasks that humans perform well but current state-of-the-art models don’t.

AI Progress Report: Stanford University's fifth annual AI Index report for 2022

A new study showcases AI’s growing importance worldwide. The fifth annual AI Index from Stanford University’s Institute for Human-Centered AI documents rises in funding, regulation, and performance.