Large language models are trained on datasets scraped from the web, which includes pages that contain answers to common questions that are used to test the models. How can we evaluate them if they’ve studied the answers before we give them the test?
The fear: Machine learning research marks progress based on trained models’ responses to benchmark problems they didn’t encounter during training. But the solutions to many problems used to evaluate large language models have made their way into popular training datasets, making it impossible to verify progress in precise ways. The state of the art is an illusion and researchers are shooting in the dark.
Horror stories: Researchers have found disturbing signs that the test sets of many widely used benchmarks have leaked into training sets.
- Researchers tested popular models on both GSM8K, which tests grade-school math problems, and their own set of similar problems. Models including Mixtral 8x22B-Instruct, Microsoft Phi-3-Mini, Meta-Llama-3-8B-Instruct, and Google Gemma 7B achieved scores as much as 10 percent higher on GSM8K than the alternative set. Apparently the models had seen GSM8K’s test set — or similar problems — before.
- Researchers discovered that benchmarks had contaminated the dataset used to train GPT-4. They successfully prompted GPT-4 to reproduce material from AG News (which tests models’ ability to categorize news articles), WNLI (which challenges models to resolve ambiguous pronouns in complex sentences), and XSum (which tests a model’s ability to summarize BBC news articles).
- A 2023 study evaluated GPT-4’s ability to solve competition-level coding problems. The authors found that GPT-4 could easily solve problems in Codeforces contests held before September 2021, but it struggled to solve newer ones. The authors concluded that GPT-4 likely had trained on a 2021 snapshot of Codeforces problems. (Announcing its o1-preview model in 2024, OpenAI mentioned that o1 had scored in the 89th percentile in simulated Codeforces competitions.)
- Even subjective evaluations like LMSys Chatbot Arena, which pits anonymous chatbots against each other and prompts users to judge which one generated a better answer, can be skewed if developers train their models on prompts that LMSys uses repeatedly. To address this issue, researchers built Arena-Hard and BenchBuilder, which remove the most common prompts.
How scared should you be: Leakage of benchmark test sets into training sets is a serious problem with far-reaching implications. One observer likened the current situation to an academic examination in which students gain access to questions and answers ahead of time — scores are rising, but not because the students have learned anything. If training datasets are contaminated with benchmark tests, it’s impossible to know whether apparent advances represent real progress.
Facing the fear: Contamination appears to be widespread but it can be addressed. One approach is to embed canary strings — unique markers within test datasets like BIG-bench — that enable researchers to detect contamination by checking whether a model can reproduce them. Another is to continually enhance benchmarks with new, tougher problems. Of course, researchers can devise new benchmarks, but eventually copies will appear on the web. Alternatively, they can keep new benchmarks under wraps and run them only on private servers.