Key machine learning datasets are riddled with mistakes.
What’s new: Several benchmark datasets are shot through with incorrect labels. On average, 3.4 percent of examples in 10 commonly used datasets are mislabeled, according to a new study — and the detrimental impact of such errors rises with model size.
The research: Curtis Northcutt and Anish Athalye at MIT and Jonas Mueller at Amazon trained a model to identify erroneous labels in popular datasets such as ImageNet, Amazon Reviews, and IMDB.
- Following the confident learning approach, the authors flagged an example as mislabeled if it met two conditions: the model's predicted class disagreed with the given label, and the model's confidence in the predicted class exceeded a per-class threshold, namely the model's average confidence in that class computed over all examples actually labeled with it (see the sketch after this list).
- Human reviewers vetted the flagged examples and confirmed many obvious mistakes: an image of a frog labeled “cat,” an audio clip of a singer labeled “whistling,” and negative movie reviews labeled as positive. QuickDraw had the highest rate of inaccurately labeled data at 10.1 percent; MNIST had the lowest at 0.15 percent.
- The authors fixed the bad labels to produce corrected test sets, then measured how well different models classified them. Smaller models like ResNet-18 and VGG-11 outperformed larger ones like NASNet and VGG-19.
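To make the two flagging conditions concrete, here is a minimal sketch of the rule in plain numpy. The function name and inputs are our illustration, not the paper's code (the authors' open-source implementation of confident learning is the cleanlab library), and it assumes pred_probs holds out-of-sample predicted probabilities, e.g., from cross-validation, since in-sample confidences would be inflated.

```python
import numpy as np

def flag_label_issues(pred_probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Flag likely mislabeled examples via the confident-learning rule.

    pred_probs: (n_examples, n_classes) out-of-sample predicted probabilities.
    labels: (n_examples,) integer given labels.
    Returns a boolean mask marking examples flagged as likely mislabeled.
    """
    n_classes = pred_probs.shape[1]

    # Per-class threshold: the model's average confidence in class j,
    # computed over the examples that are actually labeled j.
    thresholds = np.array(
        [pred_probs[labels == j, j].mean() for j in range(n_classes)]
    )

    predicted = pred_probs.argmax(axis=1)

    # Condition 1: the prediction disagrees with the given label.
    disagrees = predicted != labels
    # Condition 2: confidence in the predicted class exceeds
    # that class's own threshold.
    confident = pred_probs[np.arange(len(labels)), predicted] > thresholds[predicted]

    return disagrees & confident
```

Note that the per-class threshold makes the rule self-calibrating: a model that is systematically underconfident about some class won't end up flagging every example bearing that class's label.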
Why it matters: It’s well known that machine learning datasets contain a fair percentage of errors. Previous inquiries into the problem focused on training rather than test sets, and found that training on a small percentage of incorrect labels didn’t hurt deep learning performance. But accuracy on a test set that’s rife with errors is not a true measure of a model’s ability, and bad labels in the test set have a disproportionate impact on bigger models.
We’re thinking: It’s time for our community to shift from model-centric to data-centric AI development. For many problems, state-of-the-art models work well enough that tinkering with their architecture yields little gain, and the most direct path to improved performance is to systematically improve the data your algorithm learns from. You can check out Andrew’s recent talk on the subject here. #DataCentricAI