Dear friends,
Over the last two weeks, I described the importance of clean, consistent labels and how to use human-level performance (HLP) to trigger a review of the labeling instructions.
When training examples are labeled inconsistently, an AI that beats HLP on the test set might not actually perform better than humans in practice. Take speech recognition. If humans transcribing an audio clip were to label the same speech disfluency “um” (a U.S. version) 70 percent of the time and “erm” (a UK variation) 30 percent of the time, then HLP would be low. Two randomly chosen labelers would agree only 58 percent of the time ($0.7^2 + 0.3^2 = 0.58$). An AI model could gain a statistical advantage by picking “um” all of the time, which would match the human-supplied label 70 percent of the time. Thus, the AI would beat HLP without being more accurate in a way that matters.
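To make the arithmetic concrete, here is a minimal Python sketch of the two agreement rates in the “um”/“erm” example. The function names are illustrative, and the sketch assumes labelers choose independently with the stated 70/30 frequencies.

```python
def pairwise_agreement(label_probs):
    """Chance that two independent labelers pick the same label."""
    return sum(p * p for p in label_probs.values())


def constant_model_agreement(label_probs):
    """Agreement with a random labeler for a model that always outputs the most common label."""
    return max(label_probs.values())


# Label frequencies from the "um" / "erm" example.
label_probs = {"um": 0.7, "erm": 0.3}

hlp = pairwise_agreement(label_probs)
model = constant_model_agreement(label_probs)

print(f"labeler vs. labeler: {hlp:.2f}")
print(f"always-'um' model vs. labeler: {model:.2f}")
```

The first figure comes out to about 0.58 and the second to 0.70, so the constant model "beats" HLP purely because of the label inconsistency.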
Labeling training data consistently is particularly important for small data problems. Innovations like data synthesis using generative adversarial networks, data augmentation, transfer learning, and self-supervision expand the possibilities for small data. But when I’m trying to train a neural network on 1,000 examples, the first thing I do is make sure they’re labeled consistently.
Let’s continue with last week’s example of determining if a scratch is significant based on its length. If the labels are noisy (say, different labelers used different thresholds for labeling a scratch as significant, as in the left-hand graph in the image above), an algorithm will need a large number of examples to determine the optimal threshold. But if the data is clean, with all the labelers agreeing on the length that causes the label to switch from 0 to 1 (the right-hand graph), the optimal threshold is clear.
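The scratch example can be explored with a small simulation. The sketch below is a toy setup, not the data behind the graphs: the true threshold of 0.5, the uniform spread of labeler thresholds, the uniform scratch lengths, and all function names are assumptions made for illustration. It fits a threshold by minimizing training errors and reports how far the learned threshold lands from the true one, on average, for clean and noisy labels.

```python
import random

TRUE_THRESHOLD = 0.5  # hypothetical "true" scratch length, arbitrary units


def make_labels(lengths, labeler_spread, rng):
    """Label each scratch 1 (significant) or 0, with each labeler using a personal threshold."""
    labels = []
    for x in lengths:
        threshold = TRUE_THRESHOLD + rng.uniform(-labeler_spread, labeler_spread)
        labels.append(1 if x >= threshold else 0)
    return labels


def fit_threshold(lengths, labels):
    """Return the cut point with the fewest training errors (predict 1 when length >= cut)."""
    pairs = sorted(zip(lengths, labels))
    # Start with the cut below every example: everything is predicted 1.
    errors = sum(1 for _, y in pairs if y == 0)
    best_errors, best_cut = errors, 0.0
    for i, (x, y) in enumerate(pairs):
        # Moving the cut just past x flips this example's prediction to 0.
        errors += 1 if y == 1 else -1
        next_x = pairs[i + 1][0] if i + 1 < len(pairs) else 1.0
        if errors < best_errors:
            best_errors, best_cut = errors, (x + next_x) / 2
    return best_cut


def mean_threshold_error(m, labeler_spread, trials=200, seed=0):
    """Average distance between the learned and true threshold over repeated trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        lengths = [rng.random() for _ in range(m)]
        labels = make_labels(lengths, labeler_spread, rng)
        total += abs(fit_threshold(lengths, labels) - TRUE_THRESHOLD)
    return total / trials


for m in (50, 200, 800):
    clean = mean_threshold_error(m, labeler_spread=0.0)
    noisy = mean_threshold_error(m, labeler_spread=0.2)
    print(f"m={m:4d}  clean labels: {clean:.4f}  noisy labels: {noisy:.4f}")
```

With `labeler_spread=0.0` every labeler uses the same threshold, so the learned cut falls in the gap between the longest "0" scratch and the shortest "1" scratch. With a nonzero spread, the labels near the threshold conflict and the learned cut wanders more at the same training set size.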
Learning theory affirms that the number of examples needed is significantly lower when the data is consistently labeled. In the simple example above, the error decreases on the order of $1/\sqrt{m}$ in the case on the left, and $1/m$ in the case on the right, where $m$ is the training set size. Thus, error decreases much faster when the labels are consistent, and the algorithm needs far fewer examples to do well.
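As a rough worked example of what those two rates imply, the sketch below takes them at face value and sets every constant factor to 1, which is a strong simplifying assumption. The function name and the choice of target errors are illustrative. For a target error of 1/k, the consistent case then needs about k examples and the noisy case about k squared.

```python
def examples_needed(target_error_inverse, consistent_labels):
    """Rough training set size for error 1/k, taking the stated rates at face value.

    Assumes error == 1/m (consistent labels) or 1/sqrt(m) (noisy labels)
    with all constant factors set to 1, which real problems will not obey.
    """
    k = target_error_inverse
    return k if consistent_labels else k * k


for k in (10, 100):
    clean_m = examples_needed(k, consistent_labels=True)
    noisy_m = examples_needed(k, consistent_labels=False)
    print(f"target error 1/{k}: about {clean_m} clean examples vs. {noisy_m} noisy ones")
```

Under these assumptions, reaching an error of 1/100 takes on the order of 100 consistently labeled examples versus 10,000 noisy ones. The exact numbers depend on constants the rates leave out, but the gap widens as the target error shrinks.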
Clean labels are generally helpful. You might be better able to get away with noisy labels when you have 1 million examples, since the algorithm can average over them. And it’s certainly much harder to revise 1 million labels than 1,000. But clean labels are worthwhile for all machine learning problems and particularly important if you’re working with small data.
Keep learning!
Andrew