Dear friends,
Last week, I wrote about the limitation of using human-level performance (HLP) as a metric to beat in machine learning applications for manufacturing and other fields. In this letter, I would like to show why beating HLP isn’t always the best way to improve performance.
In many machine learning problems, labels are determined by a person who evaluates the same sort of input as a learning algorithm would. For instance, a human labeler may look at a picture of a phone to determine if it’s scratched, and an algorithm would examine a similar picture to learn to detect scratches. (Note that this is not always the case. A human labeling a cancer diagnosis on an X-ray image may also rely on a tissue biopsy from the patient, while an algorithm would use the resulting dataset to learn to diagnose cancer based on images alone.)
When a human determined the labels by looking at the same input an algorithm would use, what are we to make of situations in which HLP is well below 100 percent? This just means that different people labeled the data differently. For example, the ground-truth labeler who created a test set may have labeled a particular phone as scratched, while a different labeler judged the same phone not to be scratched and thus, relative to the ground truth, mislabeled it. If the second labeler disagreed with the ground-truth labeler on 1 out of 10 examples, then HLP in this task would be 90 percent.
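To make that arithmetic concrete, here's a minimal sketch in Python. The labels are hypothetical, but it shows how HLP can be measured as the agreement rate between a second labeler and the ground truth:

```python
# Hypothetical example: HLP measured as the agreement rate between a
# second labeler and the ground-truth labels on the same 10 phones.
# 1 = scratched, 0 = not scratched.

ground_truth   = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # labels from the ground-truth labeler
second_labeler = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]  # disagrees on the 4th example

agreements = sum(gt == label for gt, label in zip(ground_truth, second_labeler))
hlp = agreements / len(ground_truth)
print(f"HLP: {hlp:.0%}")  # 90% -- one disagreement out of ten examples
```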
In this situation, rather than trying to build a learning algorithm that beats the 90 percent HLP by achieving, say, 91 percent accuracy, it would be better to look into how the two labelers formed their judgments and try to help them make their labels more consistent.
For example, all labelers may agree that scratches smaller than 1 mm are not significant (y=0), and scratches greater than 3 mm are significant (y=1), but they label scratches between 1 mm and 3 mm inconsistently. If we can spot this problem and get the labelers to agree on a consistent standard — say, that 1.5 mm is the point at which the labels should switch from y=0 to y=1 — then we’ll end up with less noisy labels.
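One benefit of an agreed standard is that it can be written down as an explicit, deterministic rule. Here's a brief sketch of what that might look like, using the 1.5 mm cutoff from the example above:

```python
# Hypothetical labeling rule after the labelers agree on a standard:
# any scratch 1.5 mm or longer is significant (y=1); shorter is not (y=0).

SIGNIFICANT_SCRATCH_MM = 1.5  # the agreed-upon cutoff

def label_scratch(length_mm: float) -> int:
    """Return y=1 if the scratch is significant, else y=0."""
    return 1 if length_mm >= SIGNIFICANT_SCRATCH_MM else 0

# The previously ambiguous 1 mm to 3 mm range is now labeled consistently.
for length in [0.8, 1.2, 1.5, 2.0, 3.5]:
    print(f"{length} mm -> y={label_scratch(length)}")
```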
Setting standards that make labels more consistent will actually raise HLP, because humans now agree with one another more frequently. At the same time, having more consistently labeled data will result in better machine learning performance. In many practical applications, this improvement matters more than the academic question of whether an algorithm beats HLP.
HLP does have a role to play in establishing baseline performance for estimating irreducible, or Bayes, error, which in turn helps with error analysis. You can learn more about this in Deep Learning Specialization Course 3 and Machine Learning Yearning.
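As a rough sketch of how HLP feeds into error analysis, following the bias/variance decomposition taught in those materials (all error numbers below are made up for illustration):

```python
# Hypothetical error analysis using HLP as a proxy for Bayes error,
# following the decomposition in Machine Learning Yearning.

hlp_error = 0.10       # 10% error: proxy for irreducible (Bayes) error
training_error = 0.14  # error on the training set
dev_error = 0.18       # error on the dev set

avoidable_bias = training_error - hlp_error  # room to fit the training data better
variance = dev_error - training_error        # gap between training and dev sets

print(f"Avoidable bias: {avoidable_bias:.0%}, variance: {variance:.0%}")
```

With these numbers, you'd prioritize reducing bias and variance about equally; if HLP itself is inflated by labeler inconsistency, fixing the labels shifts the whole analysis.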
But the message I hope you’ll take away from this letter is that, when a human labeler has created the class labels that constitute ground truth and HLP is significantly less than 100 percent, we shouldn’t just set out to beat HLP. We should take the deficit in human performance as a sign that we should explore how to redefine the labels to reduce variability.
Keep learning!
Andrew