ImageNet now comes with privacy protection.
What’s new: The team that manages the machine learning community’s go-to image dataset blurred all the human faces pictured in it and tested how models trained on the modified images performed on a variety of image recognition tasks. The faces originally were included without consent.
How it worked: The team used Amazon’s Rekognition platform to find faces in ImageNet’s nearly 1.5 million examples.
- Rekognition drew a bounding box around each of over 500,000 faces. (Some images contained more than one face.) Crowdsourced workers checked the model’s work and corrected errors where necessary. Then the team applied Gaussian blur within each bounding box (a simplified sketch of this step appears after this list).
- The authors trained 24 image recognition architectures on the original ImageNet and copies of the same architectures on the blurred version, and compared their performance. On average, the models trained on blurred images were less accurate by under 1 percent. However, accuracy dropped sharply for objects typically found close to a face, such as masks (-8.71 percent) and harmonicas (-8.93 percent).
- They tested the blurred data’s effect on transfer learning by pretraining models on the unmodified and blurred versions of ImageNet and fine-tuning them for object recognition, scene recognition, object detection, and facial attribute classification (whether a person is smiling, wearing glasses, and the like); a minimal fine-tuning sketch follows this list. The models trained on blurred images performed roughly as well as those trained on unmodified ImageNet.
- The face-blurred ImageNet will become the new official version, according to VentureBeat.
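The detect-and-blur step can be approximated in a few lines. The sketch below is illustrative rather than the authors’ exact pipeline: it calls Amazon Rekognition through boto3 (assuming AWS credentials are configured), then applies Pillow’s Gaussian blur within each returned bounding box. The blur radius is an arbitrary choice, and the crowdsourced correction step is omitted.

```python
# Illustrative sketch: detect faces with Amazon Rekognition and blur them.
# Assumes AWS credentials are configured; the blur radius is a placeholder,
# not the authors' setting.
import boto3
from PIL import Image, ImageFilter

rekognition = boto3.client("rekognition")

def blur_faces(path_in: str, path_out: str, radius: int = 15) -> None:
    with open(path_in, "rb") as f:
        response = rekognition.detect_faces(Image={"Bytes": f.read()})

    img = Image.open(path_in).convert("RGB")
    w, h = img.size
    for face in response["FaceDetails"]:
        box = face["BoundingBox"]  # coordinates are fractions of image size
        left = max(0, int(box["Left"] * w))
        top = max(0, int(box["Top"] * h))
        right = min(w, left + int(box["Width"] * w))
        bottom = min(h, top + int(box["Height"] * h))
        region = img.crop((left, top, right, bottom))
        img.paste(region.filter(ImageFilter.GaussianBlur(radius)), (left, top))
    img.save(path_out)
```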
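The transfer-learning test follows the standard pretrain-then-fine-tune recipe. Here is a minimal PyTorch sketch under stated assumptions: torchvision’s ImageNet-pretrained ResNet-50 stands in for checkpoints pretrained on either the original or blurred dataset, and the data directory, class count, and hyperparameters are placeholders.

```python
# Minimal fine-tuning sketch (requires torchvision >= 0.13 for the weights API).
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

def finetune(num_classes: int, data_dir: str, epochs: int = 5) -> nn.Module:
    # Stand-in for a checkpoint pretrained on original or blurred ImageNet.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # new task head

    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    loader = torch.utils.data.DataLoader(
        datasets.ImageFolder(data_dir, preprocess), batch_size=64, shuffle=True)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```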
Behind the news: This work is part of a wider movement toward protecting privacy in machine learning data. For instance, papers submitted to CVPR in recent years proposed models that automatically blur faces and license plates in Google Street View imagery as well as in training data for autonomous vehicles and action recognition models.
Why it matters: Machine learning datasets need not violate privacy. We can develop datasets that both protect privacy and train good models.
We’re thinking: Any loss of accuracy is painful, but a small loss is worthwhile to protect privacy. There’s more to life than optimizing test-set accuracy! We expect that most ImageNet-trained applications won’t suffer from the change, as they don’t involve objects that typically appear near faces. Fine-tuning on a dataset obtained with permission might help for the rest.