It’s well established that pretraining a model on a large dataset improves performance after fine-tuning on downstream tasks. In sufficient quantity, and paired with a big enough model, even data scraped at random from the internet can deliver that boost.
What’s new: Facebook researchers led by Priya Goyal built SElf-supERvised (SEER), an image classifier pretrained on a huge number of uncurated, unlabeled images. It achieved better fine-tuned ImageNet accuracy than models pretrained on large datasets that were curated to represent particular labels.
Key insight: Large language models pretrained on billions of uncurated documents found in the wild, such as GPT-3, have achieved state-of-the-art performance after fine-tuning on a smaller dataset. Computer vision should benefit likewise from such scale.
How it works: The authors used a 1.3-billion-parameter RegNet, a convolutional neural network architecture similar to ResNet, pretrained on 1 billion images randomly scraped from Instagram.
- The pretraining procedure followed SwAV, which was devised by several of the same researchers. SwAV takes representations, in this case from the RegNet, and learns to group related images into a number of clusters by emphasizing similarities among them, much as contrastive learning does (see the sketch after this list).
- The authors fine-tuned the model on over 1 million images from ImageNet.
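To make the clustering step concrete, here is a minimal PyTorch sketch of SwAV-style pretraining: two augmented views of each image are scored against a set of learnable prototypes, soft cluster assignments (codes) are computed with a few Sinkhorn iterations, and each view is trained to predict the other view's code. The small ResNet backbone, prototype count, temperature, and Sinkhorn settings below are illustrative stand-ins, not SEER's 1.3-billion-parameter RegNet or the paper's exact hyperparameters.

```python
# Minimal sketch of SwAV-style "swapped prediction" pretraining.
# A small ResNet stands in for SEER's RegNet; all settings are illustrative.
import torch
import torch.nn.functional as F
from torchvision import models

backbone = models.resnet18(weights=None)             # stand-in for the RegNet
backbone.fc = torch.nn.Identity()                    # expose 512-d features
prototypes = torch.nn.Linear(512, 3000, bias=False)  # learnable cluster centers

@torch.no_grad()
def sinkhorn(scores, n_iters=3, eps=0.05):
    """Turn prototype scores into soft cluster assignments (codes)."""
    q = torch.exp(scores / eps).t()                  # (prototypes, batch)
    q /= q.sum()
    K, B = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True); q /= K      # normalize rows
        q /= q.sum(dim=0, keepdim=True); q /= B      # normalize columns
    return (q * B).t()                               # (batch, prototypes)

def swav_loss(view_a, view_b, temperature=0.1):
    """Each view learns to predict the cluster code computed from the other view."""
    z_a = F.normalize(backbone(view_a), dim=1)
    z_b = F.normalize(backbone(view_b), dim=1)
    with torch.no_grad():                            # keep prototypes on the unit sphere
        prototypes.weight.data = F.normalize(prototypes.weight.data, dim=1)
    scores_a, scores_b = prototypes(z_a), prototypes(z_b)
    codes_a, codes_b = sinkhorn(scores_a), sinkhorn(scores_b)
    loss_a = -(codes_b * F.log_softmax(scores_a / temperature, dim=1)).sum(1).mean()
    loss_b = -(codes_a * F.log_softmax(scores_b / temperature, dim=1)).sum(1).mean()
    return (loss_a + loss_b) / 2

# Usage on a toy batch of two augmented views (random tensors stand in for images):
view_a, view_b = torch.randn(8, 3, 224, 224), torch.randn(8, 3, 224, 224)
loss = swav_loss(view_a, view_b)
loss.backward()  # gradients flow to both the backbone and the prototypes
```

In the paper's setup, a loss along these lines is minimized over the billion uncurated Instagram images before the labeled ImageNet fine-tuning described above.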
Results: SEER achieved 84.2 percent top-1 accuracy on the ImageNet test set, 1.1 percentage points better than the best previous self-supervised pretrained model (a 795-million-parameter ResNet pretrained on ImageNet using SimCLRv2). It also beat by 4.3 percentage points a 91-million-parameter ViT pretrained on JFT-300M, a curated dataset of 300 million images from Google Search. SEER excelled at low-shot learning as well: Fine-tuned on only 10 percent of ImageNet, it achieved 77.9 percent accuracy, just 2.2 percentage points lower than a SimCLRv2 model pretrained on all of ImageNet and fine-tuned on the same 10 percent.
Why it matters: Scraping the internet could be as productive in computer vision as it has been in language processing. Just keep in mind that training models on such data risks violating privacy and consent, and it absorbs the biases, including objectionable social biases, that pervade the internet.
We’re thinking: This paper suggests a tradeoff between the costs of building a curated dataset and training on a larger corpus plucked from the wild. If the cost of curation is high for your application, maybe you can cut it and spend more on training.