Less than a month after XLNet overtook BERT, the pole position in natural language understanding has changed hands again. RoBERTa, an improved BERT pretraining recipe, beats its forebear to become, for the moment, the new state-of-the-art language model.
What’s new: Researchers at Facebook AI and the University of Washington modified BERT to beat the best published results on three popular benchmarks.
Key insight: Since BERT’s debut late last year, success in language modeling has been fueled not only by bigger models but also by an order of magnitude more data, more passes through the training set, and larger batch sizes. RoBERTa shows that these training choices can have a greater impact on performance than advances in model architecture.
How it works: RoBERTa uses the BERT-large configuration (355 million parameters) with an altered pretraining pipeline. Yinhan Liu and her colleagues made the following changes (code sketches of the pretraining and fine-tuning setups follow the list):
- Increased training data from 16GB to 160GB of text by including three additional datasets.
- Boosted batch size from 256 to 8,000 sequences.
- Raised the number of pretraining steps from 31,000 to 500,000. (At a batch size of 8,000, the 31,000-step baseline processes roughly as many sequences as BERT’s original 1 million steps at a batch size of 256.)
- Removed the next sentence prediction (NSP) loss term from the training objective and used full-sentence sequences as input instead of segment pairs.
- Fine-tuned for two of the nine tasks in the GLUE natural language understanding benchmark as well as for SQuAD (question answering) and RACE (reading comprehension).
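To make these changes concrete, here is a minimal sketch of an NSP-free masked language modeling setup, assuming the Hugging Face transformers and datasets libraries. The corpus path, packing strategy, and hyperparameter values are illustrative stand-ins, not the paper’s exact recipe.

```python
# Sketch only: an NSP-free masked language modeling setup in the spirit of
# the changes listed above. Corpus path, packing, and hyperparameters are
# placeholders, not the paper's exact recipe.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

# BERT-large-sized architecture: 24 layers, hidden size 1,024, 16 heads.
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    num_hidden_layers=24,
    hidden_size=1024,
    num_attention_heads=16,
    intermediate_size=4096,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)

# Placeholder corpus; the paper draws its 160GB of text from several datasets.
raw = load_dataset("text", data_files={"train": "corpus.txt"})["train"]

def tokenize(batch):
    # Single contiguous sequences up to 512 tokens; no segment pairs, no NSP labels.
    return tokenizer(batch["text"], truncation=True, max_length=512)

train = raw.map(tokenize, batched=True, remove_columns=["text"])

# Masked-token prediction is the only pretraining objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="roberta-pretrain",
    max_steps=500_000,                  # 500,000 updates
    per_device_train_batch_size=32,     # combined with accumulation and many
    gradient_accumulation_steps=250,    # devices to approach 8,000 sequences per step
    learning_rate=4e-4,
    warmup_steps=30_000,
    weight_decay=0.01,
)

Trainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=collator,
).train()
```

In practice the 8,000-sequence batches come from many accelerators rather than a huge accumulation factor; the sketch only shows where each listed change enters the pipeline.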
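Fine-tuning follows the standard single-task pattern. Below is a similarly hedged sketch that adapts a pretrained RoBERTa checkpoint to one GLUE task (MRPC is chosen only for illustration); the hyperparameters are placeholders rather than the settings reported in the paper.

```python
# Illustrative single-task fine-tuning of a pretrained RoBERTa checkpoint
# on a GLUE task. Task choice and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-large")
model = RobertaForSequenceClassification.from_pretrained("roberta-large", num_labels=2)

glue = load_dataset("glue", "mrpc")

def encode(batch):
    # Sentence pairs are concatenated by the tokenizer; labels pass through.
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, max_length=128, padding="max_length")

encoded = glue.map(encode, batched=True)

args = TrainingArguments(
    output_dir="roberta-mrpc",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
)

Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
).train()
```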
Results: RoBERTa achieves state-of-the-art performance on GLUE without multi-task fine-tuning, on SQuAD without additional data (unlike BERT and XLNet), and on RACE.
Yes, but: As the authors point out, the comparison would be fairer if XLNet and other language models were fine-tuned as rigorously as RoBERTa. The success of intensive fine-tuning raises the question of whether researchers with limited resources can obtain state-of-the-art results on the problems they care about.
Why it matters: The authors show that rigorous hyperparameter tuning and dataset size can play a decisive role in performance. The study highlights the importance of proper evaluation procedures for all new machine learning techniques.
We’re thinking: Researchers are just beginning to assess the impact of hyperparameter tuning and dataset size on complex neural network architectures at the scale of 100 million to 1 billion parameters. BERT is an early beneficiary, and there’s much more exploration to be done.