Scaling Laws for Data Quality
Scaling laws reveal the impact of data quality in vision-language model training

Figure: Given an initial data pool of 128M samples, the authors trained ViT-B/32 CLIP models for a total of 640M samples (including repeats).

When training vision-language models, developers often remove lower-quality examples from the training set. But keeping only the highest-quality examples may not be ideal, researchers found.

What's new: Sachin Goyal, Pratyush Maini, and colleagues at Carnegie Mellon University derived scaling laws for filtering data that describe how the utility of examples — in terms of how much they increase performance (or decrease loss) — falls when they are used over and over again in training.

Key insight: When computational resources are limited relative to the amount of data available, some AI developers try to select the highest-quality examples and train on them for multiple iterations. However, the utility of examples declines a little bit every time they’re used. As computational resources rise, it’s better to introduce new examples even if they’re of slightly lower quality. 

How it works: The authors used 128 million text-image pairs from DataComp to train various CLIP models, varying the data quality and number of times a model saw each example during training. 

  • The authors divided the dataset into 10 subsets of graduated quality, each containing 10 percent of the examples. They evaluated quality according to Text Masking and Re-Scoring (T-MARS) scores from a pretrained CLIP, which measure the similarity between CLIP embeddings of an image and its corresponding text (a minimal sketch of this bucketing appears after this list).
  • They trained a model on each subset, repeating that subset up to 10 times. After each repetition, they evaluated the model’s error rate on ImageNet classification, then fit a scaling curve to the resulting error rates (see the curve-fitting sketch after this list). 
  • They calculated scaling curves for combinations of subsets (for example, the highest-quality 30 percent of examples) by taking a weighted average of the scaling curves of the individual subsets. 
  • To verify the scaling curves, the authors trained nine CLIP models, one for each combination of the highest-quality 10 percent, 30 percent, or 40 percent of examples and a budget of 32 million, 128 million, or 640 million examples presented during training (including repeats).
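
Here is a minimal sketch of the quality bucketing described in the first bullet, assuming CLIP image and text embeddings have already been computed. The random arrays, toy sizes, and variable names are placeholders; the actual pipeline scores DataComp with T-MARS, which also masks text rendered inside images before scoring.

```python
# Sketch: rank image-text pairs by CLIP similarity and split into quality deciles.
# img_emb and txt_emb stand in for precomputed, L2-normalized CLIP embeddings.
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 512                            # toy pool size and embedding width
img_emb = rng.normal(size=(n, d))
txt_emb = rng.normal(size=(n, d))
img_emb /= np.linalg.norm(img_emb, axis=1, keepdims=True)
txt_emb /= np.linalg.norm(txt_emb, axis=1, keepdims=True)

quality = (img_emb * txt_emb).sum(axis=1)     # cosine similarity per pair
order = np.argsort(-quality)                  # indices from best to worst
deciles = np.array_split(order, 10)           # ten subsets of ~10 percent each

top_30_percent = np.concatenate(deciles[:3])  # e.g., the highest-quality 30 percent
```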
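
The curve fitting and averaging in the second and third bullets might look like the sketch below. The power-law-plus-floor functional form, the noise level, and every number here are illustrative assumptions, not the paper's exact parameterization or measurements.

```python
# Sketch: fit a scaling curve to each subset's ImageNet error rate as a function
# of how many times the subset has been repeated, then average the fitted curves
# to predict a combined pool (e.g., the top three deciles = the top 30 percent).
import numpy as np
from scipy.optimize import curve_fit

def scaling_curve(reps, a, b, c):
    # Error falls as a subset is repeated, with diminishing returns toward a floor c.
    return a * reps ** (-b) + c

reps = np.arange(1, 11, dtype=float)          # each subset repeated up to 10 times
rng = np.random.default_rng(0)

# Made-up "measured" error rates for the three highest-quality deciles.
toy_truth = [(0.50, 0.40, 0.28), (0.55, 0.35, 0.30), (0.60, 0.30, 0.33)]
fitted = []
for a, b, c in toy_truth:
    measured = scaling_curve(reps, a, b, c) + rng.normal(0, 0.003, reps.size)
    params, _ = curve_fit(scaling_curve, reps, measured, p0=(0.5, 0.5, 0.3))
    fitted.append(params)

def combined_curve(reps, fitted, weights=None):
    # Weighted average of per-decile curves; equal weights for equal-sized deciles.
    weights = np.full(len(fitted), 1.0 / len(fitted)) if weights is None else weights
    return sum(w * scaling_curve(reps, *p) for w, p in zip(weights, fitted))

print(combined_curve(np.array([1.0, 5.0, 10.0]), fitted))  # toy top-30% predictions
```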

Results: The authors rated each model’s performance according to the average across 18 visual tasks, mostly involving classification accuracy (including ImageNet). The more examples a model saw during training, the more its performance benefited from including lower-quality examples alongside the highest-quality ones. Of the models that saw 32 million examples, the one trained on the highest-quality 10 percent of examples performed best. Of the models that saw 128 million examples, the one trained on the highest-quality 30 percent performed best. Of the models that saw 640 million examples, the one trained on the highest-quality 40 percent performed best. These results confirmed the predictions of the scaling curves.

Why it matters: Pretraining vision-language models on only a fixed fraction of the highest-quality examples is not always ideal. A better approach is to choose that fraction based on the available compute budget: train on a small amount of data to fit scaling curves, then use those curves to decide how aggressively to filter, as sketched below.
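
To make that recipe concrete, here is a toy sketch in which each additional pass over the kept pool contributes geometrically less effective data, and keeping more (lower-quality) data raises an error coefficient. The decay factor, the coefficients, and the functional form are assumptions chosen for illustration, not values from the paper.

```python
# Sketch: choose a filtering fraction for a given compute budget under a toy
# decaying-utility model. Each pass over the kept pool contributes `delta` times
# as much effective data as the previous pass; keeping more (lower-quality) data
# raises the error coefficient `a`. All parameters are hypothetical, picked so
# the toy model reproduces the qualitative crossover reported above.

POOL = 128e6                                   # full data pool, as in the paper's setup
DELTA = 0.5                                    # assumed per-repetition utility decay

def predicted_error(frac, budget, a, b=0.3, floor=0.25, delta=DELTA):
    size = frac * POOL                         # examples kept after filtering
    passes = budget / size                     # repetitions of the kept pool
    # Effective samples: geometric sum over passes, capped at the samples actually seen.
    effective = min(size * (1 - delta ** passes) / (1 - delta), budget)
    return a * effective ** (-b) + floor

pools = {0.1: 60.0, 0.3: 70.0, 0.4: 75.0}      # hypothetical quality coefficients

for budget in (32e6, 128e6, 640e6):
    errors = {f: predicted_error(f, budget, a) for f, a in pools.items()}
    best = min(errors, key=errors.get)
    print(f"budget {budget/1e6:.0f}M samples -> best fraction: {best:.0%}")
```

In this toy setup, the small high-quality pool wins when it is repeated only a few times, but once repeats dominate, adding lower-quality fresh data pays off, mirroring the 10/30/40 percent pattern above.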

We're thinking: This work affirms the fundamental principle of Data-centric AI: Systematically engineering training data is essential for getting optimal performance from a given architecture. However, it shows that using only the highest-quality data works best with smaller compute budgets. With more compute, lower-quality data can improve performance more than repeating the highest-quality examples too many times.
