Scaling Laws for Data Quality
Scaling laws reveal the impact of data quality in vision-language model training

Figure: Given an initial data pool of 128M samples, the authors trained ViT-B/32 CLIP models for a total of 640M samples (including repeats).

When training vision-language models, developers often remove lower-quality examples from the training set. But keeping only the highest-quality examples may not be ideal, researchers found.

What's new: Sachin Goyal, Pratyush Maini, and colleagues at Carnegie Mellon University derived scaling laws for filtering data that describe how the utility of examples — in terms of how much they increase performance (or decrease loss) — falls when they are used over and over again in training.

Key insight: When computational resources are limited relative to the amount of data available, some AI developers try to select the highest-quality examples and train on them for multiple iterations. However, the utility of examples declines a little bit every time they’re used. As computational resources rise, it’s better to introduce new examples even if they’re of slightly lower quality. 

How it works: The authors used 128 million text-image pairs from DataComp to train various CLIP models, varying the data quality and number of times a model saw each example during training. 

  • The authors divided the dataset into 10 subsets of graduated quality, each containing 10 percent of the examples. They evaluated quality according to Text Masking and Re-Scoring (T-MARS) scores from a pretrained CLIP, which measure the similarity between CLIP embeddings of an image and its corresponding text (a minimal sketch of this bucketing appears after this list).
  • They trained a model on each subset, repeating that subset up to 10 times. After each repetition, they evaluated the model’s error rate on ImageNet classification, then fit a scaling curve to the resulting error rates (see the curve-fitting sketch after this list). 
  • They calculated scaling curves for combinations of subsets (for example, the highest-quality 30 percent of examples) by taking a weighted average of the scaling curves of the individual subsets. 
  • To verify the scaling curves, the authors trained nine CLIP models, one for each combination of the highest-quality 10 percent, 30 percent, or 40 percent of examples and a budget of 32 million, 128 million, or 640 million examples presented during training (including repeats).
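
Here is a minimal sketch of the quality bucketing described in the first bullet, assuming CLIP image and text embeddings have already been computed. The random arrays, toy sizes, and variable names are placeholders; the actual pipeline scores DataComp with T-MARS, which also masks text rendered inside images before scoring.

```python
# Sketch: rank image-text pairs by CLIP similarity and split into quality deciles.
# img_emb and txt_emb stand in for precomputed, L2-normalized CLIP embeddings.
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 512                            # toy pool size and embedding width
img_emb = rng.normal(size=(n, d))
txt_emb = rng.normal(size=(n, d))
img_emb /= np.linalg.norm(img_emb, axis=1, keepdims=True)
txt_emb /= np.linalg.norm(txt_emb, axis=1, keepdims=True)

quality = (img_emb * txt_emb).sum(axis=1)     # cosine similarity per pair
order = np.argsort(-quality)                  # indices from best to worst
deciles = np.array_split(order, 10)           # ten subsets of ~10 percent each

top_30_percent = np.concatenate(deciles[:3])  # e.g., the highest-quality 30 percent
```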
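
The curve fitting and averaging in the second and third bullets might look like the sketch below. The power-law-plus-floor functional form, the noise level, and every number here are illustrative assumptions, not the paper's exact parameterization or measurements.

```python
# Sketch: fit a scaling curve to each subset's ImageNet error rate as a function
# of how many times the subset has been repeated, then average the fitted curves
# to predict a combined pool (e.g., the top three deciles = the top 30 percent).
import numpy as np
from scipy.optimize import curve_fit

def scaling_curve(reps, a, b, c):
    # Error falls as a subset is repeated, with diminishing returns toward a floor c.
    return a * reps ** (-b) + c

reps = np.arange(1, 11, dtype=float)          # each subset repeated up to 10 times
rng = np.random.default_rng(0)

# Made-up "measured" error rates for the three highest-quality deciles.
toy_truth = [(0.50, 0.40, 0.28), (0.55, 0.35, 0.30), (0.60, 0.30, 0.33)]
fitted = []
for a, b, c in toy_truth:
    measured = scaling_curve(reps, a, b, c) + rng.normal(0, 0.003, reps.size)
    params, _ = curve_fit(scaling_curve, reps, measured, p0=(0.5, 0.5, 0.3))
    fitted.append(params)

def combined_curve(reps, fitted, weights=None):
    # Weighted average of per-decile curves; equal weights for equal-sized deciles.
    weights = np.full(len(fitted), 1.0 / len(fitted)) if weights is None else weights
    return sum(w * scaling_curve(reps, *p) for w, p in zip(weights, fitted))

print(combined_curve(np.array([1.0, 5.0, 10.0]), fitted))  # toy top-30% predictions
```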

Results: The authors rated each model’s performance according to the average across 18 visual tasks, mostly involving classification accuracy (including ImageNet). The more examples a model saw during training, the more its performance benefited from including lower-quality examples alongside the highest-quality ones. Of the models that saw 32 million examples, the one trained on the highest-quality 10 percent of examples performed best. Of the models that saw 128 million examples, the one trained on the highest-quality 30 percent performed best. Of the models that saw 640 million examples, the one trained on the highest-quality 40 percent performed best. These results confirmed the predictions of the scaling curves.

Why it matters: Pretraining vision-language models on only a fixed fraction of the highest-quality examples is not always ideal. A better approach is to choose that fraction based on the available compute budget: train on a small amount of data to fit scaling curves, then use those curves to decide how aggressively to filter, as sketched below.
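
To make that recipe concrete, here is a toy sketch in which each additional pass over the kept pool contributes geometrically less effective data, and keeping more (lower-quality) data raises an error coefficient. The decay factor, the coefficients, and the functional form are assumptions chosen for illustration, not values from the paper.

```python
# Sketch: choose a filtering fraction for a given compute budget under a toy
# decaying-utility model. Each pass over the kept pool contributes `delta` times
# as much effective data as the previous pass; keeping more (lower-quality) data
# raises the error coefficient `a`. All parameters are hypothetical, picked so
# the toy model reproduces the qualitative crossover reported above.

POOL = 128e6                                   # full data pool, as in the paper's setup
DELTA = 0.5                                    # assumed per-repetition utility decay

def predicted_error(frac, budget, a, b=0.3, floor=0.25, delta=DELTA):
    size = frac * POOL                         # examples kept after filtering
    passes = budget / size                     # repetitions of the kept pool
    # Effective samples: geometric sum over passes, capped at the samples actually seen.
    effective = min(size * (1 - delta ** passes) / (1 - delta), budget)
    return a * effective ** (-b) + floor

pools = {0.1: 60.0, 0.3: 70.0, 0.4: 75.0}      # hypothetical quality coefficients

for budget in (32e6, 128e6, 640e6):
    errors = {f: predicted_error(f, budget, a) for f, a in pools.items()}
    best = min(errors, key=errors.get)
    print(f"budget {budget/1e6:.0f}M samples -> best fraction: {best:.0%}")
```

In this toy setup, the small high-quality pool wins when it is repeated only a few times, but once repeats dominate, adding lower-quality fresh data pays off, mirroring the 10/30/40 percent pattern above.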

We're thinking: This work affirms the fundamental principle of Data-centric AI: Systematically engineering training data is essential for getting optimal performance from a given architecture. However, it shows that using only the highest-quality data works best with smaller compute budgets. With more compute, lower-quality data can improve performance more than repeating the highest-quality examples too many times.
