Synthetic Data Distorts Models

Could training on generated output doom AI’s future?

[Illustration: green creatures with confused expressions, surrounded by mirrors creating infinite reflections.]

Training successive neural networks on the outputs of previous networks gradually degrades performance. Will future models succumb to the curse of recursive training?

The fear: As synthetic text, images, videos, and music come to make up an ever larger portion of the web, more models will be trained on synthetic data, and then trained on the output of models that themselves were trained on synthetic data. Gradually, the distribution of the generated training data will deviate ever farther from that of real-world data, leading to less and less accurate models that eventually collapse.
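A toy simulation makes the dynamic concrete. The sketch below is written in the spirit of the model-collapse literature rather than reproducing any paper’s experiment: fit a Gaussian to data, sample a fresh dataset from the fit, refit, and repeat. The sample size and generation count are arbitrary illustrative choices.

```python
# Toy illustration of recursive training (not any paper's exact experiment):
# fit a Gaussian to data, sample a new dataset from the fit, refit, repeat.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)  # stand-in for real-world data

for gen in range(1, 21):
    mu, sigma = data.mean(), data.std()     # "train" a model on the current data
    data = rng.normal(mu, sigma, size=200)  # next generation sees only model output
    if gen % 5 == 0:
        print(f"generation {gen:2d}: mean={mu:+.3f}, std={sigma:.3f}")
```

Nothing in the loop pulls the fitted statistics back toward the original distribution, so estimation noise compounds generation after generation. That unchecked drift is the mechanism behind the feared collapse.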

Horror stories: Many state-of-the-art models are trained on data scraped from the web. The web is vast, but not so vast or diverse that it can supply endless training data for every task. That shortfall tempts developers to train models on data generated by other models, even as the web itself becomes increasingly overrun by synthetic data.

  • Last year, researchers from Oxford, Cambridge, and Imperial College London warned of model collapse in their paper, “The Curse of Recursion: Training on Generated Data Makes Models Forget.” At around the same time, a different study also found that models trained primarily on synthetic data suffered sharp declines in diversity and quality of output.
  • In addition, builders of AI systems have incentives to train their models on synthetic data. It’s easier, faster, and cheaper to generate data than to hire humans to collect or annotate existing data.
  • Generated media is arguably free of copyright, so training on it reduces both the risk of lawsuits and the risk that a model will regurgitate copyrighted material from its training set. Similarly, generated data is less likely to include personally identifiable information, such as medical images, that a model trained on it might later regurgitate, creating a privacy risk.

How scared should you be: Training on synthetic data is at the heart of some of today’s best-performing models, including the Llama 3.1, Phi 3, and Claude 3 model families. (Meta showed that using an agentic workflow with Llama 3 to generate data, rather than generating data directly, yielded data useful for training Llama 3.1.) This approach is essential to knowledge distillation, which produces smaller, more parameter-efficient models. It’s also valuable for building models that perform tasks for which little real-world data is available, for instance machine translation for languages spoken by relatively small populations. Although the authors of “The Curse of Recursion” found that training a series of models, each exclusively on the output of the previous one, rapidly degrades performance, introducing even 10 percent real-world data significantly curbed the decline.
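For a concrete picture of training on model-generated data, here is a minimal sketch of sequence-level distillation using Hugging Face Transformers. The model names are small stand-ins chosen so the script runs on modest hardware, and the prompts and hyperparameters are illustrative placeholders, not anyone’s production recipe.

```python
# Minimal sketch of sequence-level distillation: a larger "teacher" LM
# generates synthetic text, and a smaller "student" LM is fine-tuned on it.
# Model names, prompts, and hyperparameters are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # both models share this tokenizer
teacher = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()
student = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["The key idea behind machine translation is", "Synthetic data helps when"]

# 1) The teacher generates synthetic training examples.
synthetic = []
with torch.no_grad():
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        out = teacher.generate(ids, max_new_tokens=40, do_sample=True, top_p=0.9,
                               pad_token_id=tok.eos_token_id)
        synthetic.append(tok.decode(out[0], skip_special_tokens=True))

# 2) The student is fine-tuned on the generated text with the usual LM loss.
opt = torch.optim.AdamW(student.parameters(), lr=5e-5)
student.train()
for text in synthetic:
    batch = tok(text, return_tensors="pt")
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```

Production pipelines add a step this sketch omits: filtering or verifying the teacher’s outputs before training on them (Meta’s agentic workflow is one example), which is a large part of why synthetic data can help the student rather than hurt it.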

Facing the fear: Model collapse is not a near-term risk, and perhaps not a risk at all, given progress in generating high-quality synthetic data. Still, it makes sense to track the presence of generated data in training datasets and admit it deliberately. The large-scale web dataset Common Crawl captures regular snapshots of the web; if generated data were to inundate the online environment, using an earlier snapshot would exclude a huge amount of it. More broadly, model builders increasingly curate high-quality data, and whether a given example appears to have been generated will become a factor in that curation. Datasets can be filtered using algorithms designed to identify generated content, and wider use of watermarking would make that job easier still. These measures should help developers maintain a healthy balance of real and generated data in training sets for a long time to come.
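As one illustration of that filtering step, here is a hedged sketch that drops documents an AI-text detector confidently flags as machine-generated. The detector checkpoint named below is a real public model, but its labels ("Fake"/"Real"), its reliability on output from current models, and the threshold are all assumptions; substitute whatever detector and cutoff you trust.

```python
# Sketch: drop documents that a detector confidently flags as machine-
# generated. The checkpoint and its "Fake"/"Real" labels are assumptions
# about one public model; swap in any classifier you trust.
from transformers import pipeline

detector = pipeline(
    "text-classification",
    model="openai-community/roberta-base-openai-detector",
)

corpus = [
    "A paragraph scraped from a 2015 news article.",
    "A suspiciously fluent paragraph that might be model output.",
]

def keep(doc: str, threshold: float = 0.9) -> bool:
    """Keep a document unless the detector is confident it was generated."""
    result = detector(doc, truncation=True)[0]
    return not (result["label"] == "Fake" and result["score"] >= threshold)

filtered = [doc for doc in corpus if keep(doc)]
print(f"kept {len(filtered)} of {len(corpus)} documents")
```

Pinning an earlier Common Crawl snapshot works the same way in spirit: choosing a crawl dated before generated text became widespread keeps most of it out of the corpus without any classifier at all.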
