Dear friends,
Last week, the tech news site The Information reported an internal controversy at Google. Engineers were concerned that Google’s Bard large language model was trained in part on output from OpenAI’s ChatGPT, which would have violated OpenAI’s terms of use. The output purportedly was hosted on ShareGPT, a website where users share conversations with ChatGPT. (Google denies the report.) A decade ago, Google accused Microsoft of copying its search results to enhance Bing.
Training a machine learning model on a different model’s output can be a useful technique, but it also raises engineering, business, and legal questions. When is it okay?
Engineering recipes for training learning algorithms on generated data are still being developed. When I led a large automatic speech recognition (ASR) team, there were rumors — that we never proved or disproved — that a competitor was using our system to generate transcripts to train a competing system. It was said that, rather than using our ASR system’s output directly as labeled training data, our competitor used a lightweight process to manually clean up errors and make sure the data was high-quality.
Lately, I’ve seen many developers experiment with use cases such as prompting a large model (say, 175B parameters) to generate high-quality outputs specialized to an application such as customer support, and using this data to fine-tune a smaller model (say, ~10B parameters) that costs less per inference. UC Berkeley trained Koala using data from ShareGPT, and Stanford trained Alpaca by fine-tuning Meta’s LLaMA on data generated with assistance from OpenAI’s text-davinci-003.
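To make the recipe concrete, here is a minimal sketch of that workflow: prompt a large hosted model to generate domain-specific question-answer pairs, then fine-tune a smaller open model on the result. This is not the exact Koala or Alpaca pipeline; the model names, prompts, and hyperparameters below are illustrative assumptions, and the legal caveats discussed later in this letter apply to any data generated this way.

```python
# Sketch: distill a large model's outputs into a smaller, cheaper model.
# All model names and settings here are placeholders for illustration.
from openai import OpenAI
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Step 1: use a large model to generate specialized training pairs.
seed_questions = [
    "How do I reset my password?",
    "Why was my card charged twice?",
]
pairs = []
for q in seed_questions:
    resp = client.chat.completions.create(
        model="gpt-4",  # stand-in for a large (~175B-parameter) model
        messages=[
            {"role": "system", "content": "You are a concise customer-support agent."},
            {"role": "user", "content": q},
        ],
    )
    answer = resp.choices[0].message.content
    pairs.append({"text": f"### Question:\n{q}\n### Answer:\n{answer}"})

# Step 2: fine-tune a smaller open model on the generated pairs.
base = "EleutherAI/pythia-2.8b"  # placeholder for a ~10B-or-smaller model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

def tokenize(example):
    tokens = tokenizer(example["text"], truncation=True,
                       max_length=512, padding="max_length")
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

dataset = Dataset.from_list(pairs).map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="support-model",
                           num_train_epochs=3,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
)
trainer.train()
```

In practice, you would generate thousands of pairs rather than two, filter out low-quality or off-topic generations before training, and compare the fine-tuned small model against the large one on a held-out evaluation set to see how much quality you trade for the lower inference cost.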
Such recipes raise important business questions. You may have spent a lot of effort to collect a large labeled training set, yet a competitor can use your model’s output to gain a leg up. This possibility argues that, contrary to conventional tech-business wisdom, data doesn’t always make your business more defensible. Specifically, if a market leader spent significant resources to get its performance up to a certain level, and if the market leader’s product generates data that makes it cheaper for competitors to catch up, then the market leader’s initial effort spent gathering data is a weak defense against competitors.
In addition, the legal and ethical questions around this practice need clearer answers. OpenAI’s terms of use forbid anyone to “use output from the Services to develop models that compete with OpenAI.” To my mind, this raises legal questions such as:
- If Google or another company has not agreed to OpenAI’s terms of use, and it scrapes text from ShareGPT that someone else shared, is it bound by OpenAI’s terms?
- Are terms that restrict competitors’ access to your services enforceable in light of antitrust and fair-use laws?
(To state the obvious, I am not a lawyer. Don’t construe anything I say as legal advice!)
In the era of generative AI, we’ll see many creative use cases for intentionally using one model to generate data to train another. This is an exciting technical trend, even as we keep in mind the need to move forward in ways that are legal and fair.
Keep fine-tuning!
Andrew
P.S. On Friday, April 7, Yann LeCun and I will hold a live online discussion about a proposed six-month pause in cutting-edge AI research. The proposal raises questions about AI’s future and, if implemented, would have a huge impact on developers and businesses. Please join us.