Online publishers are moving to stop AI developers from training models on their content.
What’s new: Researchers at MIT analyzed websites whose contents appear in widely used training datasets. Between 2023 and 2024, many of these websites changed their terms of service to ban web crawlers, used robots.txt to restrict the pages crawlers may access, or both.
How it works: MIT’s Data Provenance Initiative examined 14,000 websites whose contents are included in three large datasets, each of which contains data from between 16 million and 45 million websites: C4 (1.4 trillion text tokens from Common Crawl), RefinedWeb (3 trillion to 6 trillion text tokens plus image links), and Dolma (3 trillion text tokens).
- The authors segmented each dataset into a head (the 2,000 websites that contributed the most tokens to that dataset) and a tail. The union of the three heads comprised roughly 4,000 distinct high-contribution sites, since content from some sites appears in more than one dataset. To represent the tails, they randomly sampled 10,000 other websites that appear in at least one dataset.
- They examined each website’s terms of service and its robots.txt, a text file that tells web crawlers which pages they may access, for restrictions on use of the website’s content. (Robots.txt operates on the honor system; no mechanism enforces it. A brief sketch of how it works follows this list.)
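To make the mechanism concrete, here is a minimal sketch of a robots.txt that blocks AI-training crawlers while leaving an ordinary search crawler alone, checked with Python’s standard urllib.robotparser. The user-agent tokens shown (GPTBot for OpenAI, CCBot for Common Crawl, Google-Extended for Google’s AI training uses) are real crawler names, but the file contents and paths are our illustration, not details drawn from the study.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt in the pattern the study measured:
# AI-training crawlers are blocked entirely; the general search crawler is not.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant crawler calls can_fetch() before requesting a page.
for agent in ["GPTBot", "CCBot", "Googlebot", "SomeOtherBot"]:
    allowed = parser.can_fetch(agent, "https://example.com/articles/story.html")
    print(f"{agent:>13}  may fetch /articles/story.html: {allowed}")
```

As the study points out, nothing forces a crawler to perform this check; compliance is voluntary.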
Results: In the past year, websites responsible for half of all tokens in the study (text scraped and encoded for use as training data) changed their terms of service to forbid either crawling in general or use of their content to train AI systems. Robots.txt files showed the same shift.
- In April 2023, robots.txt files restricted less than 3 percent of tokens in the head and 1 percent of all tokens in the study. One year later, they restricted around 28 percent of tokens in the head and 5 percent of all tokens.
- Some types of websites are growing more restrictive than others. In April 2023, news websites in the head used robots.txt to restrict 3 percent of their tokens. In April 2024, that number rose to 45 percent.
- Websites are restricting some crawlers far more than others. Sites representing more than 25 percent of the tokens in C4’s head restricted OpenAI’s crawler, while sites representing less than 5 percent of those tokens restricted Cohere’s and Meta’s crawlers. By contrast, sites representing just 1 percent restricted Google’s search crawler. (A sketch of this token-weighted measure follows this list.)
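Here is a minimal sketch of how such a token-weighted measure could be computed, assuming one already knows each site’s token contribution to a dataset and which crawler user agents its robots.txt fully disallows. The domains, token counts, and blocklists below are invented for illustration and are not the study’s data.

```python
# Hypothetical per-site records: tokens contributed to a dataset and the
# crawler user agents that the site's robots.txt fully disallows.
sites = [
    {"domain": "news-a.example",  "tokens": 9_000_000, "disallows": {"GPTBot", "CCBot"}},
    {"domain": "forum-b.example", "tokens": 4_000_000, "disallows": set()},
    {"domain": "blog-c.example",  "tokens": 2_000_000, "disallows": {"GPTBot"}},
]

def restricted_token_share(sites, agent):
    """Fraction of all tokens that come from sites whose robots.txt blocks `agent`."""
    total = sum(s["tokens"] for s in sites)
    blocked = sum(s["tokens"] for s in sites if agent in s["disallows"])
    return blocked / total

for agent in ["GPTBot", "CCBot", "Googlebot"]:
    print(f"{agent:<10} blocked from {restricted_token_share(sites, agent):.0%} of tokens")
```

Weighting by tokens rather than by the number of sites is why a handful of high-volume publishers can account for a large share of restricted training data.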
Behind the news: Data that once was freely available is becoming harder to obtain on multiple fronts. Software developers, authors, newspapers, and music labels have filed lawsuits that allege that AI developers trained systems on their data in violation of the law. OpenAI and others recently agreed to pay licensing fees to publishers for access to their material. Last year, Reddit and Stack Overflow started charging AI developers for use of their APIs.
Yes, but: The instructions in robots.txt files are not considered mandatory, and web crawlers can disregard them. Moreover, most websites have little ability to enforce their terms of use, which opens loopholes. For instance, if a site disallows one company’s crawler, the company may hire an intermediary to scrape the site.
Why it matters: AI systems rely on ample, high-quality training data to attain high performance. Restrictions on training data give developers less scope to build valuable models. In addition to affecting commercial AI developers, they may also limit research in academia and the nonprofit sector.
We’re thinking: We would prefer that AI developers be allowed to train on data that’s available on the open web. We hope that future court decisions and legislation will affirm this.