Bots that scrape websites for AI training data often ignore do-not-crawl requests. Now web publishers can fight back by luring scrapers into AI-generated decoy pages.
What’s new: Cloudflare launched AI Labyrinth, a bot-management tool that serves fake pages to unwanted bots, wasting their computational resources and making them easier to detect. It’s currently free to Cloudflare users.
How it works: AI Labyrinth protects webpages by embedding hidden links to AI-generated decoy pages that look legitimate to bots but are irrelevant to the protected site.
- An unidentified open-source model that runs on Cloudflare’s Workers AI platform generates factual, science-related HTML pages on diverse topics. A pre-generation pipeline scrubs the pages of cross-site scripting (XSS) vulnerabilities before storing them in Cloudflare’s R2 storage platform.
- A custom process embeds links to these decoy pages within a site’s HTML. Meta directives hide the links from search-engine indexers and other authorized crawlers, while additional attributes and styling hide them from human visitors (see the sketch after this list).
- When an unauthorized bot follows one of these links, it crawls through layers of irrelevant content.
- Cloudflare logs these interactions and uses the data to fingerprint offending bots and improve its bot-detection models.
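Cloudflare hasn’t published the markup it injects, but the hiding mechanism described above can be sketched briefly. In this minimal Python example, the decoy URL, attribute choices, and helper name are illustrative assumptions, not Cloudflare’s implementation: a rel="nofollow" hint steers authorized crawlers away, while styling and accessibility attributes keep the link invisible to human visitors.

```python
# Illustrative sketch only: the URL, attributes, and function name are
# assumptions for demonstration, not Cloudflare's actual implementation.

DECOY_URL = "https://example.com/labyrinth/entry-1"  # hypothetical decoy page

def embed_decoy_link(page_html: str) -> str:
    """Insert a hidden link to a decoy page just before </body>.

    - rel="nofollow" tells well-behaved crawlers (search indexers and
      other authorized bots) not to follow the link.
    - aria-hidden, tabindex, and the off-screen style keep the link
      invisible and unreachable for human visitors, so only bots that
      parse raw HTML and ignore crawl directives end up following it.
    """
    decoy_link = (
        f'<a href="{DECOY_URL}" rel="nofollow" aria-hidden="true" '
        'tabindex="-1" style="position:absolute;left:-9999px">'
        'Related research</a>'
    )
    return page_html.replace("</body>", decoy_link + "</body>", 1)

if __name__ == "__main__":
    page = "<html><body><h1>Real content</h1></body></html>"
    print(embed_decoy_link(page))
```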
Behind the news: The robots.txt instructions that tell web crawlers which pages they may access aren’t legally binding, and crawlers can disregard them. However, online publishers are moving to stop AI developers from training models on their content. Cloudflare, as the proxy server and content delivery network for nearly 20 percent of websites, plays a potentially large role in this movement. AI crawlers account for nearly 1 percent of web requests on Cloudflare’s network, the company says.
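For context, honoring robots.txt is purely voluntary. The check that a compliant crawler performs, and that a rogue scraper simply skips, looks roughly like this sketch using Python’s standard-library robots.txt parser (the site URL and user-agent string are hypothetical):

```python
from urllib import robotparser

# Hypothetical site and crawler name, for illustration only.
rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

# A compliant crawler checks before fetching; nothing forces a
# scraper to run this check at all.
if rp.can_fetch("ExampleAIBot", "https://example.com/articles/1"):
    print("Allowed to crawl")
else:
    print("Disallowed -- a polite crawler stops here")
```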
Why it matters: The latest AI models are trained on huge quantities of data gleaned from the web, which enables them to perform well enough to be widely useful. However, publishers increasingly aim to limit access to this data. AI Labyrinth gives them a new tool that raises the cost for bots that disregard instructions not to scrape web content.
We’re thinking: If AI Labyrinth gains traction, no doubt some teams that build crawlers will respond with their own AI models that sniff out its decoy pages. As long as crawlers’ and publishers’ interests are misaligned, and clear, enforceable rules for crawling are lacking, this cat-and-mouse game could go on for a long time.