Reddit to Block Automated Scraping

Reddit announced on Tuesday that it would update its web standard to restrict automated data scraping, after AI startups were found scraping its website for content.

The decision comes as artificial intelligence firms face accusations of plagiarizing content from publishers to produce AI-generated summaries without obtaining permission or providing credit.

Reddit announced that it would revise the Robots Exclusion Protocol, or “robots.txt,” a widely recognized standard that tells automated crawlers which parts of a website they may access.
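Reddit has not published its exact rules, but the minimal Python sketch below (using the standard library's urllib.robotparser) illustrates how the protocol works: a site lists which user agents may crawl which paths, and a compliant crawler checks that policy before fetching a page. The sample policy and bot names here are hypothetical, not Reddit's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt policy: permit one named, approved bot
# and disallow every other crawler across the whole site.
SAMPLE_ROBOTS_TXT = """\
User-agent: ApprovedResearchBot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# A compliant crawler checks the policy before fetching a URL.
print(parser.can_fetch("ApprovedResearchBot", "https://example.com/r/news"))  # True
print(parser.can_fetch("UnknownScraperBot", "https://example.com/r/news"))    # False
```

As the Wired report discussed below highlights, robots.txt is purely advisory: nothing technically stops a crawler from ignoring it, which is why Reddit is pairing the standard with rate-limiting and outright blocking.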

The company also stated that it would continue rate-limiting, a technique that caps the number of requests a single entity can make. Additionally, it will block unknown bots and crawlers from data scraping, that is, collecting and storing raw information from its website.
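Reddit has not described how its rate-limiting works internally; the sketch below shows one common approach, a token bucket, in which each client identity accrues request "tokens" at a fixed rate and further requests are rejected once its bucket is empty. The class, rates, and client identifiers are illustrative assumptions, not Reddit's implementation.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: each client gets `rate` requests
    per second, with bursts of up to `capacity` requests."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per client identity (e.g. API key or IP address).
buckets: dict[str, TokenBucket] = {}

def handle_request(client_id: str) -> str:
    bucket = buckets.setdefault(client_id, TokenBucket(rate=2.0, capacity=5.0))
    return "200 OK" if bucket.allow() else "429 Too Many Requests"
```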

In recent years, robots.txt has become a critical tool that publishers use to prevent technology companies from harvesting their content to train AI models or to generate summaries in response to search queries.

TollBit, a content licensing startup, wrote to publishers last week to warn them that numerous AI firms were circumventing the web standard to scrape publisher sites.

The letter followed a Wired investigation, which found that Perplexity, an AI search startup, likely circumvented attempts to block its web crawler through robots.txt.

Earlier in June, business media publisher Forbes accused Perplexity of plagiarizing its investigative articles for use in generative AI systems without attribution.

On Tuesday, Reddit said that its content would remain accessible to researchers and organizations such as the Internet Archive for non-commercial purposes.
