Search Engine Land · 3 min read

Publishers push Common Crawl to stop collecting content for AI training

Digital Content Next (DCN), a U.S. trade group representing major digital publishers, sent a cease-and-desist letter to the Common Crawl Foundation on March 12, 2024, demanding that it stop scraping and distributing protected publisher content. DCN, whose members include The Associated Press, The New York Times, NBCUniversal, Bloomberg, NPR, and Fox, also requested the removal of its members' content from Common Crawl's datasets, which include paywalled and subscriber-only news articles.

Publishers expressed concerns about the effectiveness of Common Crawl's opt-out mechanisms. DCN's lawyers questioned whether the foundation had actually removed content when asked, pointing to instances where Common Crawl indicated it had complied but later cited technical challenges. DCN argues that copyright law does not operate on an opt-out basis and that Common Crawl has "flagrantly infringed" publisher copyrights by distributing datasets containing protected content, without permission, to companies developing AI tools. DCN CEO Jason Kint said the legal notice challenges the notion that online content can be freely collected and reused simply because it is accessible.

Common Crawl Executive Director Rich Skrenta denied that its bot bypasses paywalls or that the foundation intentionally misled publishers, and said removal requests are processed promptly, within the limits of the dataset's technical design.

The dispute could shape how AI developers acquire training data and how publishers enforce their rights over it.
