Publishers push Common Crawl to stop collecting content for AI training

Digital Content Next (DCN), a U.S. trade group representing major digital publishers, sent a cease-and-desist letter to the Common Crawl Foundation on March 12, 2024, demanding that it stop scraping and distributing protected publisher content. DCN, whose members include The Associated Press, The New York Times, NBC Universal, Bloomberg, NPR, and Fox, also requested that its members' content be removed from Common Crawl's datasets, which include paywalled and subscriber-only news articles.

Publishers raised concerns about the effectiveness of Common Crawl's opt-out mechanisms. DCN's lawyers questioned whether the foundation had actually removed content when asked, citing instances in which Common Crawl indicated it had complied but later pointed to technical challenges. DCN argues that copyright law does not operate on an opt-out basis and that Common Crawl has "flagrantly infringed" publisher copyrights by distributing datasets containing protected content, without permission, to companies developing AI tools. DCN CEO Jason Kint said the legal notice challenges the notion that online content can be freely collected and reused simply because it is accessible.

Common Crawl Executive Director Rich Skrenta denied that the foundation's bot bypasses paywalls or that it intentionally misled publishers, asserting that removal requests are processed promptly in line with the dataset's technical design. The dispute could significantly influence how AI training data is acquired in the future and the rights publishers hold over their content.