US Publishers Demand Common Crawl Stop Scraping Their Content via @sejournal, @MattGSouthern

Digital Content Next sent a cease and desist letter to Common Crawl on May 15, 2024, demanding that the organization stop scraping publisher content and remove protected material from its datasets. The group, which represents 150 premium digital publishers including The New York Times Company and The Wall Street Journal, argues that Common Crawl's large-scale data collection for AI training infringes copyright and devalues original content.

Digital Content Next stated that Common Crawl's datasets, which include vast amounts of copyrighted text and images, are used to train generative AI models that can then produce content competing directly with its members' work. The letter specifically requests that Common Crawl cease all future scraping of publisher websites and remove all copyrighted content from its existing datasets.

The action highlights a growing tension between AI developers, who rely on extensive web data, and content creators seeking to protect their intellectual property and revenue streams. The publishers are concerned that their investment in high-quality journalism and other content is being exploited without compensation or permission, undermining their business models.
