US Publishers Demand Common Crawl Stop Scraping Their Content via @sejournal, @MattGSouthern
Digital Content Next sent a cease and desist letter to Common Crawl on May 15, 2024, demanding that it stop scraping publisher content and remove protected material from its datasets. Digital Content Next, which represents 150 premium digital publishers including The New York Times Company and The Wall Street Journal, argues that Common Crawl's large-scale data collection for AI training infringes copyright and devalues original content. It stated that Common Crawl's datasets, which include vast amounts of copyrighted text and images, are used to train generative AI models that can then produce content competing directly with its members' work. The letter specifically asks Common Crawl to cease all future scraping of publisher websites and to remove all copyrighted content from its existing datasets. The action highlights a growing tension between AI developers who rely on extensive web data and content creators seeking to protect their intellectual property and revenue streams. The publishers are concerned that their investment in high-quality journalism is being exploited without compensation or permission, undermining their business models.
Original source: Search Engine Journal.