According to Mashable, a detailed investigation by The Atlantic reveals that the Common Crawl Foundation has been providing paywalled content from publishers like the New York Times, Wired, and the Washington Post to AI companies including Google, Anthropic, OpenAI, and Meta. Common Crawl’s executive director, Richard Skrenta, told The Atlantic that “the robots are people too” and that AI models should be able to access everything on the internet, despite the foundation’s website claiming it only collects freely available webpages. Publishers have been requesting content removal since becoming aware of the scraping, and Common Crawl has told them in emails that removal was “50 percent, 70 percent, and then 80 percent complete.” However, the investigation found that none of these takedown requests appear to have been fulfilled and that Common Crawl’s archives haven’t been modified since 2016.
Common Crawl’s contradictory position
Here’s where things get really interesting. Common Crawl published a blog post flatly denying that it misleads publishers or bypasses paywalls, claiming its crawler only accesses publicly available pages and doesn’t do “AI’s dirty work.” Yet the foundation has received donations from OpenAI and Anthropic, lists NVIDIA as a collaborator, and actively helps assemble AI training datasets. So which is it: a neutral public archive, or an active enabler of AI companies? The evidence suggests it’s playing both sides.
The publishers’ impossible situation
News organizations are stuck between a rock and a hard place. They can block Common Crawl’s scraper going forward (a two-line robots.txt change, sketched below), but that does nothing about content already scraped over the years. And according to The Atlantic’s investigation, Common Crawl’s file format is “meant to be immutable,” meaning once something is in the archive, it’s effectively permanent. Meanwhile, AI chatbots are directly competing with publishers by serving up their content without sending traffic back. It’s what some are calling the traffic apocalypse, and Common Crawl appears to be fueling it.
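For what it’s worth, the “going forward” part is straightforward: Common Crawl’s crawler identifies itself as CCBot and, per Common Crawl, honors robots.txt. A minimal robots.txt entry asking it to stay out of an entire site looks like this:

```
# robots.txt -- ask Common Crawl's crawler (CCBot) to skip the whole site
User-agent: CCBot
Disallow: /
```

Of course, this only prevents future crawls; it does nothing to pull already-archived pages out of the dataset.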
The bigger copyright war
This isn’t just about one nonprofit. We’re watching the opening salvos of what will likely be a decade-long legal battle over AI training data. OpenAI is already facing lawsuits from the New York Times and Ziff Davis (Mashable’s parent company). The fundamental question is whether scraping publicly accessible web content constitutes fair use, even when that content sits behind a “soft” paywall, the kind that serves the full article to the browser and then hides it with client-side code, which is exactly how a crawler can capture it. Common Crawl positions itself as serving the public good, but when its data primarily benefits trillion-dollar tech companies, that argument starts to look pretty thin.
What happens now?
Common Crawl’s public stance doesn’t match the evidence, and publishers are rightfully furious. The fact that its search tool returns misleading results about what’s actually in its archives suggests it knows this looks bad, especially since anyone can query the underlying public index directly, as sketched below. Meanwhile, AI companies get to maintain plausible deniability (“we just used a public dataset!”) while benefiting from content they’d never pay for directly. The whole situation feels like the digital equivalent of “don’t ask, don’t tell.” And until courts establish clear rules or legislation catches up, this shadow ecosystem will likely keep operating in the gray areas of copyright law.
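Publishers (or anyone else) don’t have to rely on Common Crawl’s own search tool: the foundation also exposes its index through a public CDX API at index.commoncrawl.org. Here’s a minimal Python sketch that checks whether captures of a URL exist in one crawl; the crawl ID below is an assumption, since the live list is published at index.commoncrawl.org/collinfo.json:

```python
# Sketch: look up a URL in Common Crawl's public CDX index.
# The crawl ID is an assumption; see index.commoncrawl.org/collinfo.json
# for the current list of available crawls.
import json
import urllib.error
import urllib.parse
import urllib.request

INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-33-index"  # assumed crawl ID

def lookup(url: str) -> list[dict]:
    """Return CDX records for captures of `url` (empty list if none)."""
    query = urllib.parse.urlencode({"url": url, "output": "json"})
    try:
        with urllib.request.urlopen(f"{INDEX}?{query}") as resp:
            # The CDX API returns one JSON object per line (NDJSON).
            return [json.loads(line) for line in resp.read().splitlines() if line]
    except urllib.error.HTTPError as err:
        if err.code == 404:  # the index reports "no captures found" as a 404
            return []
        raise

if __name__ == "__main__":
    for record in lookup("nytimes.com/section/technology"):
        print(record.get("timestamp"), record.get("status"), record.get("url"))
```

A non-empty result for a page a publisher was told had been removed would be exactly the kind of discrepancy The Atlantic describes.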
