According to Mashable, a detailed investigation by The Atlantic reveals that the Common Crawl Foundation has been providing paywalled content from publishers like the New York Times, Wired, and the Washington Post to AI companies including Google, Anthropic, OpenAI, and Meta. Common Crawl’s executive director, Richard Skrenta, told The Atlantic that “the robots are people too” and that AI models should be able to access everything on the internet, despite the foundation’s website claiming it only collects freely available webpages. Publishers have been requesting content removal since becoming aware of the scraping, and Common Crawl has told them in emails that removal was “50 percent, 70 percent, and then 80 percent complete.” However, the investigation found that none of these takedown requests appear to have been fulfilled and that Common Crawl’s archives haven’t been modified since 2016.
Common Crawl’s contradictory position
Here’s where things get really interesting. Common Crawl published a blog post flatly denying that it misleads publishers or bypasses paywalls, claiming its crawler only accesses publicly available pages and doesn’t do “AI’s dirty work.” Yet the foundation has received donations from OpenAI and Anthropic, lists NVIDIA as a collaborator, and actively helps assemble AI training datasets. So which is it: a neutral public archive, or an active enabler of AI companies? The evidence suggests it’s playing both sides.
The publishers’ impossible situation
News organizations are stuck between a rock and a hard place. They can block Common Crawl’s scraper going forward (a two-line robots.txt change, sketched below), but that does nothing about content already scraped over the years. And according to The Atlantic’s investigation, Common Crawl’s file format is “meant to be immutable,” meaning once something is in the archive, it’s effectively permanent. Meanwhile, AI chatbots are directly competing with publishers by serving up their content without sending traffic back. It’s what some are calling the traffic apocalypse, and Common Crawl appears to be fueling it.
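For what it’s worth, the “going forward” part is straightforward: Common Crawl’s crawler identifies itself as CCBot and, per Common Crawl, honors robots.txt. A minimal robots.txt entry asking it to stay out of an entire site looks like this:

```
# robots.txt -- ask Common Crawl's crawler (CCBot) to skip the whole site
User-agent: CCBot
Disallow: /
```

Of course, this only prevents future crawls; it does nothing to pull already-archived pages out of the dataset.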
The bigger copyright war
This isn’t just about one nonprofit. We’re watching the opening salvos of what will likely be a decade-long legal battle over AI training data. OpenAI is already facing lawsuits from the New York Times and Ziff Davis (Mashable’s parent company). The fundamental question is whether scraping publicly accessible web content constitutes fair use, even when that content sits behind a “soft” paywall, the kind that serves the full article to the browser and then hides it with client-side code, which is exactly how a crawler can capture it. Common Crawl positions itself as serving the public good, but when its data primarily benefits trillion-dollar tech companies, that argument starts to look pretty thin.
What happens now?
Common Crawl’s public stance doesn’t match the evidence, and publishers are rightfully furious. The fact that its search tool returns misleading results about what’s actually in its archives suggests it knows this looks bad, especially since anyone can query the underlying public index directly, as sketched below. Meanwhile, AI companies get to maintain plausible deniability (“we just used a public dataset!”) while benefiting from content they’d never pay for directly. The whole situation feels like the digital equivalent of “don’t ask, don’t tell.” And until courts establish clear rules or legislation catches up, this shadow ecosystem will likely keep operating in the gray areas of copyright law.
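Publishers (or anyone else) don’t have to rely on Common Crawl’s own search tool: the foundation also exposes its index through a public CDX API at index.commoncrawl.org. Here’s a minimal Python sketch that checks whether captures of a URL exist in one crawl; the crawl ID below is an assumption, since the live list is published at index.commoncrawl.org/collinfo.json:

```python
# Sketch: look up a URL in Common Crawl's public CDX index.
# The crawl ID is an assumption; see index.commoncrawl.org/collinfo.json
# for the current list of available crawls.
import json
import urllib.error
import urllib.parse
import urllib.request

INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-33-index"  # assumed crawl ID

def lookup(url: str) -> list[dict]:
    """Return CDX records for captures of `url` (empty list if none)."""
    query = urllib.parse.urlencode({"url": url, "output": "json"})
    try:
        with urllib.request.urlopen(f"{INDEX}?{query}") as resp:
            # The CDX API returns one JSON object per line (NDJSON).
            return [json.loads(line) for line in resp.read().splitlines() if line]
    except urllib.error.HTTPError as err:
        if err.code == 404:  # the index reports "no captures found" as a 404
            return []
        raise

if __name__ == "__main__":
    for record in lookup("nytimes.com/section/technology"):
        print(record.get("timestamp"), record.get("status"), record.get("url"))
```

A non-empty result for a page a publisher was told had been removed would be exactly the kind of discrepancy The Atlantic describes.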
