The New York Times, CNN, USA Today, and The Guardian are among at least 241 news organisations across nine countries that have moved to restrict the Archive’s crawlers, a decision that has left the Archive, in its own director’s words, ‘collateral damage’ in a war that is not really about it.
The Internet Archive has preserved more than one trillion web pages since 1996. Courts cite it. Journalists use it to prove articles were edited after publication. Historians treat it as a primary source. It is, by most measures, one of the most significant public information infrastructure projects of the internet era.
And it is now being systematically blocked by the news publishers whose work it has preserved, because of a problem those publishers are genuinely not wrong about: AI companies are using archived news content to train models without permission or payment.
According to an analysis by AI-detection startup Originality AI, 23 major news publications are blocking ia_archiver, the main web crawler the Internet Archive uses for the Wayback Machine.
In total, 241 news sites across nine countries explicitly disallow at least one of the Archive’s four crawling bots. USA Today Co., the largest newspaper publisher in the US, accounts for a large share of the blocked sites, effectively removing hundreds of local publications from the historical record.
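The mechanics of that blocking are mundane: a publisher adds a Disallow rule for the Archive’s user agents to its robots.txt file, and a compliant crawler checks that file before fetching anything. As a minimal sketch (example.com is a placeholder domain, and ia_archiver is the most commonly cited of the Archive’s crawler names), the check looks like this with Python’s standard library:

```python
from urllib.robotparser import RobotFileParser

# A publisher blocks the Archive with a robots.txt entry such as:
#
#   User-agent: ia_archiver
#   Disallow: /
#
# A compliant crawler reads that file and honours the rule.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder domain
rp.read()

allowed = rp.can_fetch("ia_archiver", "https://example.com/news/some-article")
print("ia_archiver may fetch:", allowed)
```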
The New York Times implemented what Wayback Machine director Mark Graham described as a ‘hard block’ starting in late 2025.
The news organisations’ argument is coherent even if its consequences are troubling. AI companies training large language models need vast quantities of high-quality text.
Archived news content is exactly that: structured, dated, attributed, high-quality writing accumulated over decades. The Internet Archive’s Wayback Machine makes enormous quantities of that content accessible through its APIs and URL interface, making it an ideal source for model training pipelines.
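How low the barrier is can be seen from the Wayback Machine’s public availability endpoint, which returns the closest archived snapshot of any URL as JSON. A minimal sketch follows; the target URL and timestamp are placeholders, and a bulk training pipeline would more likely walk the Archive’s CDX index than call this endpoint one page at a time, but the point is that retrieval is a single unauthenticated HTTP request:

```python
import json
from urllib.request import urlopen

# Ask the Wayback Machine for the snapshot closest to a given date.
# The target URL and timestamp below are placeholders.
endpoint = (
    "https://archive.org/wayback/available"
    "?url=example.com/news/some-article&timestamp=20240101"
)
with urlopen(endpoint) as resp:
    data = json.load(resp)

closest = data.get("archived_snapshots", {}).get("closest")
if closest:
    print(closest["timestamp"], closest["url"])  # archived copy, ready to fetch
else:
    print("No snapshot found.")
```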
A 2023 Washington Post analysis found that data from the Internet Archive had appeared in major AI training datasets. For publishers already engaged in copyright lawsuits against OpenAI, Perplexity, and others, the Archive is a gap in their defences.
“The issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us,” said Graham James, a Times spokesperson.
“The Times invests an enormous amount of resources in producing original journalism, and that work should not be used without our permission.”
The Guardian, which has been more cautious, limited rather than fully blocked the Archive’s access after its own logs revealed the Archive was a frequent crawler.
Robert Hahn, head of business affairs at The Guardian, expressed particular concern about the Archive’s APIs. “A lot of these AI businesses are looking for readily available, structured databases of content,” he said. “The Internet Archive’s API would have been an obvious place to plug their own machines into and suck out the IP.”
Mark Graham, the Wayback Machine’s director, has been consistent in calling this situation exactly what it is. “We are collateral damage,” he said.
The Archive has taken steps of its own: it rate-limits bulk downloads, blocks bulk downloading of certain sites’ material outright, and maintains controls intended to limit large-scale automated extraction.
Graham argues this means the publishers’ rationale for blocking the Archive’s crawlers is “unfounded”: the risk comes from AI companies accessing archived material through the Archive’s interfaces, which the Archive itself controls and limits, not from the Archive crawling and preserving the content in the first place.
The Archive has also been in active dialogue with publishers to find workable arrangements. The Guardian itself said it has been “working directly with the Internet Archive” to implement its access limits, rather than imposing a unilateral hard block.
But the Archive’s position, that it is a neutral preservation institution, not an AI training pipeline, does not fully resolve the publishers’ concern that third parties can access its data regardless of the Archive’s own intentions.
The problem with the publishers’ response is that the instrument they are using, blocking the Archive’s crawlers, has consequences that extend far beyond AI companies.
When a news article is no longer archived, it becomes editable without accountability. Publishers can and do quietly amend stories after publication: correcting errors, softening claims, removing quotes.
The Wayback Machine has been the primary tool journalists use to document those changes. The Electronic Frontier Foundation’s Joe Mullin put the stakes bluntly:
“The Internet Archive often becomes the only source for seeing those changes. There are real disputes over AI training that must be resolved in courts. But sacrificing the public record to fight those battles would be a profound, and possibly irreversible, mistake.”
Wikipedia links to over 2.6 million news articles preserved by the Wayback Machine across 249 languages. Courts have used archived pages as evidence. Journalists have used them to prove government agencies changed official statements after publication.
USA Today Co.’s decision to block access has effectively removed hundreds of local newspapers from the historical record, at a moment when local journalism is already in crisis, and every preserved article represents documentation that may not exist anywhere else.
A petition organised by Fight for the Future, signed by over 100 working journalists, has pushed back against the blocking trend, describing the Wayback Machine as a tool that “preserves the public record at a time where many major media outlets are questioning whether to allow it to do so.”
Nieman Lab reported on the petition in mid-April; the dispute is now escalating rather than resolving.
Yet the Wayback Machine dispute is a compressed version of a structural problem that runs through the entire AI copyright debate. The institutions designed to serve the public interest (a digital library, open web standards, publicly accessible archives) are becoming the path of least resistance for AI companies seeking training data, because the AI companies’ direct scraping is increasingly being blocked, litigated, and metered.
The result is that the more publishers and rights holders resist AI training directly, the more pressure accumulates on the public infrastructure they cannot control.
As Michael Nelson, a computer scientist at Old Dominion University, told Nieman Lab: “Common Crawl and Internet Archive are widely considered to be the ‘good guys’ and are used by ‘the bad guys’ like OpenAI. In everyone’s aversion to not be controlled by LLMs, I think the good guys are collateral damage.”
The EFF concludes that the right response is not to block the Archive but to sue the AI companies directly: the real disputes over AI training, as Mullin put it, ‘must be resolved in courts’.
The publishers have, in fact, done exactly that: the Times’ lawsuit against OpenAI is proceeding. But they appear to have concluded that waiting for courts to resolve those disputes is too slow, and are taking the faster, blunter option of blocking the Archive in the meantime.


