The Internet Archive has been a cornerstone of digital preservation since 1996, archiving over a trillion web pages. Courts rely on its Wayback Machine to verify webpage history; journalists use it to track edits; historians treat it as a primary source. Yet today, this public infrastructure is being systematically blocked by the very publishers whose work it has preserved.
According to an analysis by AI-detection startup Originality AI, 23 major news publications now block ia_archiver, the Archive’s primary crawler. In total, 241 news sites across nine countries explicitly disallow at least one of the Archive’s four crawling bots in their robots.txt files. Gannett, the largest newspaper publisher in the U.S. and the parent company of USA Today, accounts for a large share of the blocked sites, effectively erasing hundreds of local publications from the historical record.
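The blocking mechanism here is the site’s robots.txt file, which names crawlers by user-agent token and disallows them path by path. A minimal sketch of how such a rule behaves, using Python’s standard urllib.robotparser and an illustrative robots.txt (not any real publisher’s file; the URLs are placeholders):

```python
import urllib.robotparser

# Illustrative robots.txt: block the Internet Archive's primary
# crawler entirely while leaving the site open to everyone else.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Allow: /
"""

# RobotFileParser normally fetches robots.txt over HTTP;
# parse() lets us feed it the text directly for a local check.
rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("ia_archiver", "https://example.com/news/story"))   # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/news/story"))  # True
```

Compliance with robots.txt is voluntary, which is part of what makes it a blunt instrument: it stops a rule-abiding archive crawler while doing little against a scraper that simply ignores the file.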
The Core Conflict: Archiving vs. AI Training
The publishers’ rationale is rooted in the rise of generative AI. Training large language models requires vast quantities of high-quality, structured text. Archived news content—dated, attributed, and professionally written—is ideal. The Wayback Machine exposes that content via APIs and URL interfaces, making it an attractive data source for AI companies.
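One concrete example of those interfaces is the Wayback Machine’s public availability endpoint (archive.org/wayback/available), which returns structured JSON describing the closest archived snapshot of a URL. A minimal sketch that builds the query and parses a response of the documented shape rather than making a live request (example.com, the timestamp, and the sample payload are placeholders):

```python
import json
from urllib.parse import urlencode

def availability_query(url, timestamp=None):
    """Build a Wayback Machine availability-API query URL.

    timestamp (YYYYMMDDhhmmss) asks for the snapshot closest
    to that moment; omit it for the most recent snapshot.
    """
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)

# A response of the documented shape (a hand-written sample,
# not a live fetch):
sample = json.loads("""{
  "url": "example.com",
  "archived_snapshots": {
    "closest": {
      "status": "200",
      "available": true,
      "url": "http://web.archive.org/web/20230101000000/http://example.com/",
      "timestamp": "20230101000000"
    }
  }
}""")

closest = sample["archived_snapshots"].get("closest")
if closest and closest.get("available"):
    print("snapshot:", closest["url"])
```

The very simplicity of this interface is the publishers’ complaint: the same one-request-per-URL lookup that serves a fact-checker also serves a bulk harvester.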
The New York Times implemented what Wayback Machine director Mark Graham called a “hard block” starting in late 2025. Graham James, a Times spokesperson, stated: “The issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us. The Times invests an enormous amount of resources in producing original journalism, and that work should not be used without our permission.”
Similarly, The Guardian chose to limit rather than fully block the Archive after discovering it was a frequent crawler of its site. Robert Hahn, head of business affairs at The Guardian, explained: “A lot of these AI businesses are looking for readily available, structured databases of content. The Internet Archive’s API would have been an obvious place to plug their own machines into and suck out the IP.”
The Archive’s Defense and Countermeasures
Mark Graham has been consistent in calling this situation exactly what it is: “We are collateral damage.” The Archive argues that it is a neutral preservation institution, not an AI training pipeline, and it has taken countermeasures against bulk extraction: rate-limiting downloads, blocking automated bulk operations, and restricting large-scale access through its public interfaces.
The Archive also maintains active dialogue with publishers. The Guardian itself said it has been “working directly with the Internet Archive” to implement access limits rather than imposing a unilateral hard block. Still, the Archive’s assurances do not fully resolve the publishers’ concern that third parties can access its data regardless of the Archive’s own intentions.
Collateral Damage to the Public Record
The instrument publishers are using—blocking the Archive’s crawlers—has consequences that extend far beyond AI companies. When a news article is no longer archived, it becomes editable without accountability. Publishers can and do quietly amend stories after publication: correcting errors, softening claims, removing quotes. The Wayback Machine has been the primary tool journalists use to document those changes.
The Electronic Frontier Foundation’s Joe Mullin put the stakes bluntly: “The Internet Archive often becomes the only source for seeing those changes. There are real disputes over AI training that must be resolved in courts. But sacrificing the public record to fight those battles would be a profound, and possibly irreversible, mistake.”
Wikipedia links to over 2.6 million news articles preserved by the Wayback Machine across 249 languages. Courts have used archived pages as evidence. Journalists have exposed government agencies that changed official statements after publication. Gannett’s decision to block access has effectively removed hundreds of local newspapers from the historical record—at a moment when local journalism is already in crisis, and every preserved article represents documentation that may not exist anywhere else.
Petition and Pushback
A petition organized by Fight for the Future, signed by over 100 working journalists, pushes back against the blocking trend. It describes the Wayback Machine as a tool that “preserves the public record at a time where many major media outlets are questioning whether to allow it to do so.” Nieman Lab reported on the petition in mid-April; the dispute is escalating rather than resolving.
Michael Nelson, a computer scientist at Old Dominion University, told Nieman Lab: “Common Crawl and Internet Archive are widely considered to be the ‘good guys’ and are used by ‘the bad guys’ like OpenAI. In everyone’s aversion to being controlled by LLMs, I think the good guys are collateral damage.”
The EFF concludes that the right response is not to block the Archive but to sue the AI companies directly, resolving those disputes in court. The publishers have, in fact, done exactly that: the Times’ lawsuit against OpenAI is proceeding. But they appear to have concluded that waiting for the courts is too slow, and are taking the faster, blunter option of blocking the Archive in the meantime.
This dispute is a compressed version of a structural problem that runs through the entire AI copyright debate. The institutions designed to serve the public interest—a digital library, open web standards, publicly accessible archives—are becoming the path of least resistance for AI companies seeking training data, because the AI companies’ direct scraping is increasingly being blocked, litigated, and metered. The result is that the more publishers and rights holders resist AI training directly, the more pressure accumulates on the public infrastructure they cannot control. The fate of the Wayback Machine may ultimately become a cautionary tale about how protecting intellectual property in the age of AI can inadvertently erode the very foundations of an open, accountable internet.
Source: TNW | Media News