
340 Local Newspapers Just Locked Out the Internet Archive

The average webpage lives about 100 days, and that’s a generous estimate. The institution that has spent three decades fighting that decay just lost access to a huge chunk of American local journalism: roughly 340 regional news sites have quietly blocked the Internet Archive’s crawler. When today’s reporting won’t exist tomorrow, who owns the public record? That question isn’t academic anymore.

What actually happened

The Internet Archive, founded by Brewster Kahle in 1996, runs the Wayback Machine — the closest thing the web has to a national library. It currently holds snapshots of more than 916 billion pages.

Recent reporting shows that roughly 340 US news sites have added entries to their robots.txt files blocking ia_archiver. This isn’t a handful of cranky publishers. It includes papers owned by Gannett, Tribune Publishing, and McClatchy — three of the largest local newspaper chains in the country. When the giants move together, it’s a policy, not a coincidence.
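For readers who have never opened a robots.txt file, here is a minimal sketch of what such a block looks like and how a well-behaved crawler would interpret it. The sample file and the example-gazette.test URL are invented for illustration and are not copied from any publisher; ia_archiver is the user-agent name cited above. The check uses Python’s standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; not taken from a real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /admin/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://example-gazette.test/news/city-council-vote"
    for agent in ("ia_archiver", "SomeOtherBot"):
        verdict = "allowed" if is_allowed(SAMPLE_ROBOTS_TXT, agent, story) else "blocked"
        print(f"{agent}: {verdict}")
```

Two lines are enough to shut one crawler out of an entire site while leaving everyone else untouched. Note that robots.txt is a request, not an enforcement mechanism: it only works on crawlers that choose to honor it.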

Why publishers are pulling up the drawbridge

Three reasons keep coming up, and each has some logic to it.

AI training data. Publishers don’t want their archives feeding the next GPT or Claude without compensation. Given that the Internet Archive’s corpus has been used in various LLM training pipelines, blocking the crawler is a blunt but effective lever. The New York Times is suing OpenAI over exactly this issue; smaller papers are choosing prevention over litigation.

Paywall protection. If you charge for content, the Wayback Machine looks suspiciously like a free back door. Reddit threads cheerfully passing around archive.org links to bypass paywalls don’t help the publishers’ mood.

Liability and the unfixable past. Once a story is in the Wayback Machine, later retractions and corrections don’t overwrite the earlier snapshot, so the original version lives on. For papers worried about defamation suits, or about the modern expectation that mistakes be quietly memory-holed, that permanence is a liability. It’s a feature for historians and a bug for legal departments.

What we lose when local archives go dark

Local newspapers aren’t just news. They’re the primary source material for everything that happens below the national radar: city council votes, court reporting, obituaries, the long tail of who-did-what-where. Historians, genealogists, civil rights lawyers, and accountability journalists all lean on these archives.

And local news is already in freefall. The US has lost roughly 2,900 newspapers since 2005, with two more vanishing every week, according to research from Northwestern’s Medill School. “News deserts” is no longer a metaphor — it’s a measurable geography. When a paper folds and the domain expires, its website evaporates. Without the Wayback Machine, decades of community history go with it.

The Internet Archive’s own bad year

The crawler blocks aren’t happening in a vacuum. In 2024 the Archive lost its appeal in a major copyright case brought by book publishers over its digital lending, including the pandemic-era National Emergency Library. The same year, it was hit with a massive DDoS attack and a data breach exposing 31 million user records.

The deeper problem: a single nonprofit in San Francisco has been doing the work of a national library system, on a shoestring, for thirty years. That was always fragile. Now the fragility is showing in public.

Who should actually be doing this?

This is where the US falls behind. The UK’s British Library, France’s BnF, and Korea’s National Library (via the OASIS project) all run statutory web archiving programs — they’re required by law to preserve the national web. The US Library of Congress does some of this, but the scope is narrow.

Hacker News commenters keep pointing out the obvious: a nonprofit shouldn’t be the sole backstop for the country’s digital memory. But that’s been the arrangement, by default, since 1996.

The takeaway

Publishers blocking the Archive are making rational individual choices. The sum of those choices is erasing the public record. That’s not a technical problem — no clever tool fixes it. It’s a question of who we think is responsible for remembering, and we haven’t answered it.

Try this: pick a local news story you read five years ago and try to find it now. The result will tell you more about the state of digital preservation than any policy paper.
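If you would rather script that experiment, the following is a rough sketch that asks the Wayback Machine whether it holds a snapshot of a given URL. It assumes the Archive’s public availability endpoint at archive.org/wayback/available and the JSON shape it has returned in the past (an archived_snapshots object with a closest entry); treat both as assumptions that may change. The function names are made up for this example.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(page_url: str, timestamp: Optional[str] = None) -> str:
    """Build the lookup URL; timestamp is an optional YYYYMMDD string."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Pull the closest snapshot URL out of a response, or None if absent."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url: str, timestamp: Optional[str] = None) -> Optional[str]:
    """Query the live service (requires network access)."""
    with urlopen(build_query(page_url, timestamp), timeout=10) as response:
        return closest_snapshot(json.load(response))


if __name__ == "__main__":
    # Replace with the address of the story you remember reading.
    print(lookup("example.com/news/some-old-story", "20200101"))
```

A result of None means no snapshot was found for that address, which is the outcome this article is worried about.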

