I'm sure that plays a role, but still... This obviously is about cost and money making, not security as a whole (ime)
So their argument is that people who would be paying money at their paywalls, are going to IA to get their news for free? And if they can thwart those people, they'll show up and become monthly subscribers?
I am vaguely sympathetic to newspapers as a concept, though the actual owners of approximately all of them are just PE companies looking to extract maximum profit from this dying industry, not really trying to prolong their existence.
But I think everyone who is interested in subscribing to their newspapers' paywalls already has subscribed. Those of us who bypass paywalls with that archive.whatever site, or apparently IA (I have never tried it for this purpose) are doing so because there is zero chance we're going to (recurringly!) pay the asking price for some random out-of-town newspaper, The Verge, Bloomberg, whatever. It's fair game to call us immoral for that decision, but if (and it's a big if) this move prevents more people from being able to bypass a paywall, I predict zero incremental dollars will go to the news publishers.
It is not hard to imagine a future in 50 years' time where a huge percentage of this content is lost forever, or at best incredibly hard to find.
This trend of outright banning the Internet Archive has me extremely worried. I fear a future where news articles are memoryholed, and no one can remember exactly what was reported and how sensational it all seemed.
I've been working on this project [0] for a while. Originally, I started with a tool that would allow people to snapshot webpages in their own browser, and they could selectively share their snapshots. Then by consensus, everyone could understand what exactly had changed, and they could draw their own conclusion about why.
While working on it, I realized that an authoritative answer to "what did it look like on $DATE" can't be produced by a no-name company. It's gotta be a non-commercial entity that's got a track record of integrity. The dream would be to allow MemoryHole customers to submit their snapshots to the Internet Archive (or other non-commercial entity). It's definitely a copyright nightmare - so no clue how this could work.
[0] - https://memoryhole.app
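As a rough illustration of the "what exactly had changed" idea, here is a minimal sketch that diffs the text of two snapshots of the same page using Python's standard difflib. The function name and labels are hypothetical, and this is not a description of how the project above actually works.

```python
import difflib


def snapshot_diff(old_text, new_text, old_label="snapshot-a", new_label="snapshot-b"):
    """Return unified-diff lines describing how new_text differs from old_text."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Hypothetical article text captured at two different times.
    before = "Headline\nMayor denies claim\n"
    after = "Headline\nMayor confirms claim\n"
    for line in snapshot_diff(before, after):
        print(line)
```

Two people holding independent snapshots could run the same comparison and check that they see the same removed and added lines.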
Now most of those who spend money get access to relatively good news in comparison to those who don't. The interesting thing is that if you model the utility of a customer base as trifactorial (subscriptions, ad-supported, influence-ability) and you set ad-support to near zero you're left with this situation where those with no ability to pay are now overwhelmingly useful to the website provider only as an influenceable base.
"If you're not paying, you're not the customer, you're the product", we used to say[0]. It turns out that's true, but if you can't pay by looking at ads, you will pay by the actions you take when you believe what the actual customer wants you to believe.
0: Though sometimes you do pay and you're still "the product" haha!
Redditors then had the gall to pretend like it wasn’t their number one use case.
The next natural thing to happen would be privatization or consolidation of the internet itself. It's already happening in the form of grabbing and consolidating IPv4 addresses.
But I think this will hurt them more than help as time goes on. IIRC, one news org blocked free access and their revenue fell. I think that was in Australia.
But it seems they are using AI as the reason, so allowing access after a week would not prevent AI access.
But what happens if an AI company subscribes to the news site using a person's name (or a fake name)? They will still get the article and avoid hassles.
Obviously, a business needs to have an income, but it's becoming more common for businesses to function first and foremost as revenue generators, and the thing that enables that is only seen as a means to an end. When the quality of the product/service and its function as a revenue generator diverge, the product/service will always take second chair.
Maybe we could argue that the primary product is the revenue, especially when there are investors involved who are looking for big returns.
It could work as a decentralized free and open source system that doesn't care about copyright. Like how torrents work now, but it would be good to have it work over Tor or something. Perhaps as a DAO for the management aspect of it. I don't know how exactly. But disregarding copyright by using a centralized company is the wrong idea.
Or you can do the lawful approach and try to work within the framework of that copyright nightmare. But "fuck copyright" is an easier path.
https://www.uh.edu/news-events/stories/052815watchingtvracia...
https://www.mediamatters.org/legacy/video-what-happens-when-...
Historically-speaking, if your local news can twist the context to make you easier to sell to (products, services, ideologies), they will do that.
There is no media theory of what happens when information explodes beyond the capacity of the system to consume it. (A UN report on the attention economy says less than 1% is actually consumed by humans.)
So media orgs, instead of coming up with one, just keep mindlessly doing what they know how to do: generate more info. Platforms and corps subsidize this activity for their own interests.
So media orgs have no signal, or only warped signals, of how useless what they are doing is.
Is there a way to export/download my saves in a reasonable way?
Blocking archiving in a flailing attempt to keep AIs away is extremely shortsighted. Archiving is important for keeping historical context, especially when it comes to news and journalism.
One of the tests for Fair Use in the US, as I understand it, would be whether the archived work "competes" with the original.
If people start going to IA instead to read the news, the newspaper might have a claim. But if they're doing it to get around paywalls, or purely for archival/historical/research purposes, that may be allowed.
But the reality is such decisions are subjective and will be up to whatever judge happens to get such a case in front of them if this is challenged.
It looks like this:
├── files
│   └── 632daffb-2f4f-4795-bb4d-3149d24f4264
│       ├── original.html
│       ├── readerview.html
│       └── screenshot.png
├── manifest.json
└── metadata.csv
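For anyone wanting to script against an export like this, here is a minimal sketch that walks the `files` directory and reports which of the three per-snapshot files are present. It relies only on the layout shown above; the function name is hypothetical, and nothing is assumed about what `manifest.json` or `metadata.csv` contain.

```python
from pathlib import Path

# File names taken from the export tree shown above.
EXPECTED_FILES = ("original.html", "readerview.html", "screenshot.png")


def list_snapshots(export_root):
    """Map each snapshot directory under files/ to the expected files it contains."""
    files_dir = Path(export_root) / "files"
    snapshots = {}
    if not files_dir.is_dir():
        # No files/ directory: treat the export as empty rather than failing.
        return snapshots
    for snapshot_dir in sorted(p for p in files_dir.iterdir() if p.is_dir()):
        snapshots[snapshot_dir.name] = [
            name for name in EXPECTED_FILES if (snapshot_dir / name).is_file()
        ]
    return snapshots


# Usage (the path is an assumption): list_snapshots("path/to/export")
```

A snapshot whose list is shorter than three entries is missing one of the expected files.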
One possible solution that I can think of for the long-term good could be to just allow archival but no retrieval of the latest information, at least for 6 months or a year. This should theoretically satisfy most goals.
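To make the embargo idea concrete, here is a minimal sketch of the check an archive could apply before serving a capture. The function name and the 180-day default (standing in for "6 months") are assumptions for illustration, not anything an archive is known to implement.

```python
from datetime import datetime, timedelta, timezone


def is_retrievable(captured_at, now, embargo=timedelta(days=180)):
    """Return True once a capture is at least as old as the embargo window."""
    return now - captured_at >= embargo


if __name__ == "__main__":
    captured = datetime(2026, 1, 1, tzinfo=timezone.utc)
    # The capture is stored immediately, but only served after the window passes.
    print(is_retrievable(captured, datetime(2026, 5, 20, tzinfo=timezone.utc)))
    print(is_retrievable(captured, datetime(2026, 7, 1, tzinfo=timezone.utc)))
```

The point of the design is that archiving happens at publication time, so nothing is lost, while public access is delayed.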
The torrent approach is nice. I could imagine a selfhosted way to store the data (for a group of people)
Linkwarden does this well. You can share a collection for a small group of people.
May 20, 2026, 5:03 p.m.
McClatchy, Advance Local, Tribune Publishing and other major newspaper chains are restricting the nonprofit’s archiving bots.
In January, Nieman Lab broke the story that major news publishers — including The New York Times, The Guardian, and USA Today Co. — had started blocking the Internet Archive due to concerns that AI companies might scrape the nonprofit’s repositories for training data.
No news publisher has confirmed to Nieman Lab that an AI company has already scraped their content from the Wayback Machine. Still, in the five months since we published our story the number of news sites blocking the Internet Archive has continued to rise.
Overwhelmingly, these sites are local news outlets.
Our new analysis shows that more than 340 local news sites across the United States are now limiting the Internet Archive’s ability to access and preserve their stories. Many sites in our sample are owned by five of the seven largest local news publishers in the country: USA Today Co., McClatchy, Advance Local, MediaNews Group, and Tribune Publishing. The latter two are both subsidiaries of the “vulture hedge fund” Alden Global Capital.
Researchers, historians, and citizens around the world rely on the web archives of local news sites to do their work.
“Blocking the Internet Archive’s web crawlers threatens one of the most effective ways that we capture and store news content for the long term,” Edward McCain, a journalism librarian at the University of Missouri, said. “In the present we may have some workarounds, but in the long run, it weakens a vital link in primary source materials that we need to understand where we’ve been and where we want to go.”
Working journalists are among the most frequent users of the Wayback Machine’s local news archives. Over the last month, online petitions have called for news media companies to allow the Internet Archive to preserve their journalism.
“I cover news within a larger news desert in New York’s Rockland, Sullivan, and Rockland counties. This means I need to heavily rely on archival data of old news articles from now deceased, or zombie-fied, media outlets,” wrote B.J. Mendelson, the editor of The Monroe Gazette newsletter, in one recent petition signed by over 200 journalists. “Without the Internet Archive, my [work] would be incredibly difficult to do.”
In the face of publisher concerns, the Wayback Machine has highlighted its efforts to minimize abuse of its site, including implementing systems that limit bulk downloading and working with vendors like Cloudflare to monitor bot activity. “We are in conversation with many publishers and appreciate the opportunity to address their concerns,” Mark Graham, the founder of the Wayback Machine, told Nieman Lab, noting that the Internet Archive’s terms of use only permits using its collections for scholarship or research purposes.
Meredith Broussard, a data journalist and professor at New York University, said that as profit margins for news thin, it’s only become more important to news publishers to protect their intellectual property.
“This is the same fight that everybody has been having with the Internet Archive since its inception,” Broussard said. “Internet Archive is a very old-school, ‘information-should-be-free’ organization. But the people who are invested differently have different priorities. There are lots of different historical and legal and economic issues that are colliding in this situation. AI companies [are] the catalyst for the latest skirmish in a very old battle.”
In January, Nieman Lab used journalist Ben Welsh’s database of 1,167 news websites’ robots.txt files to determine which sites were disallowing the Internet Archive. At the time, the Internet Archive did not respond to requests to confirm which crawling bots it was using, so we identified four bots that the AI user agent watchdog service Dark Visitors had associated with them. (You can find our full methodology here.)
We found that 241 news websites disallowed at least one Internet Archive-affiliated crawling bot. About 80% of these sites belonged to USA Today Co., the company formerly known as Gannett.
By May, we found that an additional 141 news websites disallowed at least one Internet Archive-affiliated bot, increasing the total number of sites in our sample to 382. Some of these additions appeared in Welsh’s database. We found others by checking robots.txt files ourselves. Our final sample includes sites in 10 countries, though the vast majority (93%) are based in the United States.
Of the 382 news sites in our updated sample, 342 are local. Of course, our data doesn’t include all the local news outlets in the United States, but it shows that many of the country’s largest local news publishers are at least attempting to limit Internet Archive access.
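The figures above can be cross-checked with a few lines of arithmetic. This is only a sketch using the counts quoted in the text; the variable names are ours, and the local share is a derived figure rather than one reported in the analysis.

```python
# Counts quoted in the text above.
january_blocking = 241   # sites disallowing at least one bot in January
added_by_may = 141       # additional sites found by May
local_sites = 342        # local outlets in the updated sample

total = january_blocking + added_by_may
local_share = local_sites / total

print(total)                          # 382
print(round(local_share * 100, 1))    # 89.5
```

So roughly nine in ten of the sites in the updated sample are local outlets, consistent with the "more than 340 local news sites" figure.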
The scraping bots we tracked in our new analysis are Heritrix, My-heritrix-crawler, heritrix/3.3.0, Archive-It, archive.org_bot, ia_archiver-web.archive.org, and Special_archiver. (We included Archive-It, archive.org_bot, ia_archiver-web.archive.org, and Special_archiver in our January analysis. After confirming that the bot Heritrix and its variations belong to the Internet Archive, we added them.)
Graham told Nieman Lab that the Wayback Machine doesn’t use the bots “ia_archiver,” “ia_archiverbot” or “ia_archiver-web.archive.org.”
Third-party websites and internet forums have regularly documented “ia_archiver-web.archive.org” as an alleged user agent of the Wayback Machine. We continue to include “ia_archiver-web.archive.org” in our dataset because news publishers are disallowing the bot under the assumption that it is used by the Internet Archive.
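A check of this kind can be approximated with Python's standard `urllib.robotparser`. The sketch below is not Nieman Lab's actual methodology: the function name and the sample robots.txt are hypothetical, and the standard-library parser matches user agents loosely (case-insensitive substring matching on the name before any "/"), so its results may differ from how a given crawler reads the same file.

```python
from typing import List
from urllib.robotparser import RobotFileParser

# User agents named in the analysis above.
IA_AGENTS = [
    "Heritrix",
    "My-heritrix-crawler",
    "heritrix/3.3.0",
    "Archive-It",
    "archive.org_bot",
    "ia_archiver-web.archive.org",
    "Special_archiver",
]


def blocked_agents(robots_txt: str, agents=IA_AGENTS, path: str = "/") -> List[str]:
    """Return the agents that the given robots.txt text disallows from `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, path)]


# Hypothetical robots.txt that singles out one Internet Archive user agent.
SAMPLE = """\
User-agent: archive.org_bot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    print(blocked_agents(SAMPLE))  # ['archive.org_bot']
```

A site counts as "disallowing at least one Internet Archive-affiliated bot" in this sketch if the returned list is non-empty. Note that robots.txt only states a request to crawlers; it does not by itself show whether a site also blocks them at the network level.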
Our full dataset can be viewed in the table below:
At least 13 Advance Local news sites, including The Cleveland Plain Dealer (Cleveland.com), The Patriot-News (PennLive.com), and The Oregonian (OregonLive.com), have added the Internet Archive’s user agents in their robots.txt files.
Advance Local — a subsidiary of Advance Publications, the Newhouse family-owned media giant — confirmed to Nieman Lab it began hard-blocking the Internet Archive last August. It took the action preemptively, without evidence that its content had been scraped by an AI company via the Wayback Machine.
“This is part of a broader effort to protect the value of our published work from unfair third‑party use. This decision is not specific to the Wayback Machine,” said Christine deWit, a spokesperson for Advance Local, in a statement.
Alden Global Capital is another major owner of local news chains that has rolled out new restrictions on the Internet Archive. About 60 of the sites in our sample are owned by MediaNews Group, the Alden subsidiary that operates dailies across the country, including The Mercury News, the Denver Post, and the New York Daily News. Another seven publications are operated by Tribune Publishing, most notably the Chicago Tribune.
Alden has been criticized for aggressively acquiring U.S. newspapers and stripping them of resources for short-term profits. Alden did not respond to requests for comment.
In July 2025, Alden ran an editorial in more than 60 of its daily newspapers openly criticizing OpenAI and other AI companies that have used news content to train their models without compensation. “Securing permission from, and fairly compensating, those publishers who created this great foundation of knowledge is the right, just and American thing to do,” read the editorial. Both Alden publishers are part of the major copyright infringement suit against OpenAI and Microsoft that includes The New York Times and is currently winding its way through federal court.
Some independent local publishers, like The Baltimore Banner, are open to AI chatbots surfacing their stories without licensing deals. But they’re still concerned that a “back door” like the Wayback Machine’s might hurt their chances at being cited properly.
Last year, The Banner worked with the company DataDome to analyze crawler activity on its site. The findings were striking: about 25% of The Banner’s site traffic was coming from bots, including crawlers operated by the Internet Archive, according to Biswajit Ganguly, the chief technology officer and AI strategist at the Banner.
Based on that analysis, The Banner started blocking the Internet Archive, later adding one of its crawlers to its robots.txt file. It still lets major AI companies through, including crawlers used by ChatGPT and Claude.
As Ganguly explains it, the new restrictions on the Wayback Machine are less about negotiating licensing deals or preventing The Banner’s stories from appearing in AI products, and more about ensuring those products trace information back to The Banner instead of linking to sites that aggregate its work.
“We didn’t want the bots to be trained on our content, and then spit out answers based on the content without any kind of references, link, or attribution to our sources,” said Ganguly. “If ChatGPT finds something in the Wayback Machine…we were not sure how well it would be attributed back to us.”
He added that The Banner is still gathering information on how AI search products interact with news about the Baltimore region and the publication is open to lifting its block down the line.
“The threat is definitely not the Internet Archive,” Ganguly said. “But it’s a question of how the other actors are going to provide references or attributions and links back to the real creator of the content.”
Local publishers aren’t the only ones ramping up these efforts. Condé Nast, another arm of Advance Publications, has rolled out a coordinated effort to disallow the Internet Archive. Vogue, The New Yorker, Pitchfork, Vanity Fair, Bon Appetit, and Wired currently disallow four crawling bots from our list. (Last month, Wired covered the existential threat these blocks pose to the Internet Archive.) Condé Nast did not respond to a request for comment.
The Atlantic has been working with Cloudflare to block the Internet Archive since last summer and added one of the Internet Archive’s crawlers to its robots.txt file in an update earlier this year, according to Anna Bross, The Atlantic’s SVP of communications. She said the decision is part of the outlet’s “aggressive” blocking policy.
“Our default is to block: No one should be scraping The Atlantic’s journalism without permission, regardless of the use,” Bross said.
The Atlantic’s CEO Nick Thompson commented on our January reporting in a video posted to LinkedIn in April. He said blocking the Internet Archive is important for publishers that want to maintain leverage when negotiating licensing with big AI companies.
“Because of the damages that can be done when you let all your content be scraped, because of all the leverage you lose, there will be worthy products that you previously gave your data to and now you can’t,” said Thompson.
Major international publishers have also started to block the Internet Archive, including the leading newspaper in Brazil, Folha de S.Paulo. Folha added three Internet Archive user agents to its robots.txt file in February.
“Folha believes that the sustainability of professional journalism — the very material the public record seeks to preserve — depends on protecting intellectual property,” said Sérgio Dávila, Folha’s editor-in-chief. “If AI companies wish to use this archive for training, they must enter into licensing agreements rather than rely on third-party repositories.”
Dávila noted that Folha invests in its own digital archive, Acervo Folha, which includes digitized editions of print issues going back to the paper’s founding in 1921. Access to Acervo Folha is available to paying subscribers.
Archiving is expensive; the technical infrastructure, storage, and expertise can be cost-prohibitive to smaller news organizations.
Before the rise of digital news, many papers maintained physical archives, often staffed with in-house librarians. Today, due to the contraction of the newspaper industry, most of those dedicated archiving roles are gone and the move to digital publishing has only complicated the issue.
A new content management system (CMS) can often lead to major archival losses. In 2024, thousands of articles vanished from the sites of the Daily Hampshire Gazette and the Greenfield Recorder in Western Massachusetts during a CMS switch. When publications close many former owners don’t want to shoulder the cost of maintaining a site. In 2022, a decade after The Hook, a Charlottesville weekly, went under, its archived site went offline, along with over 22,000 stories.
The Internet Archive is often touted as a hero of the web for taking on the Herculean task of preserving the entirety of the internet, and for stepping in when news organizations fail to preserve their own work.
In December, the Internet Archive partnered with the Poynter Institute and Investigative Reporters and Editors to train a cohort of 33 local and national news outlets on how to develop and implement an archiving strategy. The initiative, funded through a Press Forward grant, aims to train 300 newsrooms in digital preservation and in using the Internet Archive’s services by the end of 2027.
Most of the initial cohort is made up of independent and nonprofit local newsrooms, including Outlier Media, Charlottesville Tomorrow, and The 51st. Wired is the only publication in our dataset restricting Internet Archive access that is participating in the program.
As Broussard, the NYU professor, points out, while the Internet Archive is one of the few efforts to make archives free, it isn’t the only effort to archive news. News publishers have long licensed their journalism to commercial archives like ProQuest and LexisNexis, which are often available in libraries, universities, and for individual subscriptions. They’re not free, but they do exist. At least several publications in our sample appear in ProQuest databases, including the Chicago Tribune, The Baltimore Sun, Honolulu Civil Beat, and USA Today.
Economic incentives are a valid reason for publishers to want to keep their contents out of the Internet Archive, Broussard said, but news outlets should have a long-term, multifaceted preservation strategy. Even with a plan in place, the reality for many publishers is that it’s unlikely that they’ll be able to save everything.
“Every news organization, especially local news organizations, generally launch thinking, ‘we’re going to put stuff on the internet and it’s going to be there forever,’ and that’s not true,” Broussard said. “Anybody who told you the internet is forever lied.”
Correction: An earlier version of this story stated that NOLA.com was owned by Advance Local. It is currently owned by Georges Media Group.
Photo of Internet Archive servers by Scott Beal/Laughing Squid used under a Creative Commons license.
Andrew Deck is a staff writer covering AI at Nieman Lab. Have tips about how AI is being used in your newsroom? You can reach Andrew via email, Bluesky, or Signal (+1 203-841-6241).