News outlets are limiting the Internet Archive’s access to their journalism

That's a real shame. I am involved with some history-related projects and the number of websites which go offline is huge, and the Wayback Machine is incredibly helpful for unearthing these dead sites.

It is not hard to imagine a future in 50 years time where a huge percentage of this content is lost forever, or at best incredibly hard to find.

Ugh - our local paper used to have a wonderful archive that got limited and locked down after the pandemic. IDK if they got bought out, but it's a real shame. I think part of the problem is things that used to be public information (birthdates, families, names) in hospital admissions. For example, I found old newspaper entries for my friends' parents and my own for being "in the hospital".

I'm sure that plays a role, but still... This obviously is about cost and money making, not security as a whole (ime)

That's okay. The AI knows everything now, and forevermore. Farewell, IA.

There really should be a micropayments setup on the internet that's not advertising-based. Let these models pay a nickel to read the article, covered by the multi-trillion-dollar AI blank check.

It's interesting how much we lost with the end of the advertising model (though likely its death would arrive with agentic access anyway). An unsurprising reaction to that was the advent of the widespread paywall. And in a world where every paywalled article on social media, including HN, is on an archived paywall-bypass site there was going to be a natural cat-and-mouse game. The distributed payment model of online advertising was surprisingly effective. No single person was worth very much but the aggregate of attention had a probabilistic conversion that enabled a sufficient ecosystem of news.

Now most of those who spend money get access to relatively good news in comparison to those who don't. The interesting thing is that if you model the utility of a customer base as trifactorial (subscriptions, ad-supported, influence-ability) and you set ad-support to near zero you're left with this situation where those with no ability to pay are now overwhelmingly useful to the website provider only as an influenceable base.

"If you're not paying, you're not the customer, you're the product", we used to say[0]. It turns out that's true, but if you can't pay by looking at ads, you will pay by the actions you take when you believe what the actual customer wants you to believe.

0: Though sometimes you do pay and you're still "the product" haha!

They should allow access after the news becomes old. That's what the archive is intended for.

> "as profit margins for news thin, it’s only become more important to news publishers to protect their intellectual property."

So their argument is that people who would be paying money at their paywalls, are going to IA to get their news for free? And if they can thwart those people, they'll show up and become monthly subscribers?

I am vaguely sympathetic to newspapers as a concept, though the actual owners of approximately all of them are just PE companies looking to extract maximum profit from this dying industry, not really trying to prolong their existence.

But I think everyone who is interested in subscribing to their newspapers' paywalls already has subscribed. Those of us who bypass paywalls with that archive.whatever site, or apparently IA (I have never tried it for this purpose) are doing so because there is zero chance we're going to (recurringly!) pay the asking price for some random out-of-town newspaper, The Verge, Bloomberg, whatever. It's fair game to call us immoral for that decision, but if (and it's a big if) this move prevents more people from being able to bypass a paywall, I predict zero incremental dollars will go to the news publishers.

I think it's bound to happen, and in some ways it's a good thing too. The current state of AI affairs is a lot about outrightly selling some one else's intellectual property. The short-term incentives are eroding the trust and goodwill among the natural knowledge actors.

The next natural thing to happen would be privatization or consolidation of the internet itself. It's already happening in the form of grabbing and consolidating IPv4 addresses.

If we don't know the past, we won't know it's repeating.

Perhaps I imagined this, but some months ago on X someone pointed out that a historical article on dailymail.co.uk related to Prince Philip and Epstein had been scrubbed, likely by intelligence services or through D-Notices. Instead of showing a 404 page, the URL redirected to an article that was similar but benign. I checked the URL on the Wayback Machine and it turned up zero results, not even the redirected article. However, the user on X had screen-grabbed the original, which everyone was reading and commenting on. As of 21st May I can't find this discussion on X, and Grok denies it ever existed. This is a "maximally truth-finding" AI, so I must be mistaken. Perhaps the Internet Archive cannot be trusted, so this is why 340 local news outlets need to limit access.

Apologies for the self-promo. Downvote and I'll know not to do it again.

This trend of outright banning the Internet Archive has me extremely worried. I fear a future where news articles are memoryholed, and no one can remember exactly what was reported and how sensational it all seemed.

I've been working on this project [0] for a while. Originally, I started with a tool that would allow people to snapshot webpages in their own browser, and they could selectively share their snapshots. Then by consensus, everyone could understand what exactly had changed, and they could draw their own conclusion about why.
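The "what exactly had changed" step can be illustrated with a plain text diff between two snapshots of the same page. This is only a minimal sketch using Python's standard `difflib`, not how MemoryHole itself works; the function name `changed_lines` and the sample headlines are made up for illustration.

```python
import difflib


def changed_lines(old_text: str, new_text: str) -> list[str]:
    """Return only the removed ("- ") and added ("+ ") lines between two snapshots."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    # Hypothetical snapshot text, not taken from any real article.
    before = "Mayor denies claim\nCouncil votes Tuesday"
    after = "Mayor admits claim\nCouncil votes Tuesday"
    for line in changed_lines(before, after):
        print(line)
```

Several people running the same comparison on their own independently saved copies is what would make the result more convincing than a single party's claim.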

While working on it, I realized that an authoritative answer to "what did it look like on $DATE" can't be produced by a no-name company. It's gotta be a non-commercial entity that's got a track record of integrity. The dream would be to allow MemoryHole customers to submit their snapshots to the Internet Archive (or other non-commercial entity). It's definitely a copyright nightmare - so no clue how this could work.

[0] - https://memoryhole.app

Maybe they should allow the Internet Archive access to their articles after a week or two.

But I think this will hurt them more than help as time goes on. IIRC, one news org blocked free access and their revenue fell. I think that was in Australia.

But it seems they are using AI as the reason, so allowing access after a week will not prevent AI access.

But what happens if an AI company subscribes to the news site using a person's name (or a fake name)? They will still get the article and avoid hassles.

If the block is merely user-agent based, IA can spoof a different user agent to get these sites.

Not surprising, sites like Reddit use it to get around their paywalls.

Redditors then had the gall to pretend like it wasn’t their number one use case.

Of course they are, because they are not primarily concerned with the reporting of noteworthy events. They are most worried about profit with the secondary goal of reporting but only insofar as it serves the first goal. This is a wider trend across many industries.

Obviously, a business needs to have an income, but it's becoming more common for businesses to function first and foremost as revenue generators, and the thing that enables that is only seen as a means to an end. When the quality of the product/service and its function as a revenue generator diverge, the product/service will always take 2nd chair.

Maybe we could argue that the primary product is the revenue, especially when there are investors involved who are looking for big returns.

This future is here already; policy makers have it locked up. Any person who remembers what microfiche is understands the magnitude of this problem of not having a trustworthy public record. If we extended public policy from the library era, the Library of Congress itself would be the Internet Archive.

> It's definitely a copyright nightmare - so no clue how this could work.

It could work as a decentralized free and open source system that doesn't care about copyright. Like how torrents work now, but it would be good to have it work over Tor or something. Perhaps as a DAO for the management aspect of it. I don't know how exactly. But disregarding copyright by using a centralized company is the wrong idea.

Or you can do the lawful approach and try to work within the framework of that copyright nightmare. But "fuck copyright" is an easier path.

I really like this, and it's reasonably priced.

Is there a way to export/download my saves in a reasonable way?

> The current state of AI affairs is a lot about outrightly selling some one else's intellectual property.

Blocking archiving in a flailing attempt to keep AIs away is extremely shortsighted. Archiving is important for keeping historical context, especially when it comes to news and journalism.

Cloudflare is trying to push for that, but every time it's mentioned people complain (because they hate Cloudflare for making them wait 2s for a captcha) and nobody proposes an alternative solution. I don't think this is going to happen, unfortunately, and the internet will get silo-ed into oblivion.

There's a river of cash flowing to the pockets of the wealthy and to the megalomaniac projects of hyperscalers, but not even a few pennies drip into the pockets of people providing such an important public service as journalists.

It may be easier to convince them if the Internet Archive doesn't allow access for <period of time>. Not good for the average user now, but at least it would be archived for the future. Better than having no archive at all.

That sounds like a good idea to me.

One of the tests for Fair Use in the US, as I understand it, would be whether the archived work "competes" with the original.

If people start going to IA instead to read the news, the newspaper might have a claim. But if they're doing it to get around paywalls, or purely for archival/historical/research purposes, that may be allowed.

But the reality is such decisions are subjective and will be up to whatever judge happens to get such a case in front of them if this is challenged.

This sounds like the beginning of a story where the next odd thing is your family and friends don’t know who you are, and no one has ever heard of you.

More than even that, there is more news being generated than there are 3 inch chimp brains available to digest it all (even with AI busy summarizing everything) or act on it.

There is no media theory of information of what happens when info explodes beyond capacity of the system to consume it. (UN report on Attention Economy says less than 1% is actually consumed by humans)

So media orgs, instead of coming up with one, they just keep mindlessly doing what they know how to do - generate more info. Platforms and corps subsidize this activity for their own interests.

So media orgs have no signal/warped signals of how useless what they are doing is.

When it comes to the companies named here, I would argue that they have shown that reporting isn't even a secondary goal or a goal at all. Journalists don't even make that much money, but they've still gutted newsrooms very thoroughly. I assume that they already have people working on setting up an LLM connected to feeds of press releases, government announcements, public police crime reports, prominent social media accounts, etc. to create a repository of slop they can use (which will bear a vague resemblance to 'news') without having even one reporter employed. And then they'll try to sell access to that slop feed back to the AI vendor (which hopefully won't buy it).

As good a time as any to remind people that the Southern Strategy was never really all that Southern:

https://www.uh.edu/news-events/stories/052815watchingtvracia...

https://www.mediamatters.org/legacy/video-what-happens-when-...

Historically-speaking, if your local news can twist the context to make you easier to sell to (products, services, ideologies), they will do that.

Tor is a honeypot run by government intel operations. Don't use it.

You - as a company - can just avoid any copyright stuff when your extension saves the stuff only on the client. I see there are many other issues then.

The torrent approach is nice. I could imagine a selfhosted way to store the data (for a group of people)

Thank you! Yes, you just get a zip file with all of your saved pages.

It looks like this:

├── files
│   └── 632daffb-2f4f-4795-bb4d-3149d24f4264
│       ├── original.html
│       ├── readerview.html
│       └── screenshot.png
├── manifest.json
└── metadata.csv
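For anyone who wants to script against an export like this, here is a rough sketch that lists each saved page and its files. It relies only on the folder layout shown above; the contents of manifest.json and metadata.csv are not described in the thread, so it ignores them. The function name `snapshots_in_export` and the file name `export.zip` are hypothetical.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def snapshots_in_export(zip_source) -> dict[str, list[str]]:
    """Map each snapshot folder under files/ to the sorted file names inside it."""
    snapshots = defaultdict(list)
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            parts = PurePosixPath(name).parts
            # Keep only files/<snapshot-id>/<file>; this skips manifest.json,
            # metadata.csv, and bare directory entries.
            if len(parts) == 3 and parts[0] == "files":
                snapshots[parts[1]].append(parts[2])
    return {snapshot_id: sorted(files) for snapshot_id, files in snapshots.items()}


if __name__ == "__main__":
    # "export.zip" is a placeholder path for a downloaded export.
    for snapshot_id, files in snapshots_in_export("export.zip").items():
        print(snapshot_id, files)
```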

There is a natural flow of information that allows the information producers to make money for their work. How do you expect that the information producers would even be able to continue to create information when they are not getting paid anymore?

One possible solution that I can think of for the long-term good could be to just allow archival, with no retrieval of the latest information, at least for 6 months or a year. This should theoretically satisfy most goals.
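As a rough illustration of the "archive now, serve later" idea, the check below separates when a page is captured from when it may be shown. It is a sketch only: nothing here reflects how the Internet Archive actually works, the name `is_retrievable` is made up, and the 180-day value is just one reading of "6 months".

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy value; the comment above suggests "6 months or a year".
EMBARGO = timedelta(days=180)


def is_retrievable(captured_at: datetime, now: datetime,
                   embargo: timedelta = EMBARGO) -> bool:
    """A capture is stored immediately but only served once the embargo has passed."""
    return now - captured_at >= embargo


if __name__ == "__main__":
    captured = datetime(2026, 1, 1, tzinfo=timezone.utc)
    print(is_retrievable(captured, datetime(2026, 3, 1, tzinfo=timezone.utc)))
    print(is_retrievable(captured, datetime(2026, 7, 15, tzinfo=timezone.utc)))
```

The point of the design is that the capture date stays fixed, so the historical record is complete even though readers cannot use the archive as a same-day substitute for the original site.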

Yeah IA needs to get their heads out of their asses and just do that. It's an archive, but if it's available at the same time as it's relevant, then it's being used as alternate access.

In general, judges seem to understand that the copyright holder has some interest in these situations but don't seem to understand that the rest of the community has some rights too.

> I could imagine a selfhosted way to store the data (for a group of people)

Linkwarden does this well. You can share a collection for a small group of people.

https://github.com/linkwarden/linkwarden

Please provide evidence for such strong claims. Otherwise it's just FUD.

May 20, 2026, 5:03 p.m.

McClatchy, Advance Local, Tribune Publishing and other major newspaper chains are restricting the nonprofit’s archiving bots.

In January, Nieman Lab broke the story that major news publishers — including The New York Times, The Guardian, and USA Today Co. — had started blocking the Internet Archive due to concerns that AI companies might scrape the nonprofit’s repositories for training data.

No news publisher has confirmed to Nieman Lab that an AI company has already scraped their content from the Wayback Machine. Still, in the five months since we published our story the number of news sites blocking the Internet Archive has continued to rise.

Overwhelmingly, these sites are local news outlets.

Our new analysis shows that more than 340 local news sites across the United States are now limiting the Internet Archive’s ability to access and preserve their stories. Many sites in our sample are owned by five of the seven largest local news publishers in the country: USA Today Co., McClatchy, Advance Local, MediaNews Group, and Tribune Publishing. The latter two are both subsidiaries of the “vulture hedge fund” Alden Global Capital.

Researchers, historians, and citizens around the world rely on the web archives of local news sites to do their work.

“Blocking the Internet Archive’s web crawlers threatens one of the most effective ways that we capture and store news content for the long term,” Edward McCain, a journalism librarian at the University of Missouri, said. “In the present we may have some workarounds, but in the long run, it weakens a vital link in primary source materials that we need to understand where we’ve been and where we want to go.”

Working journalists are among the most frequent users of the Wayback Machine’s local news archives. Over the last month, online petitions have called for news media companies to allow the Internet Archive to preserve their journalism.

“I cover news within a larger news desert in New York’s Rockland, Sullivan, and Rockland counties. This means I need to heavily rely on archival data of old news articles from now deceased, or zombie-fied, media outlets,” wrote B.J. Mendelson, the editor of The Monroe Gazette newsletter, in one recent petition signed by over 200 journalists. “Without the Internet Archive, my [work] would be incredibly difficult to do.”

In the face of publisher concerns, the Wayback Machine has highlighted its efforts to minimize abuse of its site, including implementing systems that limit bulk downloading and working with vendors like Cloudflare to monitor bot activity. “We are in conversation with many publishers and appreciate the opportunity to address their concerns,” Mark Graham, the founder of the Wayback Machine, told Nieman Lab, noting that the Internet Archive’s terms of use only permits using its collections for scholarship or research purposes.

Meredith Broussard, a data journalist and professor at New York University, said that as profit margins for news thin, it’s only become more important to news publishers to protect their intellectual property.

“This is the same fight that everybody has been having with the Internet Archive since its inception,” Broussard said. “Internet Archive is a very old-school, ‘information-should-be-free’ organization. But the people who are invested differently have different priorities. There are lots of different historical and legal and economic issues that are colliding in this situation. AI companies [are] the catalyst for the latest skirmish in a very old battle.”

In January, Nieman Lab used journalist Ben Welsh’s database of 1,167 news websites’ robots.txt files to determine which sites were disallowing the Internet Archive. At the time, the Internet Archive did not respond to requests to confirm which crawling bots it was using, so we identified four bots that the AI user agent watchdog service Dark Visitors had associated with them. (You can find our full methodology here.)

We found that 241 news websites disallowed at least one Internet Archive-affiliated crawling bot. About 80% of these sites belonged to USA Today Co., the company formerly known as Gannett.

By May, we found that an additional 141 news websites disallowed at least one Internet Archive-affiliated bot, increasing the total number of sites in our sample to 382. Some of these additions appeared in Welsh’s database. We found others by checking robots.txt files ourselves. Our final sample includes sites in 10 countries, though the vast majority (93%) are based in the United States.

Of the 382 news sites in our updated sample, 342 are local. Of course, our data doesn’t include all the local news outlets in the United States, but it shows that many of the country’s largest local news publishers are at least attempting to limit Internet Archive access.

The scraping bots we tracked in our new analysis are Heritrix, My-heritrix-crawler, heritrix/3.3.0, Archive-It, archive.org_bot, ia_archiver-web.archive.org, and Special_archiver. (We included Archive-It, archive.org_bot, ia_archiver-web.archive.org, and Special_archiver in our January analysis. After confirming that the bot Heritrix and its variations belong to the Internet Archive, we added them.)
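A check of this kind can be approximated in a few lines. The sketch below is not Nieman Lab's actual code: it takes the text of a robots.txt file and reports which of the bot names listed above sit in a group containing a blanket `Disallow: /`. It assumes exact, case-insensitive name matching, ignores wildcard (`*`) groups and partial-path rules, and the function names are invented for illustration.

```python
# Bot names as listed in the article's May analysis.
IA_BOTS = [
    "Heritrix", "My-heritrix-crawler", "heritrix/3.3.0", "Archive-It",
    "archive.org_bot", "ia_archiver-web.archive.org", "Special_archiver",
]


def fully_disallowed_agents(robots_txt: str) -> set[str]:
    """Return lowercased user-agent names whose group contains 'Disallow: /'."""
    blocked: set[str] = set()
    group: list[str] = []   # user agents named by the group currently being read
    in_rules = False        # True once the current group has seen an allow/disallow line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                group, in_rules = [], False
            group.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                blocked.update(group)
    return blocked


def blocked_ia_bots(robots_txt: str) -> list[str]:
    """List the Internet Archive-affiliated bot names that are fully disallowed."""
    blocked = fully_disallowed_agents(robots_txt)
    return [bot for bot in IA_BOTS if bot.lower() in blocked]


if __name__ == "__main__":
    # Made-up robots.txt content for demonstration.
    sample = (
        "User-agent: *\nDisallow: /private\n\n"
        "User-agent: archive.org_bot\nUser-agent: ia_archiver-web.archive.org\n"
        "Disallow: /\n\nUser-agent: GPTBot\nAllow: /\n"
    )
    print(blocked_ia_bots(sample))
```

Under this simplified rule, a site would count as "disallowing at least one Internet Archive-affiliated bot" whenever the returned list is non-empty. A robots.txt entry is only a request, so it says nothing about server-side blocking such as the Cloudflare rules described later in the article.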

Graham told Nieman Lab that the Wayback Machine doesn’t use the bots “ia_archiver,” “ia_archiverbot” or “ia_archiver-web.archive.org.”

Third-party websites and internet forums have regularly documented “ia_archiver-web.archive.org” as an alleged user agent of the Wayback Machine. We continue to include “ia_archiver-web.archive.org” in our dataset because news publishers are disallowing the bot under the assumption that it is used by the Internet Archive.

Our full dataset can be viewed in the table below:

“The threat is definitely not the Internet Archive”

At least 13 Advance Local news sites, including The Cleveland Plain Dealer (Cleveland.com), The Patriot-News (PennLive.com), and The Oregonian (OregonLive.com), have added the Internet Archive’s user agents to their robots.txt files.

Advance Local (a subsidiary of Advance Publications, the Newhouse family-owned media giant) confirmed to Nieman Lab it began hard-blocking the Internet Archive last August. It took the action preemptively, without evidence that its content had been scraped by an AI company via the Wayback Machine.

“This is part of a broader effort to protect the value of our published work from unfair third‑party use. This decision is not specific to the Wayback Machine,” said Christine deWit, a spokesperson for Advance Local, in a statement.

Alden Global Capital is another major local news chain that has rolled out new restrictions on the Internet Archive. About 60 of those sites are owned by MediaNews Group, the Alden subsidiary that operates dailies across the country, including The Mercury News, the Denver Post, and the New York Daily News. Another seven publications are operated by Tribune Publishing, most notably the Chicago Tribune.

Alden has been criticized for aggressively acquiring U.S. newspapers and stripping them of resources for short-term profits. Alden did not respond to requests for comment.

In July 2025, Alden ran an editorial in more than 60 of its daily newspapers openly criticizing OpenAI and other AI companies that have used news content to train their models without compensation. “Securing permission from, and fairly compensating, those publishers who created this great foundation of knowledge is the right, just and American thing to do,” read the editorial. Both Alden publishers are part of the major copyright infringement suit against OpenAI and Microsoft that includes The New York Times and is currently winding its way through federal court.

Some independent local publishers, like The Baltimore Banner, are open to AI chatbots surfacing their stories without licensing deals. But they’re still concerned that a “back door” like the Wayback Machine’s might hurt their chances at being cited properly.

Last year, The Banner worked with the company DataDome to analyze crawler activity on its site. The findings were striking: about 25% of The Banner’s site traffic was coming from bots, including crawlers operated by the Internet Archive, according to Biswajit Ganguly, the chief technology officer and AI strategist at the Banner.

Based on that analysis, The Banner started blocking the Internet Archive, later adding one of its crawlers to its robots.txt file. It still lets major AI companies through, including crawlers used by ChatGPT and Claude.

As Ganguly explains it, the new restrictions on the Wayback Machine are less about negotiating licensing deals or preventing The Banner’s stories from appearing in AI products, and more about ensuring those products trace information back to The Banner instead of linking to sites that aggregate its work.

“We didn’t want the bots to be trained on our content, and then spit out answers based on the content without any kind of references, link, or attribution to our sources,” said Ganguly. “If ChatGPT finds something in the Wayback Machine…we were not sure how well it would be attributed back to us.”

He added that The Banner is still gathering information on how AI search products interact with news about the Baltimore region and the publication is open to lifting its block down the line.

“The threat is definitely not the Internet Archive,” Ganguly said. “But it’s a question of how the other actors are going to provide references or attributions and links back to the real creator of the content.”

Blocking as leverage for payment

Local publishers aren’t the only ones ramping up these efforts. Condé Nast, another arm of Advance Publications, has rolled out a coordinated effort to disallow the Internet Archive. Vogue, The New Yorker, Pitchfork, Vanity Fair, Bon Appetit, and Wired currently disallow four crawling bots from our list. (Last month, Wired covered the existential threat these blocks pose to the Internet Archive). Condé Nast did not respond to a request for comment.

The Atlantic has been working with Cloudflare to block the Internet Archive since last summer and added one of the Internet Archive’s crawlers to its robots.txt file in an update earlier this year, according to Anna Bross, The Atlantic’s SVP of communications. She said the decision is part of the outlet’s “aggressive” blocking policy.

“Our default is to block: No one should be scraping The Atlantic’s journalism without permission, regardless of the use,” Bross said.

The Atlantic’s CEO Nick Thompson commented on our January reporting in a video posted to LinkedIn in April. He said blocking the Internet Archive is important for publishers that want to maintain leverage when negotiating licensing with big AI companies.

“Because of the damages that can be done when you let all your content be scraped, because of all the leverage you lose, there will be worthy products that you previously gave your data to and now you can’t,” said Thompson.

Major international publishers have also started to block the Internet Archive, including the leading newspaper in Brazil, Folha de S.Paulo. Folha added three Internet Archive user agents to its robots.txt file in February.

“Folha believes that the sustainability of professional journalism — the very material the public record seeks to preserve — depends on protecting intellectual property,” said Sérgio Dávila, Folha’s editor-in-chief. “If AI companies wish to use this archive for training, they must enter into licensing agreements rather than rely on third-party repositories.”

Dávila noted that Folha invests in its own digital archive, Acervo Folha, which includes digitized editions of print issues going back to the paper’s founding in 1921. Access to Acervo Folha is available to paying subscribers.

What can be done?

Archiving is expensive; the technical infrastructure, storage, and expertise can be cost-prohibitive to smaller news organizations.

Before the rise of digital news, many papers maintained physical archives, often staffed with in-house librarians. Today, due to the contraction of the newspaper industry, most of those dedicated archiving roles are gone and the move to digital publishing has only complicated the issue.

A new content management system (CMS) can often lead to major archival losses. In 2024, thousands of articles vanished from the sites of the Daily Hampshire Gazette and the Greenfield Recorder in Western Massachusetts during a CMS switch. When publications close, many former owners don’t want to shoulder the cost of maintaining a site. In 2022, a decade after The Hook, a Charlottesville weekly, went under, its archived site went offline, along with over 22,000 stories.

The Internet Archive is often touted as a hero of the web for taking on the Herculean task of preserving the entirety of the internet, and for stepping in when news organizations fail to preserve their own work.

In December, the Internet Archive partnered with the Poynter Institute and Investigative Reporters and Editors to train a cohort of 33 local and national news outlets on how to develop and implement an archiving strategy. The initiative, funded through a Press Forward grant, aims to train 300 newsrooms in digital preservation and in using the Internet Archive’s services by the end of 2027.

Most of the initial cohort is made up of independent and nonprofit local newsrooms, including Outlier Media, Charlottesville Tomorrow, and The 51st. Wired is the only publication in our dataset restricting Internet Archive access that is participating in the program.

As Broussard, the NYU professor, points out, while the Internet Archive is one of the few efforts to make archives free, it isn’t the only effort to archive news. News publishers have long licensed their journalism to commercial archives like ProQuest and LexisNexis, which are often available in libraries, universities, and for individual subscriptions. They’re not free, but they do exist. At least several publications in our sample appear in ProQuest databases, including the Chicago Tribune, The Baltimore Sun, Honolulu Civil Beat, and USA Today.

Economic incentives are a valid reason for publishers to want to keep their contents out of the Internet Archive, Broussard said, but news outlets should have a long-term, multifaceted preservation strategy. Even with a plan in place, the reality for many publishers is that it’s unlikely that they’ll be able to save everything.

“Every news organization, especially local news organizations, generally launch thinking, ‘we’re going to put stuff on the internet and it’s going to be there forever,’ and that’s not true,” Broussard said. “Anybody who told you the internet is forever lied.”

Correction: An earlier version of this story stated that NOLA.com was owned by Advance Local. It is currently owned by Georges Media Group.

Photo of Internet Archive servers by Scott Beal/Laughing Squid used under a Creative Commons license.

Andrew Deck is a staff writer covering AI at Nieman Lab. Have tips about how AI is being used in your newsroom? You can reach Andrew via email, Bluesky, or Signal (+1 203-841-6241).