e.g. IA would publish signed HTTPS requests with their key, so you, as the site owner, could confirm a request really is from them and not from an AI crawler (see the sketch below).
Feels like that would be very anti-open-internet, but I'm not sure how else you would prove who is a good actor and who isn't (from your perspective, that is).
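A minimal sketch of what that verification could look like, assuming an Ed25519 keypair and a made-up set of signed request fields; none of this reflects any actual IA scheme, header names, or key format:

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# The crawler's keypair. In practice the archive would publish the public key
# out of band and keep the private key to itself.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

def sign_request(method: str, path: str, date: str) -> bytes:
    """Crawler side: sign the request fields the site owner will check."""
    return private_key.sign(f"{method} {path} {date}".encode())

def is_from_archive(method: str, path: str, date: str, signature: bytes) -> bool:
    """Site-owner side: verify the signature against the published public key."""
    try:
        public_key.verify(signature, f"{method} {path} {date}".encode())
        return True
    except InvalidSignature:
        return False

# Round trip: a request the crawler signed verifies; a tampered one does not.
sig = sign_request("GET", "/news/article-123", "2025-01-01T00:00:00Z")
assert is_from_archive("GET", "/news/article-123", "2025-01-01T00:00:00Z", sig)
assert not is_from_archive("GET", "/other-page", "2025-01-01T00:00:00Z", sig)
```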
How about thinking about your mission and taking an anti-AI hardliner stance? But I see multiple corporate sponsors that would not be pleased:
All these so called freedom organizations like the OSI and the EFF have been bought and are entirely irrelevant if not harmful.
Maybe it's time to admit that the techie community has a pretty bad moral compass and that we're not good stewards of the world's knowledge. We turn lofty ideals into amoral money-making schemes whenever we can. I'm not sure that the EFF's role in this is all that positive. They come from a good place, but they ultimately aid a morally bankrupt industry. I don't want archive.org to retain a copy of everyone's online footprint because I know it will be used the same way it always is: to make money off other people's labor and to erode privacy.
Unless you love walled gardens, doomscrolling, and endless AI slop, it seems like the fun is over.
Blocking certain JA3 hashes has so far been the most effective countermeasure. However, I wish there were an nginx wrapper around hugin-net that could help me do TCP fingerprinting as well, since I do not know Rust and feel terrified of asking an LLM to write it. There is also a race-condition issue with that approach: because the fingerprinting is passive, even the JA4 hashes won't be available for the first connection, and the AI crawlers I've seen make one request per IP, so you never get a chance to block a second request (it never happens).
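For reference, the JA3 value itself is just an MD5 over a few ClientHello fields, so the blocklist check is simple once those fields are visible; the hard part (passively parsing the ClientHello) is out of scope here. A rough sketch with placeholder values:

```python
import hashlib

def ja3_hash(tls_version: int, ciphers: list[int], extensions: list[int],
             curves: list[int], curve_formats: list[int]) -> str:
    """JA3: MD5 of 'version,ciphers,extensions,curves,curve_formats', with each
    list dash-joined in ClientHello order (real implementations also strip
    GREASE values before hashing)."""
    parts = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, curve_formats)),
    ]
    return hashlib.md5(",".join(parts).encode()).hexdigest()

# Hashes seen from unwanted crawlers; this entry is a placeholder, not a real one.
BLOCKLIST = {"00000000000000000000000000000000"}

def should_block(tls_version, ciphers, extensions, curves, curve_formats) -> bool:
    """Drop the connection if its JA3 hash is on the blocklist."""
    return ja3_hash(tls_version, ciphers, extensions, curves, curve_formats) in BLOCKLIST
```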
Had they never existed, it likely would not have made a dent in AI development - just as, had they been twice as productive, it likely would not have made a dent in the quality of LLMs either.
I don't necessarily believe that we won't find some half-successful solution that will allow server hosting to be done as it currently is, but I'm not very sure that I'll want to participate in whatever schemes come about from it, so I'm thinking more about how I can avoid those schemes rather than insisting that they won't exist/work.
The prevailing thought is that if it's not possible now, it won't be long before a human browser will be indistinguishable from an LLM agent. They can start a GUI session, open a browser, navigate to your page, snapshot from the OS level and backwork your content from the snapshot, or use the browser dev tools or whatever to scrape your page that way. And yes, that would be much slower and more inefficient than what they currently do, but they would only need to do that for those that keep on the bleeding edge of security from AI. For everyone else, you're in a security race against highly-paid interests. So the idea of having something on the public internet that you can stop people from archiving (for whatever purpose they want) seems like it's soon to be an old-fashioned one.
So, taking it as a given that you can't stop what these people are currently trying to stop (without a legislative solution and an enforcement mechanism): how can we make scraping less of a burden on individual hosts? Is this thing going to coalesce into centralizing "archiving" authorities that people trust to archive things, and serve as a much more structured and friendly way for LLMs to scrape? Or is it more likely someone will come up with a way to punish LLMs or their hosts for "bad" behavior? Or am I completely off base? Is anyone actually discussing this? And, if so, what's on the table?
There must be some mechanism to prevent tampering in such a setup.
I'm a bit surprised I never read about this until now; it's disappointing, though unfortunately not surprising.
> The Times says the move is driven by concerns about AI companies scraping news content. Publishers seek control over how their work is used, and several—including the Times—are now suing AI companies over whether training models on copyrighted material violates the law. There’s a strong case that such training is fair use.
I suspect part of it might be these corps not wanting people to skip a paywall (whether or not someone would pay even if they had no access is a different story). But this argument makes no sense for the Guardian.
And it's genuinely not that weird for news organisations to want to stop AI scraping. This is just a repeat of their fight with social media embedding.
Sure. The back catalogue should be as close to public domain as possible; libraries keeping those records is incredibly important for research.
But with current news, that becomes complicated as taking the articles and not paying the subscription (or viewing their ads) directly takes away the revenue streams that newsrooms rely on to produce the news. Hence the "Newspaper trying to ban linking" mess, which was never about the links themselves but about social media sites embedding the headline and a snippet, which in turn made all the users stop clicking through and "paying" for the article.
Social media relies on those newsrooms (same with, really, most other kinds of websites) to provide a lot of their content. And AI relies on them for all of the training data (remember: "synthetic data" does not appear ex nihilo) & to provide the news that the AI users request. We can't just let the newsrooms die. The newsroom itself hasn't been replaced; its revenue has been destroyed.
---
And so, the question of archives pops up. Because yes, you can with some difficulty block out the AI bots, even the social media bots. A paywall suffices.
But this kills archiving. Yet if you whitelist the archives in some way, the AI scrapers will just pull their data out of the archive instead and the newsrooms still die. (Which also makes the archiving moot)
A compromise solution might be for archives to accept/publish things on a delay, keeping the AI companies from taking current news without paying up while still granting everyone access to stuff from decades ago.
There's just major disagreement about what a reasonable delay is. Most major news orgs and other such IP-holders are pretty upset about AI firms' "steal first, ask permission later" approach. Several AI firms setting the standard that training data is to be paid for doesn't help here either. In paying for training data they've created a significant market for archives, and a significant incentive to not make them publicly, freely accessible.
Why would The Times ever hand over their catalogue to the Internet Archive if Amazon will pay them a significant sum of money for it? The greater good of all humanity? Good luck getting that from a dying industry.
---
Tangent: Another annoying wrinkle in the financial incentives here is that not all archiving organisations are engaging in fair play, which yet further pushes people to obstruct their work.
To cite a HN-relevant example: Source code archivist "Software Heritage" has long engaged in holding a copy of all the source code they can get their hands on, regardless of its license. If it's ever been on GitHub, odds are they're distributing it. Even when licenses explicitly forbid that. (This is, of course, perfectly legal in the case of actual research and other fair use. But:)
They were notably involved in HuggingFace's "The Stack" project by sharing their archives ... and received money from HuggingFace. While the latter is nominally a donation, it is in effect a sale.
---
I find it quite displeasing that the EFF fails to identify the incentives at play here. Simply trying to nag everyone into "doing the thing for the greater good!" is loathsome and doesn't work. Unless we change this incentive structure, the outcome won't change.
Isn't this basically what content-addressable storage is for? Have the site provide the content hashes rather than the content and then put the content on IPFS/BitTorrent/whatever where the bots can get it from each other instead of bothering the site.
Extra points if you can get popular browsers to implement support for this, since it also makes it a lot harder to censor things and a decent implementation (i.e. one that prefers closer sources/caches) would give most of the internet the efficiency benefits of a CDN without the centralization.
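A toy illustration of the mechanism, using SHA-256 as the address function (the real addressing scheme would be whatever IPFS/BitTorrent/etc. defines):

```python
import hashlib

def content_address(data: bytes) -> str:
    """The origin site publishes only this hash; the bytes can live anywhere."""
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, published_address: str) -> bool:
    """A client that fetched `data` from any peer or cache checks it against the
    address the origin published, so the peer itself never needs to be trusted."""
    return content_address(data) == published_address

page = b"<html>example page</html>"
addr = content_address(page)               # the only thing the origin has to serve
assert verify(page, addr)                  # copy fetched from a peer checks out
assert not verify(b"tampered copy", addr)  # altered copies are detected
```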
If there's one thing people, especially HN users, should've learned by now, it's that there's no enforcement mechanism worth a damn for Internet legislation when incentives don't align.
Bloat and bandwidth costs are the real problems here. Everyone seems to have forgotten the basics of engineering and accounting.
Trivial as long as they terminate the TLS on their end, not yours. So you'd just be a residential proxy.
> Rejection hurts … You’ve chosen to reject third-party cookies while browsing our site. Not being able to use third party cookies means we make less from selling adverts to fund our journalism.
We believe that access to trustworthy, factual information is in the public good, which is why we keep our website open to all, without a paywall.
If you don’t want to receive personalised ads but would still like to help the Guardian produce great journalism 24/7, please support us today. It only takes a minute. Thank you.
If only... Despite providing a useful service, they are not as nice towards site owners as one would hope.
Internet Archive says:
> We see the future of web archiving relying less on robots.txt file declarations geared toward search engines
https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
They are not alone in that. The "Archiveteam", a different organization, not to be confused with archive.org, also doesn't respect robots.txt according to their wiki: https://wiki.archiveteam.org/index.php?title=Robots.txt
I think it is safe to say that there is little consideration for site owners from the largest archiving organizations today. Whether there should be is a different debate.
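For contrast, honoring robots.txt takes very little effort on the crawler's side; Python's standard library handles it (the URL and user-agent below are placeholders):

```python
import urllib.robotparser

# Fetch and parse the site's robots.txt once per host.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Check each URL against the rules for this crawler's user-agent before fetching.
if rp.can_fetch("ExampleArchiveBot", "https://example.com/some/article"):
    print("allowed by robots.txt")
else:
    print("disallowed by robots.txt")
```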
Preventing new human generated text from being used by AI firms (without consent) seems like a valid strategy.
Maybe it'll just be cheaper for CDNs or whatever to sell the data they serve directly instead of going through the extra steps of scraping.
Modern LLMs are trained on a large percentage of synthetic data.
This sentiment is largely legacy (even though just a couple of years old).
I think this EFF piece could be more forthright (rather than an exercise in political persuasion), since the matter involves balancing multiple public-interest goals that are currently in opposition.
> Organizations like the Internet Archive are not building commercial AI systems.
This NiemanLab article lists evidence that Internet Archive explicitly encouraged crawling of their data, which was used for training major commercial AI models:
| News publishers limit Internet Archive access due to AI scraping concerns (niemanlab.org) | 569 points by ninjagoo 34 days ago | 366 comments | https://news.ycombinator.com/item?id=47017138
> [...] over a fight that libraries like the Archive didn't start, and didn't ask for.
They started or stumbled into this fight through their actions. And (ideology?) they also started and asked for a related fight, about disregard of copyright and exploitation of creators:
| Internet Archive forced to remove 500k books after publishers' court win (arstechnica.com) | 530 points by cratermoon on June 21, 2024 | 564 comments | https://news.ycombinator.com/item?id=40754229
So ads, just not personalized. Remind me again why personalized ads are good for me if I have to pay to have non-personalized ads?
> The "Archiveteam", a different organization, not to be confused with archive.org, also doesn't respect robots.txt according to their wiki
"Archiveteam" exists in a different context. Their usual purpose is to get a copy of something quickly because it's expected to go offline soon. This both a) makes it irrelevant for ordinary sites in ordinary times and b) gives the ones about to shut down an obvious thing to do, i.e. just give them a better/more efficient way to make a full archive of the site you're about to shut down.
NY Times is 0.06% of common crawl.
These news media outlets provide a drop in the ocean worth of information. Both qualitatively and quantitatively.
The news / media industry is really just trying to hold on to their lifeboat before inevitably becoming entirely irrelevant.
(I do find this sad, but it is the reality - I can already get considerably better journalism from LLMs than from actual journalists, both the clickbait stuff and the high-quality stuff.)
It's easy to pretend you're human, it's hard to pretend that you have a valid cryptographic signature for Google which attests that your hardware is Google-approved.
Crawling is the price we pay for the web's openness.
That's a problem because archive.org honors removal requests from site owners. Buy an old domain and you can theoretically wipe its archived history clean.
>Archiving and Search Are Legal
But giving full articles away for free to everyone is not. Archive.org has the power to make archives private.
Like, 3 orders of magnitude less compute, conservatively counting.
What happens when the human gives an agent access to said signature? Then you fall back on traditional anti-bot techniques and you're right back where you started.
They don't modify any device and will pass whatever attestation you try to make.
At least NYT is probably on the correct side of Sturgeon’s Law: https://en.wikipedia.org/wiki/Sturgeon%27s_law
LLMs are (apparently) massively used to get information about topics in the real world. Novels aren't going to be much help there. Journalism, particularly in written form, provides a fount of facts presented from different angles, as well as opinions, and it was all there free for the taking…
Wikipedia provides the scantest summary of that, fora and social media give you banter, fake news, summaries of news, and a whole lot of shaky opinions, at best. Novels give you the foundations of language, but in terms of knowledge nothing much beyond what the novel is about.
We don't need to attest whether signals are analogue or digital. The world is going to adapt to the use of Gen AI in everything. The future of art, communications, and productivity will all be rooted in these tools.
> by selection of topics, by distribution of concerns, by emphasis and framing of issues, by filtering of information, by bounding of debate within certain limits. They determine, they select, they shape, they control, they restrict — in order to serve the interests of dominant, elite groups in the society."
> "history is what appears in The New York Times archives; the place where people will go to find out what happened is The New York Times. Therefore it's extremely important if history is going to be shaped in an appropriate way, that certain things appear, certain things not appear, certain questions be asked, other questions be ignored, and that issues be framed in a particular fashion."
The propaganda in The New York Times is especially precious because of how highly respected it is; there never was a war or other elite interest they didn't push along.
I joke, but there are those out there who don’t.
You may get an inconvenient answer when you ask the question the other way around.
Soon the news and the historical facts will be unnecessary. You can simply receive your wisdom from the AIs, which, as nondeterministic systems, are free to change the facts at will.
The crux of the problem was the doxxing, not the defense against it.
The pattern here is deference to official narratives at precisely the times when criticism is needed the most.
Instead, training attempts to sample more heavily from higher quality sources, with, I'm sure, a mix of manual and heuristic labeling.
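As a toy picture of what "sampling more heavily from higher quality sources" amounts to (source names and weights are made up; real pipelines derive weights from manual review, classifier scores, dedup statistics, and so on):

```python
import random

# Made-up mixing weights over corpora, normalized to 1.0.
SOURCE_WEIGHTS = {
    "curated_news": 0.5,
    "books": 0.3,
    "general_web_crawl": 0.2,
}

def sample_source() -> str:
    """Pick which corpus the next training example is drawn from."""
    names = list(SOURCE_WEIGHTS)
    weights = [SOURCE_WEIGHTS[name] for name in names]
    return random.choices(names, weights=weights, k=1)[0]

# Over many draws, roughly half the examples come from the curated news corpus.
print(sample_source())
```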
Imagine a newspaper publisher announcing it will no longer allow libraries to keep copies of its paper.
That’s effectively what’s begun happening online in the last few months. The Internet Archive—the world’s largest digital library—has preserved newspapers since it went online in the mid-1990s. The Archive’s mission is to preserve the web and make it accessible to the public. To that end, the organization operates the Wayback Machine, which now contains more than one trillion archived web pages and is used daily by journalists, researchers, and courts.
But in recent months The New York Times began blocking the Archive from crawling its website, using technical measures that go beyond the web’s traditional robots.txt rules. That risks cutting off a record that historians and journalists have relied on for decades. Other newspapers, including The Guardian, seem to be following suit.
For nearly three decades, historians, journalists, and the public have relied on the Internet Archive to preserve news sites as they appeared online. Those archived pages are often the only reliable record of how stories were originally published. In many cases, articles get edited, changed, or removed—sometimes openly, sometimes not. The Internet Archive often becomes the only source for seeing those changes. When major publishers block the Archive’s crawlers, that historical record starts to disappear.
The Times says the move is driven by concerns about AI companies scraping news content. Publishers seek control over how their work is used, and several—including the Times—are now suing AI companies over whether training models on copyrighted material violates the law. There’s a strong case that such training is fair use.
Whatever the outcome of those lawsuits, blocking nonprofit archivists is the wrong response. Organizations like the Internet Archive are not building commercial AI systems. They are preserving a record of our history. Turning off that preservation in an effort to control AI access could essentially torch decades of historical documentation over a fight that libraries like the Archive didn’t start, and didn’t ask for.
If publishers shut the Archive out, they aren’t just limiting bots. They’re erasing the historical record.
Making material searchable is a well-established fair use. Courts have long recognized it’s often impossible to build a searchable index without making copies of the underlying material. That’s why when Google copied entire books in order to make a searchable database, courts rightly recognized it as a clear fair use. The copying served a transformative purpose: enabling discovery, research, and new insights about creative works.
The Internet Archive operates on the same principle. Just as physical libraries preserve newspapers for future readers, the Archive preserves the web’s historical record. Researchers and journalists rely on it every day. According to Archive staff, Wikipedia alone links to more than 2.6 million news articles preserved at the Archive, spanning 249 languages. And that’s only one example. Countless bloggers, researchers, and reporters depend on the Archive as a stable, authoritative record of what was published online.
The same legal principles that protect search engines must also protect archives and libraries. Even if courts place limits on AI training, the law protecting search and web archiving is already well established.
The Internet Archive has preserved the web’s historical record for nearly thirty years. If major publishers begin blocking that mission, future researchers may find that huge portions of that historical record have simply vanished. There are real disputes over AI training that must be resolved in courts. But sacrificing the public record to fight those battles would be a profound, and possibly irreversible, mistake.
Do people not also deserve to be protected from being DDOSed? Do people also not deserve to not have their internet traffic be used to DDOS someone?
You are deliberately misrepresenting the situation. The journalists who block archivist traffic are not in any way connected to the blogger who was attempting to investigate the creator of archive.is. You have portrayed them as related in an attempt to garner sympathy for the creator of archive.is.
Here is an account of the facts: https://gyrovague.com/2026/02/01/archive-today-is-directing-...
Duranty's New York Times articles were written in 1931, a decade before America entered World War II. They not only predate an American alliance with the Soviet Union, but they also predate the United States having any diplomatic relations with the Soviet Union whatsoever.
> Go back through the major wars in American history and you can find the New York Times championing the cause of war before each of these.
Are there other major American newspapers who have a history of dissenting against war? Wasn't the New York Times' behavior in most of the conflicts you mention in line with American popular opinion?
their idioms would leak occasionally otherwise
Imagine if all info about Facebook came from Facebook...
It is, but it's one of the only tools they have to prevent the doxxing site from being reachable.
> Do people not also deserve to be protected from being DDOSed?
You mean the person doing the doxxing should be protected?
>Do people also not deserve to not have their internet traffic be used to DDOS someone?
Yes, it should have been opt-in. But unless you don't run JS, you kinda give the website you visit the right to run arbitrary code anyway.
As for other newspapers, the Times isn't worse but bears the brunt of the criticism because they are after all America's foremost, most influential newspaper.
Of course, never aggressing against anyone, and transforming any aggression against oneself into an opportunity to acculturate the aggressor into someone with the same empathetic behavior, would be the mark of a paragon of virtue. But paragons of virtue are not the median norm, by definition.
Resorting to DDoS is not pretty, but "why is my violent behavior met with violence" is a little oblivious and a reversal of victim and perpetrator roles.
You may end up deciding to continue inflicting harm, intentionally so this time---that is a perfectly valid course to take. But you cannot anymore remain unintentional about it.
But even taking it literally, isn't that one of the things LLMs could actually do? You're essentially asking how a text generator could generate text. The real question is whether the questions would be any good, but the answer isn't necessarily no.
You used to need them, because journalists had the distribution and the sources didn't. In a world of printed newspapers, you couldn't get your story distributed nationally (much less worldwide) without the help of a journalist, doubly so if you wanted to stay anonymous.
Nowadays, you just make a Substack and there's that.
See that recent expose on the Delve fraud as just one example. No journalists were harmed in the making of that article.
Another basic ethological expectation is that the strong dominate the weak, but maybe we shouldn’t base our moral framework around how things are, and rather on how they should be.
I do think it’s a problem. You are the only one excusing bad behavior here.
> You may end up deciding to continue inflicting harm, intentionally so this time---that is a perfectly valid course to take. But you cannot anymore remain unintentional about it.
To be clear, are you talking about the harm of commanding a botnet (which includes you and me) to attack an investigative journalist for investigatively journaling?
For example, would they have been justified to murder the blogger?
Journalism is by definition a secondary source. (Notwithstanding edge cases like articles reporting directly on the news industry itself.)
If a journalist is on location covering a flood, for example, they are the primary source.
A journalist conducting an interview would also be a primary source.
Also a checkbox that says something like “I would like to help commit a crime using my internet traffic” would keep people from having their traffic used without consent.