I've sometimes dreamed of a web where every resource is tied to a hash and can be rehosted by third parties, making archival transparent. It would also make it trivial to stand up a small website without worrying about it getting hug-of-deathed, since others would rehost your content for you. Shame IPFS never went anywhere.
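A minimal sketch of the idea, assuming nothing fancier than a SHA-256 digest as the address (the function names are illustrative, not any real IPFS API):

    import hashlib

    def content_address(resource: bytes) -> str:
        # The identifier is derived from the bytes themselves, so any
        # third-party mirror serving the same bytes serves the "same" resource.
        return "sha256-" + hashlib.sha256(resource).hexdigest()

    def verify(resource: bytes, address: str) -> bool:
        # A client can check a copy fetched from any mirror against the
        # address it asked for, which is what makes rehosting trustworthy.
        return content_address(resource) == address

Archival then just means keeping the bytes around somewhere; the address itself never rots.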
I've seen companies fail compliance reviews because a third-party vendor's published security policy, referenced in their own controls, no longer exists at the URL they cited. The web being unarchivable isn't just a cultural loss. It's becoming a real operational problem for anyone who has to prove to an auditor that something was true at a specific point in time.
The purpose of a search engine is to display links to web pages, not the entire content. As such, it can be argued it falls under fair use. It provides value to the people searching for content and those providing it.
However, we left such a crucially important public utility in the hands of private companies, which changed their algorithms many times to maximize their profits rather than the public good.
I think there needs to be real competition, and I am increasingly certain that the government should be part of that competition. Both "private" companies and "public" government are biased, but they are biased in different ways, and I think there is real value to be created in this clash. It makes it easier for individuals to pick and choose the best option for themselves, and for independent third options to be developed.
The current cycle of knowledge generation is: academia does foundational research -> private companies expand this research and monetize it -> nothing. If the last step were expanded to the government providing a barebones but usable service to commoditize it, years after private companies have been able to reap immense profits, then the capabilities of the entire society are increased. If the last step is prevented, the ruling companies turn to rent-seeking and resting on their laurels, turning from innovating to extracting.
It stores webpages in multiple formats (HTML snapshot, screenshot, PDF snapshot, and a fully dedicated reader view) so you’re not relying on a single fragile archive method.
There’s both a hosted cloud plan [1] which directly supports the project, and a fully self-hosted option [2], depending on how much control you need over storage and retention.
Users control which sites they want to allow it to record, so there are no privacy worries, especially assuming the plugin is open source.
No automated crawling. The plugin does not drive the user's browser to fetch things. Just whatever a user happens to actually view on their own; some percentage of those views from the activated domains gets submitted up to some archive.
Not every view. Maybe 100 people each submit 1% of their views, and maybe it's a random selection, or maybe it's weighted by some feedback mechanism where the archive destination can say, "If the user views this particular URL, I still don't have that one yet, so definitely send it if you see it rather than just applying the normal random chance."
Not sure how to protect the archive itself or its operators.
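A rough sketch of the sampling described above, assuming a per-view random draw plus an optional "wanted" list the archive can publish (all names here are hypothetical, not any real plugin's API):

    import random
    from urllib.parse import urlparse

    SAMPLE_RATE = 0.01  # each user submits roughly 1% of their own views

    def should_submit(url: str, allowed_domains: set, wanted_urls: set) -> bool:
        domain = urlparse(url).hostname
        if domain not in allowed_domains:
            return False   # user never opted this domain in; nothing leaves the browser
        if url in wanted_urls:
            return True    # the archive has flagged this URL as still missing
        return random.random() < SAMPLE_RATE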
I've said it before, and I'll say it again: the main issue is not design patterns, but the lack of acceptable payment systems. The EU, with its dismantling of Visa and Mastercard, now has the perfect opportunity to solve this, but I doubt it will. It'll probably just create a European WeChat.
I've been building tools that integrate with accounting platforms and the number of times a platform's API docs or published rate limits have simply disappeared between when I built something and when a user reports it broken is genuinely frustrating. You can't file a support ticket saying "your docs said X" when the docs no longer say anything because they've been restructured.
For compliance specifically - HMRC guidance in the UK changes constantly, and the old versions are often just gone. If you made a business decision based on published guidance that later changes, good luck proving what the guidance actually said at the time. The Wayback Machine has saved me more than once trying to verify what a platform's published API behaviour was supposed to be versus what it actually does.
The SOC 2 / audit trail point upthread is spot on. I'd add that for smaller businesses, it's not just formal compliance frameworks - it's basic record keeping. When your payment processor's fee schedule was a webpage instead of a PDF and that webpage no longer exists, you can't reconcile why your fees changed.
Maybe the Internet Archive would be OK with keeping some things private until a certain amount of time passes; or they could require an account to access them.
They do not care, and we will all be worse off for it if these AI companies keep bombarding news publishers' RSS feeds.
It is a shame that the open web as we know it is closing down because of these AI companies.
News publishers limit Internet Archive access due to AI scraping concerns
But then it was not really open content anyway.
> When asked about The Guardian’s decision, Internet Archive founder Brewster Kahle said that “if publishers limit libraries, like the Internet Archive, then the public will have less access to the historical record.”
Well - we need something like Wikipedia for news content. Perhaps not 100% Wikipedia; instead, Wikipedia to store the hard facts, with tons of verification, plus a news editorial side that focuses on free content but in a newspaper style, e.g. with professional (or good) writers. I don't know how the model could work, but IF we could come up with this, then newspapers that gatewall information would become less relevant automatically. That way we win long-term, as the paid gatewalls aren't really part of the open web anyway.
Sell a "truck full of DAT tapes" type service to AI scrapers with snapshots of the IA. Sort of like the cloud providers have with "Data Boxes".
It will fund IA, be cheaper than building and maintaining so many scrapers, and may relieve the pressure on these news sites.
I am sad about link rot and old content disappearing, but it's better than everything being saved for all time, to be used against folks in the future.
We are increasingly becoming blind. To me it looks as if this is done on purpose actually.
That's a travesty, considering that a huge chunk of science is public-funded; the public is being denied the benefits of what they're paying for, essentially.
I also don't think they care even a bit. They're pushing agendas, and not hiding it; rather, flaunting it.
The Wikipedia folks had their own Wikinews project which is essentially on hold today because maintenance in a wiki format is just too hard for that kind of uber-ephemeral content. Instead, major news with true long-term relevance just get Wikipedia articles, and the ephemera are ignored.
Interesting idea. It could be something that archives first and releases at a later date, when the news is no longer so new.
Journalism as an institution is under attack because the traditional source of funding - reader subscriptions to papers - no longer works.
Replicating the Wikipedia model would require replicating the structure of journalism for it to be reliable. Where would the funding for that come from? It's a tough situation.
Isn't that what state funded news outlets are?
Also, I always wonder about Common Crawl:
Is there something wrong with it? Is it badly designed? What is it that all the trainers cannot find there, so that they need to crawl our sites over and over again for the exact same stuff, each on its own?
Insurance pays as long as you aren't knowingly grossly negligent. You can even say "yes, these systems don't meet x standard and we are working on it" and be ok because you acknowledged that you were working on it.
Your boss and your boss's boss tell you "we have to do this so we don't get fucked by insurance if so-and-so happens," but they are either ignorant, lying, or just using that to get you to do something.
I've seen wildly out of date and unpatched systems get paid out because it was a "necessary tradeoff" between security and a hardship to the business to secure it.
I've actually never seen a claim denied and I've seen some pretty fuckin messy, outdated, unpatched legacy shit.
Bringing a system to compliance can reasonably take years. Insurance would be worthless without the "best effort" clause.
Does it just POST the URL to them for them to fetch? Or is there any integration/trust mechanism to store what you already fetched on the client directly in their archives?
Practically no quality journalism is.
> we need something like wikipedia for news
Wikipedia editors aren’t flying into war zones.
This is from my experience having a personal website. AI companies keep coming back even if everything is the same.
The very first result was a 404
https://aws.amazon.com/compliance/reports/
The jokes write themselves.
Sidebar:
Having been part of multiple SOC audits at large financial firms, I can say that nothing brings adults closer to physical altercations in a corporate setting than trying to define which jobs are "critical".
- The job that calculates the profit and loss for the firm, definitely critical
- The job that cleans up the logs for the job above, is that critical?
- The job that monitors the cleaning up of the logs, is that critical too?
These are simple examples but it gets complex very quickly and engineering, compliance and legal don't always agree.
I hope I’m wrong, but my bot paranoia is at all time highs and I see these patterns all throughout HN these days.
Seriously? What kind of auditor would "fail" you over this? That doesn't sound right. That would typically be a finding and you would scramble to go appease your auditor through one process or another, or reach out to the vendor, etc, but "fail"? Definitely doesn't sound like a SOC2 audit, at least.
Also, this has never been particularly hard to solve for me (obviously a biased experience, so I wonder if this is just a bubble thing). Just ask companies for actual docs; don't reference URLs. That's what I've typically seen: you get a copy of their SOC 2, pentest report, and controls, and you archive them yourself. Why would you point at a URL? I've actually never seen that, tbh, and if a company does that, it's not surprising that they're "failing" their compliance reviews. I mean, even if the web were more archivable, how would reliance on a URL be valid? You'd obviously still need to archive that content anyway.
Maybe if you use a tool that you don't have a contract with or something? I feel like I'm missing something, or this is something that happens in fields like medical that I have no insight into.
This doesn't seem like it would impact compliance at all tbh. Or if it does, it's impacting people who could have easily been impacted by a million other issues.
They can charge money for access or disallow all scrapers, but it should not be allowed to selectively allow only Google.
> no privacy worries
This is harder than you might expect. Publishing these files is always risky because sites can serve you fingerprinting data, like some hidden HTML tag containing your IP and other identifiers.
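As a hedged illustration, about the best a submitter can do cheaply is scan a capture for identifiers it already knows about before publishing; anything encoded or hashed will slip straight past a check like this:

    def looks_personalized(html: str, known_identifiers: list) -> bool:
        # Crude pre-publication check: does the captured page literally
        # contain this user's IP address, account ID, session token, etc.?
        lowered = html.lower()
        return any(ident.lower() in lowered for ident in known_identifiers)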
The problem with the LLMs is they capture the value chain and give back nothing. It didn’t have to be this way. It still doesn’t.
They will announce official paid AI access plans soon. Bookmark my words.
So we've basically decided we only want bad actors to be able to scrape, archive, and index.
I wonder if bots/ai will need to build their own specialized internet for faster sharing of data, with human centered interfaces to human spaces.
Jan. 28, 2026, 3:09 p.m.
Outlets like The Guardian and The New York Times are scrutinizing digital archives as potential backdoors for AI crawlers.
As part of its mission to preserve the web, the Internet Archive operates crawlers that capture webpage snapshots. Many of these snapshots are accessible through its public-facing tool, the Wayback Machine. But as AI bots scavenge the web for training data to feed their models, the Internet Archive’s commitment to free information access has turned its digital library into a potential liability for some news publishers.
When The Guardian took a look at who was trying to extract its content, access logs revealed that the Internet Archive was a frequent crawler, said Robert Hahn, head of business affairs and licensing. The publisher decided to limit the Internet Archive’s access to published articles, minimizing the chance that AI companies might scrape its content via the nonprofit’s repository of over one trillion webpage snapshots.
Specifically, Hahn said The Guardian has taken steps to exclude itself from the Internet Archive’s APIs and filter out its article pages from the Wayback Machine’s URLs interface. The Guardian’s regional homepages, topic pages, and other landing pages will continue to appear in the Wayback Machine.
In particular, Hahn expressed concern about the Internet Archive’s APIs.
“A lot of these AI businesses are looking for readily available, structured databases of content,” he said. “The Internet Archive’s API would have been an obvious place to plug their own machines into and suck out the IP.” (He admits the Wayback Machine itself is “less risky,” since the data is not as well-structured.)
As news publishers try to safeguard their contents from AI companies, the Internet Archive is also getting caught in the crosshairs. The Financial Times, for example, blocks any bot that tries to scrape its paywalled content, including bots from OpenAI, Anthropic, Perplexity, and the Internet Archive. The majority of FT stories are paywalled, according to director of global public policy and platform strategy Matt Rogerson. As a result, usually only unpaywalled FT stories appear in the Wayback Machine because those are meant to be available to the wider public anyway.
“Common Crawl and Internet Archive are widely considered to be the ‘good guys’ and are used by ‘the bad guys’ like OpenAI,” said Michael Nelson, a computer scientist and professor at Old Dominion University. “In everyone’s aversion to not be controlled by LLMs, I think the good guys are collateral damage.”
The Guardian hasn’t documented specific instances of its webpages being scraped by AI companies via the Wayback Machine. Instead, it’s taking these measures proactively and is working directly with the Internet Archive to implement the changes. Hahn says the organization has been receptive to The Guardian’s concerns.
The outlet stopped short of an all-out block on the Internet Archive’s crawlers, Hahn said, because it supports the nonprofit’s mission to democratize information, though that position remains under review as part of its routine bot management.
“[The decision] was much more about compliance and a backdoor threat to our content,” he said.
When asked about The Guardian’s decision, Internet Archive founder Brewster Kahle said that “if publishers limit libraries, like the Internet Archive, then the public will have less access to the historical record.” It’s a prospect, he implied, that could undercut the organization’s work countering “information disorder.”
The Guardian isn’t alone in reevaluating its relationship to the Internet Archive. The New York Times confirmed to Nieman Lab that it’s actively “hard blocking” the Internet Archive’s crawlers. At the end of 2025, the Times also added one of those crawlers — archive.org_bot — to its robots.txt file, disallowing access to its content.
“We believe in the value of The New York Times’s human-led journalism and always want to ensure that our IP is being accessed and used lawfully,” said a Times spokesperson. “We are blocking the Internet Archive’s bot from accessing the Times because the Wayback Machine provides unfettered access to Times content — including by AI companies — without authorization.”
Last August, Reddit announced that it would block the Internet Archive, whose digital libraries include countless archived Reddit forums, comments sections, and profiles. This content is not unlike what Reddit now licenses to Google as AI training data for tens of millions of dollars.
“[The] Internet Archive provides a service to the open web, but we’ve been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine,” a Reddit spokesperson told The Verge at the time. “Until they’re able to defend their site and comply with platform policies…we’re limiting some of their access to Reddit data to protect redditors.”
Kahle has also alluded to steps the Internet Archive is taking to restrict bulk access to its libraries. In a Mastodon post last fall, he wrote that “there are many collections that are available to users but not for bulk downloading. We use internal rate-limiting systems, filtering mechanisms, and network security services such as Cloudflare.”
Currently, however, the Internet Archive does not disallow any specific crawlers through its robots.txt file, including those of major AI companies. As of January 12, the robots.txt file for archive.org read: “Welcome to the Archive! Please crawl our files. We appreciate it if you can crawl responsibly. Stay open!” Shortly after we inquired about this language, it was changed. The file now reads, simply, “Welcome to the Internet Archive!”
There is evidence that the Wayback Machine, generally speaking, has been used to train LLMs in the past. An analysis of Google’s C4 dataset by the Washington Post in 2023 showed that the Internet Archive was among millions of websites in the training data used to build Google’s T5 model and Meta’s Llama models. Out of the 15 million domains in the C4 dataset, the domain for the Wayback Machine (web.archive.org) was ranked as the 187th most present.
In May 2023, the Internet Archive went offline temporarily after an AI company caused a server overload, Wayback Machine director Mark Graham told Nieman Lab this past fall. The company sent tens of thousands of requests per second from virtual hosts on Amazon Web Services to extract text data from the nonprofit’s public domain archives. The Internet Archive blocked the hosts twice before putting out a public call to “respectfully” scrape its site.
“We got in contact with them. They ended up giving us a donation,” Graham said. “They ended up saying that they were sorry and they stopped doing it.”
“Those wanting to use our materials in bulk should start slowly, and ramp up,” wrote Kahle in a blog post shortly after the incident. “Also, if you are starting a large project please contact us …we are here to help.”
The Guardian’s moves to limit the Internet Archive’s access made us wonder whether other news publishers were taking similar actions. We looked at publishers’ robots.txt pages as a way to measure potential concern over the Internet Archive’s crawling.
A website’s robots.txt page tells bots which parts of the site they can crawl, acting like a “doorman,” telling visitors who is and isn’t allowed in the house and which parts are off limits. Robots.txt pages aren’t legally binding, so the companies running crawling bots aren’t obligated to comply with them, but they indicate where the Internet Archive is unwelcome.
For example, in addition to “hard blocking,” The New York Times and The Athletic include the archive.org_bot in their robots.txt file, though they do not currently disallow other bots operated by the Internet Archive.
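For reference, a disallow rule of that kind looks roughly like this in a robots.txt file (the user-agent string is the one named above; the rest is a generic illustration, not any publisher's actual file):

    User-agent: archive.org_bot
    Disallow: /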
To explore this issue, Nieman Lab used journalist Ben Welsh's database of 1,167 news websites as a starting point. As part of a larger side project to archive news sites' homepages, Welsh runs crawlers that regularly scrape the robots.txt files of the outlets in his database. In late December, we downloaded a spreadsheet from Welsh's site that displayed all the bots disallowed in the robots.txt files of those sites. We identified four bots that the AI user agent watchdog service Dark Visitors has associated with the Internet Archive. (The Internet Archive did not respond to requests to confirm its ownership of these bots.)
This data is not comprehensive, but exploratory. It does not represent global, industry-wide trends — 76% of sites in Welsh's publisher list are based in the U.S., for example — but instead begins to shed light on which publishers are less eager to have their content crawled by the Internet Archive.
In total, 241 news sites from nine countries explicitly disallow at least one out of the four Internet Archive crawling bots.
Most of those sites (87%) are owned by USA Today Co., the largest newspaper conglomerate in the United States formerly known as Gannett. (Gannett sites only make up 18% of Welsh’s original publishers list.) Each Gannett-owned outlet in our dataset disallows the same two bots: “archive.org_bot” and “ia_archiver-web.archive.org”. These bots were added to the robots.txt files of Gannett-owned publications in 2025.
Some Gannett sites have also taken stronger measures to guard their contents from Internet Archive crawlers. URL searches for the Des Moines Register in the Wayback Machine return a message that says, “Sorry. This URL has been excluded from the Wayback Machine.”
“USA Today Co. has consistently emphasized the importance of safeguarding our content and intellectual property,” a company spokesperson said via email. “Last year, we introduced new protocols to deter unauthorized data collection and scraping, redirecting such activity to a designated page outlining our licensing requirements.”
Gannett declined to comment further on its relationship with the Internet Archive. In an October 2025 earnings call, CEO Mike Reed spoke to the company’s anti-scraping measures.
“In September alone, we blocked 75 million AI bots across our local and USA Today platforms, the vast majority of which were seeking to scrape our local content,” Reed said on that call. “About 70 million of those came from OpenAI.” (Gannett signed a content licensing agreement with Perplexity in July 2025.)
About 93% (226 sites) of publishers in our dataset disallow two out of the four Internet Archive bots we identified. Three news sites in the sample disallow three Internet Archive crawlers: Le Huffington Post, Le Monde, and Le Monde in English, all of which are owned by Group Le Monde.
The news sites in our sample aren't only targeting the Internet Archive. Out of the 241 sites that disallow at least one of the four Internet Archive bots in our sample, 240 sites disallow Common Crawl — another nonprofit internet preservation project that has been more closely linked to commercial LLM development. In our sample, 231 sites disallow bots operated by OpenAI, Google AI, and Common Crawl.
As we’ve previously reported, the Internet Archive has taken on the Herculean task of preserving the internet, and many news organizations aren’t equipped to save their own work. In December, Poynter announced a joint initiative with the Internet Archive to train local news outlets on how to preserve their content. Archiving initiatives like this, while urgently needed, are few and far between. Since there is no federal mandate that requires internet content to be preserved, the Internet Archive is the most robust archiving initiative in the United States.
“The Internet Archive tends to be good citizens,” Hahn said. “It’s the law of unintended consequences: You do something for really good purposes, and it gets abused.”
Andrew Deck is a staff writer covering AI at Nieman Lab.
Maybe they vibecoded the crawlers. I wish I were joking.
Yes, we have hundreds of identical Microsoft and AWS policies, but it's the only way. Checksum the full zip and sign it as part of the contract; that's literally how we do it.
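A tiny sketch of that verification step, assuming the digest is written into the signed contract and the evidence bundle is a plain zip (the file name and digest are placeholders):

    import hashlib

    CONTRACT_DIGEST = "<sha-256 digest recorded in the signed contract>"

    def sha256_of(path: str) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    if sha256_of("vendor_policies_2025.zip") != CONTRACT_DIGEST:
        raise ValueError("evidence bundle does not match the contracted checksum")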
While unlikely, the ideal would be for the government to provide foundational open search infrastructure that would allow people to build on it and expand it to fit their needs, in a way that is hard to do when a private company eschews competition and hides its techniques.
Perhaps it would be better for there to be a sanctioned crawler funded by the government, that then sells the unfiltered information to third parties like google. This would ensure IP rights are protected while ensuring open access to information.
Now that AI companies are using residential proxies to get around the obvious countermeasures, I have resorted to blocking all countries that are not my target audience.
It really sucks. The internet is terminally ill.
I don't understand this line of thinking. I see it a lot on HN these days, and every time I do I think to myself "Can't you realize that if things kept on being erased we'd learn nothing from anything, ever?"
I've started archiving every site I have bookmarked in case they go down. The majority of websites don't have anything that could be used against the "folks" who made them. (I don't think there's anything particularly scandalous about caring for doves or building model planes.)
Their big requirement is you need to not be doing any DNS filtering or blocking of access to what it wants, so I've got the pod DNS pointed to the unfiltered quad9 endpoint and rules in my router to allow the machine it's running on to bypass my PiHole enforcement+outside DNS blocks.
^1 https://wiki.archiveteam.org/
^2 https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior
And a local archive is one fire, business decision, or poor technical choice away from being permanently lost.
AI training will be hard to police. But a lot of these sites inject ads in exchange for paywall circumvention. Just scanning Reddit for the newest archive.is or whatever should cut off most of the traffic.
The incentives for online news are really wacky just to begin with. A coin at the convenience store for the whole dang paper used to be the simplest thing in the world.
Indefinitely? Probably not.
What about when a regime wants to make the science disappear?
This also goes back to something I said long ago: AI companies are relearning software engineering, poorly. I can think of so many ways to speed up AI crawlers; I'm surprised someone being paid 5x my salary cannot.
Just because you have infinite money to spend on training doesn't mean you should saturate the internet with bots looking for content with no constraints - even if that is a rounding error of your cost.
We just put heavy constraints on our public sites blocking AI access. Not because we mind AI having access - but because we can't accept the abusive way they execute that access.
A link disappearing isn’t a major issue. Not something I’d worry about (but yea might show up as a finding on the SOC 2 report, although I wouldn’t be surprised if many auditors wouldn’t notice - it’s not like they’re checking every link)
I’m also confused why the OP is saying they’re linking to public documents on the public internet. Across the board, security orgs don’t like to randomly publish their internal docs publicly. Those typically stay in your intranet (or Google Drive, etc).
You're missing the existence of technology that allows anyone to create superficially plausible but ultimately made-up anecdotes for posting to public forums, all just to create cover for a few posts here and there mixing in advertising for a vaguely-related product or service. (Or even just to build karma for a voting ring.)
Currently, you can still sometimes sniff out such content based on the writing style, but in the future you'd have to be an expert on the exact thing they claim expertise in, and even then you could be left wondering whether they're just an expert in a slightly different area instead of making it all up.
EDIT: Also on the front page currently: "You can't trust the internet anymore" https://news.ycombinator.com/item?id=47017727
Sometimes it feels like ai-use concerns are a guise to diminish the public record. While on the other hand services like Ring or Flock are archiving the public forever.
Which is a valuable perspective. But it's not a substitute for a seasoned war journalist who can draw on global experience. (And relating that perspective to a particular home market.)
> I'm sure some of them would fly in to collect data if you paid them for it
Sure. That isn't "a news editorial that focuses on free content but in a newspaper-style, e. g. with professional (or good) writers."
One part of the population imagines journalists as writers. They're fine on free, ad-supported content. The other part understands that investigation is not only resource intensive, but also requires rare talent and courage. That part generally pays for its news.
Between the two, a Wikipedia-style journalistic resource is not entertaining enough for the former and not informative enough for the latter. (Importantly, compiling an encyclopedia is principally the work of research and writing. You can be a fine Wikipedia–or scientific journal or newspaper–editor without leaving your room.)
In the past libraries used to preserve copies of various newspapers, including on microfiche, so it was not quite feasible to make history vanish. With print no longer out there, the modern historical record becomes spotty if websites cannot be archived.
Perhaps there needs to be a fair-use exception or even a (god forbid!) legal requirement to allow archivability? If a website is open to the public, shouldn't it be archivable?
The problem is that AI companies have decided they want instant access to all data on Earth the moment it becomes available somewhere, and they have the infrastructure to actually try to make that happen. So they're ignoring signals like robots.txt, and not even checking whether the data is actually useful to them (they're not getting anything helpful out of recrawling the same search-results pagination in every possible permutation, but that won't stop them from trying, and from knocking everyone's web servers offline in the process) the way even the most aggressive search engine crawlers did. They're just bombarding every single publicly reachable server with requests on the off chance that some new data fragment becomes available and they can ingest it first.
This is also, coincidentally, why Anubis is working so well. Anubis kind of sucks, and in a sane world where these companies had real engineers working on the problem, they could bypass it on every website in just a few hours by precomputing tokens.[2] But...they're not. Anubis is actually working quite well at protecting the sites it's deployed on despite its relative simplicity.
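For context, the proof of work is roughly a SHA-256 nonce search like the sketch below (this reflects my understanding of Anubis's scheme; the exact challenge format and difficulty encoding are assumptions). The point is that nothing ties the work to the moment of the request, so a determined crawler could farm it out ahead of time:

    import hashlib
    from itertools import count

    def solve_challenge(challenge: str, difficulty: int = 4) -> int:
        # Find a nonce whose SHA-256 digest starts with `difficulty` zero hex digits.
        target = "0" * difficulty
        for nonce in count():
            digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
            if digest.startswith(target):
                return nonce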
It really does seem to indicate that LLM companies want to just throw endless hardware at literally any problem they encounter and brute force their way past it. They really aren't dedicating real engineering resources towards any of this stuff, because if they were, they'd be coming up with way better solutions. (Another classic example is Claude Code apparently using React to render a terminal interface. That's like using the space shuttle for a grocery run: utterly unnecessary, and completely solvable.) That's why DeepSeek was treated like an existential threat when it first dropped: they actually got some engineers working on these problems, and made serious headway with very little capital expenditure compared to the big firms. Of course they started freaking out, their whole business model is based on the idea that burning comical amounts of money on hardware is the only way we can actually make this stuff work!
The whole business model backing LLMs right now seems to be "if we burn insane amounts of money now, we can replace all labor everywhere with robots in like a decade", but if it turns out that either of those things aren't true (either the tech can be improved without burning hundreds of billions of dollars, or the tech ends up being unable to replace the vast majority of workers) all of this is going to fall apart.
Their approach to crawling is just a microcosm of the whole industry right now.
[1]: https://en.wikipedia.org/wiki/Common_Crawl
[2]: https://fxgn.dev/blog/anubis/ and related HN discussion https://news.ycombinator.com/item?id=45787775
It’s very unfortunate and a short sighted way to operate.
Links alone can be tempting, as you have to reference the same docs or policies over and over for various controls.
[1]: https://arstechnica.com/civis/threads/journalistic-standards...
lol seriously, this is like... at least 50% of the time how it would play out, and I think the other 49% it would be "ah sorry, I'll grab that and email it over" and maybe 1% of the time it's a finding.
It just doesn't match anything. And if it were FedRAMP, well, holy shit, a URL was never acceptable anyway.
That's actually a potentially good business idea - a legally certifiable archiving software that captures the content at a URL and signs it digitally at the moment of capture. Such a service may become a business requirement as Internet archivability continues to decline.
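A minimal sketch of the capture-and-sign step, assuming an Ed25519 key held by the archiving service and setting aside the harder parts (trusted timestamping, independent transparency logs, and what a court will actually accept):

    import hashlib, json, time
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    def capture_record(url: str, body: bytes, key: Ed25519PrivateKey) -> dict:
        record = {
            "url": url,
            "sha256": hashlib.sha256(body).hexdigest(),
            "captured_at": int(time.time()),
        }
        # Sign the canonicalized record so anyone holding the public key can
        # later verify what was captured and when the service says it was.
        payload = json.dumps(record, sort_keys=True).encode()
        record["signature"] = key.sign(payload).hex()
        return record

    key = Ed25519PrivateKey.generate()
    # record = capture_record("https://vendor.example/security-policy", page_bytes, key)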
Sadly, it does not even have to be an acquisition or rebrand. For most companies, a simple "website redo", even if the brand remains unchanged, will change all the URLs such that any previously recorded ones return "not found". Granted, if the identical attestation is simply at a new URL, someone could potentially find that new URL and update the "policy" -- but that's also extra effort that the insurance company can avoid by requiring screenshots or PDF exports.
Even if the content is taken down, changed or moved, a copy is likely to still be available in the Wayback Machine.
Or are you thinking of companies like Iron Mountain that provide such a service for paper? But even within corporations, not everything goes to a service like Iron Mountain, only paper that is legally required to be preserved.
A society that doesn't preserve its history is a society that loses its culture over time.
@dang do you have any thoughts about how you’re performing AI moderation on HN? I’m very worried about the platform being flooded with these Submarine comments (as PG might call them).
But - as another poster pointed out - Wikipedia offers this, and still gets hammered by scrapers. Why buy when free, I guess?
I believe many publications used to do this. The novel threat is AI training. It doesn't make sense to make your back catalog de facto public for free like that. There used to be an element of goodwill in permitting your content to be archived. But if the main uses are circumventing compensation and circumventing licensing requirements, that goodwill isn't worth much.
BUT, it's hard to learn from history if there's no history to learn...
Either way I'm fairly certain that blocking AI agent access isn't a viable long term solution.
The truly important stuff exists in many forms, not just online/digital. Or will be archived with increased effort, because it's worth it.
I think it "failed" because people expected it to be a replacement transport layer for the existing web, minus all of the problems the existing web had, and what they got was a radically different kind of web that would have to be built more or less from scratch.
I always figured it was a matter of the existing web getting bad enough, and then we'd see adoption improve. Maybe that time is near.
That is: if it's not accessible by a human who was blocked?
I don't know how exactly it achieves being "legally certifiable", at least to the point that courts are trusting it. Signing and timestamping with independent transparency logs would be reasonable.
Any vendor who you work with should make it trivial to access these docs, even little baby startups usually make it quite accessible - although often under NDA or contract, but once that's over with you just download a zip and everything is there.
Having your cake and eating it too should never be valid law.
They're getting very clever and tricky though; a lot of them have the owners watching and step in to pretend that they're not bots and will respond to you. They did this last week and tricked dang.
Great point. If my personal AI assistant cannot find your product/website/content, it effectively may no longer exist! For me. Ain't nobody got the time to go searching that stuff up and sifting through the AI slop. The pendulum may even swing the other way and the publishers may need to start paying me (or whoever my gatekeeper is) for access to my space...
Every comment section here can be summed up as "LLM bad" these days.
If you don’t want your bad behavior preserved for the historical record, perhaps a better answer is to not engage in bad behavior instead of relying on some sort of historical eraser.
Because it costs money to serve them the content.
That's what I thought the first time I was involved in a SOC2 audit. But a lot of the "evidence" I sent was just screenshots. Granted, the stuff I did wasn't legal documents, it was things like the output of commands, pages from cloud consoles, etc.
[1] https://www.mololamken.com/assets/htmldocuments/NLJ_5th%20Ci...
[2] https://www.nortonrosefulbright.com/en-au/knowledge/publicat...
It's not "LLM bad" — it's "LLM good, some people bad, bad people use LLM to get better at bad things."
Is the answer regulate AI? Yes.
What I would not do is take a screenshot of a vendor website and say "look, they have a SOC2". At every company, even tiny little startup land, vendors go through a vendor assessment that involves collecting the documents from them. Most vendors don't even publicly share docs like that on a site so there'd be nothing to screenshot / link to.
Because when you build it you aren't, presumably, polling their servers every fifteen minutes for the entire corpus. AI scrapers are currently incredibly impolite.
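For contrast, polling politely is not hard — a hedged sketch using conditional requests so an unchanged page costs the origin almost nothing (uses the `requests` library; the URL, interval, and `process` handler are placeholders):

    import time
    import requests

    def process(body: bytes) -> None:
        pass  # placeholder for whatever the scraper does with new content

    def poll_politely(url: str, interval_s: int = 3600) -> None:
        etag = None
        while True:
            headers = {"If-None-Match": etag} if etag else {}
            resp = requests.get(url, headers=headers, timeout=30)
            if resp.status_code == 304:
                pass  # unchanged since last fetch: the server did almost no work
            elif resp.ok:
                etag = resp.headers.get("ETag")
                process(resp.content)
            time.sleep(interval_s)  # one URL, one request per interval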