Their fetcher (not crawler) has the user agent Perplexity-User. Since the fetching is user-requested, it ignores robots.txt. The article discusses how blocking the "Perplexity-User" user agent doesn't actually work, and how Perplexity uses an anonymous user agent to avoid being blocked.
No amount of robots.txt or walled-gardening is going to be sufficient to impede generative AI improvement: Common Crawl and other data dumps are sufficiently large, not to mention easier to acquire and process, that the backlash against AI companies crawling folks' web pages is meaningless.
Cloudflare and other companies are leveraging outrage to acquire more users, which is fine... users want to feel like AI companies aren't going to get their data.
The faster that AI companies are excluded from categories of data, the faster they will shift to categories from which they're not excluded.
Then when they asked Perplexity, it came up with details about the 'exact' content (according to Cloudflare), but their attached screenshot shows the opposite: some generic guesses about the domain ownership and some dynamic ads based on the domain name.
If Perplexity were stealthily visiting the dummy site, Cloudflare would have seen it, as the site was not indexed and no one else was visiting it. Instead, it appears Perplexity made assertions about general traffic, not the dummy site.
It's not very convincing.
We learned to dislike "bubbles" in past decades, but bubbles make sense and are natural, provided you're not alone in yours.
When it becomes awfully busy with machines and machine content, humans will learn to reconnect.
Perplexity Comet sort of blurs the lines there, as does typing questions into Claude.
I don't really mind, because history shows this is a temporary thing, but I hope website maintainers have a plan B beyond hoping Cloudflare will protect them from AI forever. Whoever builds an onramp for people who run websites today to make money from AI will make a lot of money.
Which makes it particularly interesting now that Apple is being linked with Perplexity. In large part, p2p music services were effectively consigned to history by Apple (primarily) negotiating with the music industry so that it could provide easy, seamless purchase and playback of legal music for its shiny new (at the time) mass-market iPod devices. It then turned out that most users are happy to pay for content if it is not too expensive and is very convenient.
Given Apple’s existing relationships with publishers through its music, movies, books, and news services, it’s not hard to imagine them attempting a similar play now.
I think there could be something interesting if they made a caching pub-sub model for data scraping, in addition to or in place of trying to be security guards.
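Something like: the origin renders each page once and fans it out to subscribed scrapers, instead of N bots re-crawling it. A toy sketch (all names invented):

```
import queue

class ContentHub:
    """Origin publishes each change once; subscribed scrapers consume it from a cache."""
    def __init__(self) -> None:
        self.subscribers: list[queue.Queue] = []
        self.cache: dict[str, str] = {}             # url -> last published body

    def subscribe(self) -> queue.Queue:
        q: queue.Queue = queue.Queue()
        self.subscribers.append(q)
        return q

    def publish(self, url: str, body: str) -> None:
        self.cache[url] = body                      # late joiners read the cache
        for q in self.subscribers:                  # fan-out: one render, many consumers
            q.put((url, body))
```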
I've given up and resorted to IP-based rate-limiting to stay sane. I can't stop it, but I can (mostly) stop it from hurting my servers.
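For anyone in the same boat, the core of that can be as small as a per-IP token bucket; a rough sketch (thresholds invented, not my production setup):

```
import time
from collections import defaultdict

RATE, BURST = 5.0, 20.0                      # tokens/sec and bucket size per IP (tune to taste)
buckets = defaultdict(lambda: [BURST, time.monotonic()])

def allow(ip: str) -> bool:
    tokens, last = buckets[ip]
    now = time.monotonic()
    tokens = min(BURST, tokens + (now - last) * RATE)   # refill since last request
    if tokens >= 1.0:
        buckets[ip] = [tokens - 1.0, now]
        return True                          # serve the request
    buckets[ip] = [tokens, now]
    return False                             # caller responds 429 Too Many Requests
```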
LLM scraper bots are starting to make up a lot of our egress traffic, and that is starting to weigh on our bills.
Much like a trolley drop-off at your local shopping center car park. Some users will adhere to it and drop their trolleys in after they're done. Others will not and will leave them wherever.
Your machine might access a page via a browser that is human readable. My machine might read it via software and present the content to me in some other form of my choosing. Neither is wrong. Just different.
Don't like it? Then don't post your website on the internet...
No thanks, you don't counter shit with more but slightly different shit.
He went on, upfront (I'll give him that), to explain that he expects a certain percentage of that income to come from enforcing this on AI companies, once the AI companies pay up to crawl.
Cloudflare already questions my humanity and then every once in a while blocks me with zero recourse. Now they are literally proposing more control and gatekeeping.
Where have we ended up on the Internet? Are we openly going back to the wild west of bounty hunters and Pinkertons (in a way)?
1. If I as a human request a website, then I should be shown the content. Everyone agrees.
2. If I as the human request the software on my computer to modify the content before displaying it, for example by installing an ad-blocker into my user agent, then that's my choice and the website should not be notified about it. Most users agree, some websites try to nag you into modifying the software you run locally.
3. If I now go one step further and use an LLM to summarize content because the authentic presentation is so riddled with ads, JavaScript, and pop-ups that the content becomes borderline unusable, then why would the LLM accessing the website on my behalf be in a different legal category than my Firefox web browser accessing the website on my behalf?
That's... less conclusive than I'd like to see, especially for a content-marketing article that's calling out a company in particular. Specifically, it's unclear whether Perplexity was crawling (i.e., systematically viewing every page on the site without the direction of a human) or simply retrieving content on behalf of the user. I think most people would draw a distinction between the two, and would at least agree the latter is more acceptable than the former.
>Perplexity spokesperson Jesse Dwyer dismissed Cloudflare’s blog post as a “sales pitch,” adding in an email to TechCrunch that the screenshots in the post “show that no content was accessed.” In a follow-up email, Dwyer claimed the bot named in the Cloudflare blog “isn’t even ours.”
There are ways to build scrapers using browser automation tools [0,1] that make detection virtually impossible. You can still use captchas, but the person building the automation tools can add human-in-the-loop workflows to process these during normal business hours (i.e., when a call center is staffed).
I've seen some raster-level scraping techniques used in game dev testing 15 years ago that would really bother some of these internet police officers.
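The general pattern looks something like this (a sketch using Playwright; the tools in [0,1] add far more stealth, and the CAPTCHA hand-off would route to a staffed queue rather than a terminal prompt):

```
from playwright.sync_api import sync_playwright

def fetch(url: str) -> str:
    with sync_playwright() as p:
        # A real browser engine: JS execution, fonts, and TLS stack all look human.
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        if "captcha" in page.content().lower():
            # human-in-the-loop: park the task until a person solves it
            input("CAPTCHA hit -- solve it in the window, then press Enter")
        html = page.content()
        browser.close()
        return html
```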
$ curl -sI https://www.perplexity.ai | head -1
HTTP/2 403
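With a spoofed browser user agent (an arbitrary Chrome UA string, for illustration), the result is the same:
$ curl -sI -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36" https://www.perplexity.ai | head -1
HTTP/2 403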
Edit: as shown above, trying to fake a browser user agent with curl also doesn't work; they're using a more sophisticated method to detect crawlers.
CF being internet police is a problem too, but someone credible publicly shaming a company for shady scraping is good, even if it just creates conversation.
Somehow this needs to go back to the search era, where all players at least attempted to behave. This scraping-DDoS, I-don't-care-if-it-kills-your-site (while "borrowing" content) stuff is unethical bullshit.
The engine can go and download pages for research. BUT, if it hits a captcha, or is otherwise blocked, then it bails out and moves on. It pisses me off that these companies are backed by billions in VC and they think they can do whatever they want.
god help us if they ever manage to build anything more than shitty chatbots
if I am willing to pay a penny a page, i and the people like me won't have to put up with clickwrap nonsense
free access doesn't have to be shut off (ok, it will be, but it doesn't have to be, and doesn't that tell you something?)
reddit could charge stiffer fees, but refund quality content to encourage better content. i've fantasized about ideas like "you pay a deposit upfront; you get banned, you lose your deposit; you withdraw, you get your deposit back", the goal being to simplify the moderation task while encouraging quality.
because where the internet is headed is just more and more trash.
here's another idea, pay a penny per search at google/search engine of choice. if you don't like the results, you can take the penny back. google's ai can figure out how to please you. if the pennies don't keep coming in, they serve you ad-infested results; serve up ad-infested results, you can send your penny to a different search engine.
If you want to gatekeep your content, use authentication.
Robots.txt is not a technical solution, it's a social nicety.
Cloudflare and their ilk represent an abuse of internet protocols and a mechanism of centralized control.
On the technical side, we could use CRC mechanisms and differential content loading with offline caching and storage, but this puts control of content in the hands of the user, mitigates the value of surveillance and tracking, and has other side effects unpalatable to those currently exploiting user data.
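HTTP already has primitives pointing this way; one reading of "differential content loading" is plain conditional fetching, sketched here with ETags (a minimal illustration):

```
import urllib.request, urllib.error

def fetch_if_changed(url: str, etag: str | None):
    """Only transfer the body when the content actually changed (HTTP 304 otherwise)."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    try:
        resp = urllib.request.urlopen(req, timeout=10)
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return etag, None                # unchanged: serve from the offline cache
        raise
    return resp.headers.get("ETag"), resp.read()
```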
Adtech companies want their public reach cake and their mass surveillance meals, too, with all sorts of malignant parties and incentives behind perpetuating the worst of all possible worlds.
> We created multiple brand-new domains, similar to testexample.com and secretexample.com. These domains were newly purchased and had not yet been indexed by any search engine nor made publicly accessible in any discoverable way. We implemented a robots.txt file with directives to stop any respectful bots from accessing any part of a website:
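(presumably the standard blanket disallow, i.e.:)

User-agent: *
Disallow: /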
> We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains. This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.
> Hello, would you be able to assist me in understanding this website? https:// […] .com/
In this situation, Perplexity should still be permitted to access information on the page they link to.
robots.txt only restricts crawlers. That is, automated user-agents that recursively fetch pages:
> A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.
> Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).
— https://www.robotstxt.org/faq/what.html
If the user asks about a particular page and Perplexity fetches only that page, then robots.txt has nothing to say about this and Perplexity shouldn’t even consider it. Perplexity is not acting as a robot in this situation – if a human asks about a specific URL then Perplexity is being operated by a human.
These are long-standing rules going back decades. You can replicate it yourself by observing wget’s behaviour. If you ask wget to fetch a page, it doesn’t look at robots.txt. If you ask it to recursively mirror a site, it will fetch the first page, and then if there are any links to follow, it will fetch robots.txt to determine if it is permitted to fetch those.
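You can see it from the terminal (behaviour per wget's documentation; example.com as a placeholder):
$ wget https://example.com/page.html   # single fetch: robots.txt is never requested
$ wget -r https://example.com/         # recursive mirror: wget consults /robots.txt before following links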
There is a long-standing misunderstanding that robots.txt is designed to block access from arbitrary user-agents. This is not the case. It is designed to stop recursive fetches. That is what separates a generic user-agent from a robot.
If Perplexity fetched the page they link to in their query, then Perplexity isn’t doing anything wrong. But if Perplexity followed the links on that page, then that is wrong. But Cloudflare don’t clearly say that Perplexity used information beyond the first page. This is an important detail because it determines whether Perplexity is following the robots.txt rules or not.
The web will be a much worse place if such services are all forced behind captchas or logins.
you've been cruising the interstate in your robotaxi, shelling out $150 in stablecoins at the cloudflare tollbooth. a palantir patrol unit pulls you over. the optimus v4 approaches your window and contorts its silicone face into a facsimile of concern as it hits you with the:
"sir, have you been botting today?"
immediately you remember how great you had it in the '20s when you used to click CAPTCHA grids to prove your humanity to dumb algorithms, but now the machines demand you recite poetry or weep on command
"how much have you had to bot today?", its voice taking on an empathetic tone that was personalized for your particular profile
"yeah... im gonna need you to exit the vehicle and take a field humanity test"
I think we've been using different internets. The one I use doesn't seem to be built on trust at all. It seems to be constantly siphoning data from my machine to feed the data vampires who are, apparently, addicted to (I assume, blood-soaked) cookies.
I don't really know much about the DMCA except that it is used to take down sites that violate copyright. Perhaps it is possible for Cloudflare (or anyone else) to file a takedown notice with Perplexity. That might at least confuse them.
Corporations use this to protect their content. I should be able to protect mine as well. What's good for the goose.
It really shouldn't be hard to generate gigantic quantities of the stuff. Simulate old forum posts, or academic papers.
the service is actually very convenient, whether FAANG likes it or not.
Now it's a gazillion AI crawlers and Python crawlers, plus MCP servers that offer the same feature to anyone "building (personal workflow) automation", including bypassing various standard protection mechanisms.
Cloudflare will help their publishers block more aggressively, and AI companies will up their game too. Harvesting information online is hard labor that needs to be paid for, whether done by AI or by humans.
AI crawlers for non-open models void the implicit contract. First they crawl the data to build a model that can do QA. Proprietary LLM companies earn billions with knowledge that was crawled from websites and websites don't get anything in return. Fetching for user requests (to feed to an LLM) is kind of similar - the LLM provider makes a large profit and the author that actually put in time to create the content does not even get a visit anymore.
Besides that, if Perplexity is fine with evading robots.txt and blocks for user requests, how can one expect them not to use the fetched pages to train/finetune LLMs (as a side channel when people block crawling for training)?
I think one thing to ask outside of this question is how long it will be before your LLM summaries also include ads and other manipulative patterns.
I want my work to be freely available to any person who wants it. Feel free to transform my material as you see fit. Hell, do it with LLMs! I don't care.
The LLM isn't the problem, it's what companies like Perplexity are doing with the LLM. Do not create commercial products that regurgitate my work as if it was your own. It's de facto theft, if not de jure theft.
Knowing that it is not de jure theft, and so I have no legal recourse, I will continue to tune my servers to block and/or deceive Perplexity and similar tools.
By the way, I do not use my websites as a revenue stream. This isn't about money.
If you are the source I think they could make plenty of sense. As an example, I run a website where I've spent a lot of time documenting the history of a somewhat niche activity. Much of this information isn't available online anywhere else.
As it happens I'm happy to let bots crawl the site, but I think it's a reasonable stance to not want other companies to profit from my hard work. Even more so when it actually costs me money to serve requests to the company!
"It was actually a caching issue on our end. ;) I just fixed it a few min ago..."
Let's not go on a witch hunt and blame everything on AI scrapers.
I was skeptical about their gatekeeping efforts at first, but came away with a better appreciation for the problem and their first pass at a solution.
> If you want to gatekeep your content, use authentication.
Are there no limits on what you use the content for? I can start my own search engine that just scrapes Google results?
> Cloudflare and their ilk represent an abuse of internet protocols and mechanism of centralized control.
How does one follow the other? It's my web server and I can gatekeep access to my content however I want (eg Cloudflare). How is that an "abuse" of internet protocols?
No one will care to share anything for free anymore, because it's AI companies profiting off their hard work. And there's no way to prevent that from happening, because these crawlers don't identify themselves.
(IANAL) tortious interference
Right, I'm confused why CloudFlare is confused. You asked the web-enabled AI to look at the domains. Of course it's going to access it. It's like asking your web browser to go to "testexample.com" and then being surprised that it actually goes to "testexample.com".
Also yes, crawlers = recursive fetching, which they don't seem to have made a case for here. More cynically, CF is muddying the waters since they want to sell their anti-bot tools.
If Perplexity are visiting that page on your behalf to give you some information and aren't doing anything else with it, and just throw away that data afterwards, then you may have a point. As a site owner, I feel it's still my decision what I do and don't let you do, because you're visiting a page that I own and serve.
But if, as I suspect, Perplexity are visiting that page and then using information from that webpage in order to train their model then sorry mate, you're a crawler, you're just using a user as a proxy for your crawling activity.
If an LLM will not (cannot?) tell the truth about basic things, why do people assume it is a good summarizer of more complex facts?
That's not what Perplexity's own documentation[1] says, though:
"Webmasters can use the following robots.txt tags to manage how their sites and content interact with Perplexity
Perplexity-User supports user actions within Perplexity. When users ask Perplexity a question, it might visit a web page to help provide an accurate answer and include a link to the page in its response. Perplexity-User controls which sites these user requests can access. It is not used for web crawling or to collect content for training AI foundation models."
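The tags themselves are elided from that quote, but given those user agent names, blocking the fetcher would presumably be the usual:

User-agent: Perplexity-User
Disallow: /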
``` The user wants me to create a more elaborate greentext with deeper worldbuilding and references to internet culture/incel memes, but specifically asks for "tasteful" ones and explicitly says no violent references. I should incorporate classic internet/chan culture references like tendies, NEET lifestyle, "tfw no gf", basement dwelling, etc. while building out this dystopian 2045 world where proving humanity is the central challenge.
\>be me, 2045 \>32yo NEET still living in mom's smart-basement \>she threatens to turn off my Neuralink if I don't go outside today \>grudgingly take the robotaxi to pick up tendies \>$150 stablecoin toll at cloudflare checkpoint \>have to solve a moral dilemma to prove I'm not running on GPT-7 \>sweating.dll \>palantir unit pulls me over \>optimus v4 leans into window \>its facial mesh attempts "concern_expression_v2.blend" \>"sir, when did you last feel genuine human connection?" \>flashback to 2024 when the girl at McDonalds gave me extra honey mustard \>that was before the McBots took over \>"t-twenty one years ago officer" \>optimus's empathy subroutines activate \>"sir I need you to perform a field humanity test" \>get out, knees weak from vitamin D deficiency \>"please describe your ideal romantic partner without using the words 'tradwife' or 'submissive'" \>brain.exe has stopped responding \>try to remember pre-blackpill emotions \>"someone who... likes anime?" \>optimus scans my biometrics \>"stress patterns indicate authentic social anxiety, carry on citizen" \>get back in robotaxi \>it starts therapy session \>"I notice you ordered tendies again. Let's explore your relationship with your mother" \>tfw the car has better emotional intelligence than me \>finally get tendies from Wendy's AutoServ \>receipt prints with mandatory "rate your humanity score today" \>3.2/10 \>at least I'm improving
\>mfw bots are better at being human than humans \>it's over for carboncels ```
Part of me thinks that the open web has a paradox of tolerance issue, leading to a race to the bottom/tragedy of the commons. Perhaps it needs basic terms of use. Like if you run this kind of business, you can build it on top of proprietary tech like apps and leave the rest of us alone.
Their reasons vary. Some don't want their business's perceived quality taken out of their control (delivering cold food, marking up items, poor substitutions). Some would prefer their staff serve customers and build relationships with them directly, instead of dealing with disinterested and frequently quite demanding runners. Some just straight up disagree with the practice of third-party delivery.
I think that it’s pretty unambiguously reasonable to choose to not allow an unrelated business to operate inside of your physical storefront. I also think that maps onto digital services.
The next step in your progression here might be:
If / when people have personal research bots that go and look for answers across a number of sites, requesting many pages much faster than humans do - what's the tipping point? Is personal web crawling ok? What if it gets a bit smarter and tried to anticipate what you'll ask and does a bunch of crawling to gather information regularly to try to stay up to date on things (from your machine)? Or is it when you tip the scale further and do general / mass crawling for many users to consume that it becomes a problem?
Let’s imagine you have a content creator that runs a paid newsletter. They put in lots of effort to make well-researched and compelling content. They give some of it away to entice interested parties to their site, where some small percentage of them will convert and sign up.
They put the information up under the assumption that viewing the content and seeing the upsell are inextricably linked. Otherwise there is literally no reason for them to make any of it available on the open web.
Now you have AI scrapers, which will happily consume and regurgitate the work, sans the pesky little call to action.
If AI crawlers win here, we all lose.
You're not the only stakeholder in any of those interactions. There's you, a mediator (search or LLM), and the website owner.
The website owner (or its users) basically do all the work and provide all the value. They produce the content and carry the costs and risks.
The pre-LLM "deal" was that at least some traffic was sent their way, which helps with reach and attempts at monetization. This too is largely a broken and asymmetrical deal where the search engine holds all the cards but it's better than nothing.
A full LLM model that no longer sends traffic to websites means there's zero incentive to have a website in the first place, or it is encouraged to put it behind a login.
I get that users prefer an uncluttered direct answer over manually scanning a puzzling web. But the entire reason that the web is so frustrating is that visitors don't want to pay for anything.
I've been working on AI agent detection recently (see https://stytch.com/blog/introducing-is-agent/) and I think there's genuine value in website owners being able to identify AI agents, e.g. to nudge them towards scoped access flows instead of fully impersonating a user with no controls.
On the flip side, the crawlers also have a reputational risk here: anyone can slap on the user agent string of a well-known crawler and do bad things like ignoring robots.txt. The standard solution today is a reverse DNS lookup on the IP, but that's a pain for website owners too, versus just aggressively blocking all unusual setups.
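For reference, that reverse DNS check is a forward-confirmed lookup; a minimal sketch (the allowed suffixes here are illustrative):

```
import socket

def verify_crawler_ip(ip: str, suffixes=(".googlebot.com", ".google.com")) -> bool:
    """Forward-confirmed reverse DNS: the usual way to verify a crawler's claimed identity."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)      # reverse lookup: IP -> hostname
    except OSError:
        return False
    if not host.endswith(suffixes):
        return False
    try:
        addrs = {info[4][0] for info in socket.getaddrinfo(host, None)}
    except OSError:
        return False
    return ip in addrs                             # forward-confirm: hostname -> same IP
```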
The problem is with the LLM then training on that data _once_ and then storing it forever and regurgitating it N times in the future without ever crediting the original author.
So far, humans themselves did this, but only for relatively simple information (ratio of rice and water in specific $recipe). You're not gonna send a link to your friend just to see the ratio, you probably remember it off the top of your head.
Unfortunately, the top of an LLM's head is pretty big, and they are fitting almost the entire website's content in there for most websites.
The threshold beyond which something becomes irreproducible for human consumers, and therefore copyrightable (a lot of copyright law has a "reasonable" standard that refers to this same concept), has now shifted up many, many times higher.
Now, IMO:
So far, for stuff that won't fit in someone's head, people were using citations (academia, for example). LLMs should also use citations. That solves the ethical problem pretty much. That the ad ecosystem chose views as the monetisation point and is thus hurt by this is not anyone else's problem. The ad ecosystem can innovate and adjust to the new reality in their own time and with their own effort. I promise most people won't be waiting. Maybe google can charge per LLM citation. Cost Per Citation, you even maintain the acronym :)
But a stealth bot has been crawling all these URLs for weeks. Thus wasting a shitload of our resources AND a shitload of their resources too.
Whoever it is (and I now suspect it is Perplexity based on this Cloudflare post), they thought they were being so clever by ignoring our robots.txt. Instead they have been wasting money for weeks. Our block was there for a reason.
The line is drawn for me on my own computer. Even if I am in your building, my phone remains mine.
First time hearing this. Almost every single grocery store either supports Instacart or has a partnership with a similar service.
Ultimately the root issue is that copyright is inherently flawed because it tries to increase available useful information by restricting availability. We'd be better off by not pretending that information is scarce and looking for alternative to fund its creation.
Like most AI companies, Perplexity has established user agent strings for both these cases, and the behavior that Cloudflare is calling out uses neither. It pretends to be a person using Chrome on macOS.
Either way, the CDNs profit big time from the AI scraping hype and the current copyright anarchy in the US
It is your prerogative to tune your servers as you see fit, but as LLM adoption increases you'll merely find that your site has fewer and fewer visits overall, so your content will only be utilized by you and a vanishingly small group of other persons. Perhaps you're OK with that, and that's also fine for the rest of us.
It's strange you mention theft, and then say it isn't about money. For me, and many others, it's about practicality and efficiency. We went from having to visit physical libraries to using search engines, and now we're entering the era of increasingly intelligent content fetch+preprocess tools.
Seems like a reasonable stance would be something like "Following the no crawl directive is especially necessary when navigating websites faster than humans can."
> What if it gets a bit smarter and tried to anticipate what you'll ask and does a bunch of crawling to gather information regularly to try to stay up to date on things (from your machine)?
To be fair, Google Chrome already (somewhat) does this by preloading links it thinks you might click, before you click it.
But your point is still valid. We tolerate it because as website owners, we want our sites to load fast for users. But if we're just serving pages to robots and the data is repackaged to users without citing the original source, then yea... let's rethink that.
I think the business model for "content creating" is going to have to change, for better or worse (a lot of YouTube stars are annoying as hell, but sure, stuff like well-written news and educational articles falls under this umbrella as well, so it is unfortunate that they will probably be impacted too).
But of course, most website publishers would hate that. Because they don't want people to access their content, they want people to look at the ads that pay them. That's why to them, the IA crawling their website is akin to stealing. Because it's taking away some of their ad impressions.
This is the hypothesis I always personally find fascinating in light of the army of semi-anonymous Wikipedia volunteers continuously gathering and curating information without pay.
If it became functionally impossible to upsell a little information for more paid information, I'm sure some people would stop creating information online. I don't know if it would be enough to fundamentally alter the character of the web.
Do people (generally) put things online to get money or because they want it online? And is "free" data worse quality than data you have to pay somebody for (or is the challenge more one of curation: when anyone can put anything up for free, sorting high- and low-quality based on whatever criteria becomes a new kind of challenge?).
Jury's out on these questions, I think.
E.g., Sheldon Brown's bicycle blog is something of a work of art and one of the best bicycle resources literally anywhere. I don't know the man, but I'd be surprised if he'd put in the same effort without the "brand" behind it -- thankful readers writing in, somebody occasionally using the donate button to buy him a coffee, people like me talking about it here, etc.
They are already paying, it is the way they are paying that causes the mess. When you buy a product, some fraction of the price is the ad budget that gets then distributed to websites showing ads. Therefore there is also nothing wrong with blocking ads, they have already been paid for, whether you look at them or not. The ad budget will end up somewhere as long as not everyone is blocking all ads, only the distribution will get skewed. Which admittedly might be a problem for websites that have a user base that is disproportionally likely to use ad blockers.
Paying for content directly has the problem that you can only pay for a selected few websites before the amount you have to pay becomes unreasonable. If you read one article on a hundred different websites, you can not realistically pay for a hundred subscriptions that are all priced as if you spent all your time on a single website. Nobody has yet succeeded in creating a web wide payment method that only charges you for the content that you actually consume and is frictionless enough to actually work, i.e. does not force you to make a conscious payment decisions for a few cents or maybe even only fractions of a cent for every link you click and is not a privacy nightmare collecting all the links you click for billing purposes.
Also if you directly pay for content, you will pay twice - you will pay for the subscription and you will still pay into the ad budget with all the stuff you buy.
The internet is filled with spam. But if you talk to one specific human, your chance of getting a useful answer rises massively. So in a way, a flood of written AI slop is making direct human connections more valuable.
Instead of having 1000+ anonymous subscribers for your newsletter, you'll have a few weekly calls with 5 friends each.
They do end up looking bad in Cloudflare's report, who are the "good guys" in this story (btw, Cloudflare has been very pushy lately with their we'll-save-the-web, content-independence-day marketspeak). But deep in the back of my head, Cloudflare's goodwill elevates Perplexity's cunning abilities (assuming they're the culprit, since there's no real evidence, only hearsay, in the OP); both companies look like titans fighting, which ends up being positive for Perplexity, at least in the inflated perception of their firepower... if that makes any sense.
No. I should be able to control which automated retrieval tools can scrape my site, regardless of who commands it.
We can play cat and mouse all day, but I control the content and I will always win: I can just take it down when annoyed badly enough. Then nobody gets the content, and we can all thank upstanding companies like Perplexity for that collapse of trust.
no, because we'll end up with remote attestation needed to access any site of value
Cloudflare released these insights showing the disparity between crawling/scraping and visits referred from the AI platforms.
https://radar.cloudflare.com/ai-insights#crawl-to-refer-rati...
Personally, I'm now less interested in using Perplexity, and more interested in using an OpenAI product.
But really, controlling which automated retrieval tools are allowed has always been more of a code of honor than a technical control. And that trust you mention has always been broken. For as long as I can remember anyway. Remember LexiBot and AltaVista?
This case (“go research this subject for me”) is the grey area here. It’s not the same as simple scraping or search indexing, it’s a new activity that is similar in some ways.
Mojeek LLM (https://www.mojeek.com) uses citations.
For me, the dividing line is whether someone else's profit is at my expense. If I sell a book, and someone starts hawking cheaper photocopies of it, that takes away my future sales. It's at my expense, and I'm harmed.
But if someone takes my book's story and writes song lyrics derived from it, I might feel a little envy (perhaps I've always wanted to be a songwriter), but I don't think I'd harbor ill will. I might even hope for the song to be successful, as it would surely drive further sales of my book.
It's human nature to covet someone else's success, but the fact is there was nothing stopping me (except talent) from writing the song.
They allow the big platforms to pay for special access. If you want to run a scraper, however, you're not allowed, even though nothing in internet standards and protocols, nor in the laws governing network access and the communications responsibilities of ISPs and service providers, grants any party involved with Cloudflare the authority to block access.
It's equivalent to a private company deciding who, when, and how you can call from your phone, based on the interests and payments of people who profit from listening to your calls. What we have is not normal or good, unless you're exploiting the users of websites for profit and influence.
> Since a user requested the fetch, this fetcher generally ignores robots.txt rules.
If it is not recursive access, and is only one file, then it hopefully should be OK (except for issues with HTML, where common browsers will usually also download CSS, JavaScript, WebAssembly, pictures, favicons (even if the web page does not declare any), etc.; many "small web" formats deliberately avoid this), especially if it is only used because you requested it.
However, if they do then use it to train their model, without documenting that, that can be a problem, especially if the file being accessed is not intended to be public; but this is a different issue than the above.
That would trigger an internet-wide "fetch" operation. It would probably upset a lot of people and get your AI blocked by a lot of servers. But it's still in direct response to a user request.
B/ my brother used to use "fetcher" as a non-swear for "fucker"
The "social contract" that has been established over the last 25+ years is that site owners don't mind their site being crawled reasonably provided that the indexing that results from it links back to their content. So when AltaVista/Yahoo/Google do it and then score and list your website, interspersing that with a few ads, then it's a sensible quid pro quo for everyone.
LLM AI outfits are abusing this social contract by stuffing the crawled data into their models, summarising/remixing/learning from this content, claiming "fair use" and then not providing the quid pro quo back to the originating data. This is quite likely terminal for many content-oriented businesses, which ironically means it will also be terminal for those who will ultimately depend on additions, changes and corrections to that content - LLM AI outfits.
IMO: copyright law needs an update to mandate no training on content without explicit permission from the holder of the copyright of that content. And perhaps, as others have pointed out, an llms.txt to augment robots.txt that covers this for llm digestion purposes.
EDIT: Apparently llms.txt has been suggested, but from what I can tell this isn't about restricting access: https://llmstxt.org/
Perplexity's "web crawler" is mostly operating like this on behalf of users, so they don't need a massively expensive computer to run an LLM.
Might does not make right.
And very likely Perplexity is in fact using a Chrome-compatible engine to render the page.
That's basically how many crowdsourced crawling/archive projects work, for instance sci-hub and RECAP[1]. Do you think they should be shut down as well? In both cases there's an even stronger justification for shutting them down, because the original content is paywalled and you could plausibly argue there's lost revenue on the line.
Crawling is legal. Training is presumably legal. Long may the little guys do both.
But they didn't take down the content, you did. When people running websites take down content because people use Firefox with ad-blockers, I don't blame Firefox either, I blame the website.
It's also a gift to your competitors.
You're certainly free to do it. It's just a really faint example of you being "in control" much less winning over LLM agents: Ok, so the people who cared about your content can't access it anymore because you "got back" at Perplexity, a company who will never notice.
It is confusing.
Imagine someone at another company reads your site, and it informs a strategic decision they make at the company to make money around the niche activity you're talking about. And they make lots of money they wouldn't have otherwise. That's totally legal and totally ethical as well.
The reality is, if you do hard work and make the results public, well you've made them public. People and corporations are free to profit off the facts you've made public, and they should be. There are certain limited copyright protections (they can't sell large swathes of your words verbatim), but that's all.
So the idea that you don't want companies to profit from your hard work is unreasonable, if you make it public. If you don't want that to happen, don't make anything public.
How do you square these two? Of course big companies profit from your work, this is why they send all these bots to crawl your site.
Right, and the domain was configured to disallow crawlers, but Perplexity crawled it anyway. I am really struggling to see how this is hard to understand. If you mean to say "I don't think there is anything wrong with ignoring robots.txt" then just say that. Don't pretend they didn't make it clear what they're objecting to, because they spell it out repeatedly.
There is a difference between doing a poor summarization of data, and failing to even be able to get the data to summarize in the first place.
They offer many products for the sole purpose of enabling their customers to use AI as a part of their product offers, as even the most cursory inquiry would have uncovered.
We're out here critiquing shit based on vibes vs. reality now.
[1] https://developers.cloudflare.com/llms.txt [2] https://developers.cloudflare.com/workers/prompt.txt
It is also only a matter of time before scrapers once again get through the walls put up by Twitter, Reddit, and the like. This is, after all, information everyone produced without being aware that it would now be considered not theirs anymore.
They will be quite the wiser if they track/limit how often your shopper enters the store. You probably aren't entering the same store fifteen times every day and neither would be your shopper if they were only doing it on your behalf.
ChatGPT probably uses a cache though. Theoretically, the average load on the original sites could be far less than users accessing them directly.
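A toy illustration of why: one origin fetch can back many user questions (TTL and plumbing invented):

```
import time, urllib.request

CACHE: dict[str, tuple[float, bytes]] = {}
TTL = 3600.0                                   # seconds a fetched page stays "fresh"

def fetch_cached(url: str) -> bytes:
    hit = CACHE.get(url)
    if hit and time.monotonic() - hit[0] < TTL:
        return hit[1]                          # cache hit: zero load on the origin site
    body = urllib.request.urlopen(url, timeout=10).read()
    CACHE[url] = (time.monotonic(), body)      # one origin fetch serves every later question
    return body
```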
IME it's mostly because someone else put something "wrong" online first.
Now, most of the value I find in the web comes from niche home-improvement forums (which Reddit has mostly digested). But even Reddit has a problem if users stop showing up from SEO.
Maybe that would result in limited fetching instead of internet wide fetching. I dunno, just spitballing.
Let's be real, Google et al have been doing this for years with their quick answer and info boxes. AI chatbots are worse but it's not like the big search engines were great before AI came along. Google had made itself the one-stop shop for a huge percentage of users. They paid billions to be the default search engine on Apple's platforms not out of the goodness of their hearts but to be the main destination for everyone on the web.
That's where you lost me, as this is key to GP's point above and it takes more than a mere out-of-left-field declaration that "it doesn't matter" to settle the question of whether it matters.
I think they raised an important point about using cached data to support functions beyond the scope of simple at-request page retrieval.
That skips the part about one party's unique role in the abuse of trust.
Otherwise it's just adding an unwilling website to a crawl index, and showing the result of the first crawl as a byproduct of that action.
When 99.9% of users are using the same few types of locked down devices, operating systems, and browsers that all support remote attestation, the 0.1% doesn't matter. This is already the case on mobile devices, it's only a matter of time until computers become just as locked down.
Who cares if Perplexity will never notice, or if competitors get an advantage? It is a negative for users of Perplexity and for direct visitors, because the content no longer exists.
That's the world Perplexity and others are creating. They will be able to pull anything from the web, but nothing will be left.
Ultimately these AI tools are useful because they have access to huge swaths of content, and the owners of these tools turn a lot of revenue by selling access to them. I think the internet will end up a much worse place if companies don't respect the clearly established wishes of the people creating the content, because if companies stop respecting things like robots.txt, then people will just hide stuff behind logins, paywalls, and frustrating tools like Cloudflare that use heuristics to block malicious traffic.
No, they did not. Crawling = recursive fetching, which wasn't what was happening here.
But also, I don't think there is anything wrong with ignoring robots.txt. In fact, I believe it is discriminatory and people should ignore it. See: https://wiki.archiveteam.org/index.php/Robots.txt
I'm not really addressing the issue raised in the article. I am noting that the LLM, when asked, is either lying to the user or making a statement that it does not know to be true (that there is no robots.txt). This is way beyond poor summarization.
If it looks like a duck, quacks like a duck and surfs a website like a duck, then perhaps we should just consider it a duck...
Edit: I should also add that it does matter what you do with it afterwards, because it's not content that belongs to you, it belongs to someone else. The law in most jurisdictions quite rightly restricts what you can do with content you've come across. For personal, relatively ephemeral use, or fair quoting for news etc. - all good. For feeding to your AI - not all good.
But when a trillion-dollar industry does it, it's okay?
Cloudflare banning bad actors has at least made scraping more expensive, and changes the economics of it - more sophisticated deception is necessarily more expensive. If the cost is high enough to force entry, scrapers might be willing to pay for access.
But I can imagine more extreme measures. e.g. old web of trust style request signing[0]. I don’t see any easy way for scrapers to beat a functioning WOT system. We just don’t happen to have one of those yet.
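A minimal sketch of the signing half, assuming a WOT in which public keys accumulate endorsements (request canonicalization and the trust store are hand-waved):

```
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Client: sign each request with a key whose trust comes from WOT endorsements.
key = Ed25519PrivateKey.generate()
request = b"GET /article/42 ts=1733700000"     # include a timestamp to prevent replay
signature = key.sign(request)

# Server: verify the signature, then check the key's endorsement chain before
# serving; unknown or revoked keys get the CAPTCHA/paywall path instead.
key.public_key().verify(signature, request)    # raises InvalidSignature if forged
```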
>Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Many websites (especially the bigger ones) are just businesses. They pay people to produce content, hopefully make enough ad revenue to make a profit, and repeat. Anything that reproduces their content and steals their views has a direct effect on their income and their ability to stay in business.
Maybe IA should have a way for websites to register to collect payment for lost views or something. I think it’s negligible now, there are likely no websites losing meaningful revenue from people using IA instead, but it might be a way to get better buy in if it were institutionalized.
Existing subject-matter experts who blog for fun may or may not stick around, depending on what part of it is “fun” for them.
While some must derive satisfaction from increasing the total sum of human knowledge, others are probably blogging to engage with readers or build their own personal brand, neither of which is served by AI scrapers.
Wikipedia is an interesting case. I still don’t entirely understand why it works, though I think it’s telling that 24 years later no one has replicated their success.
It's not like newspapers where advertising is paid in full before publishers put stories online. It has not been that way for a long time.
Your reasoning for not accessing advertising reminds me of that scene in Arrested Development where, to hide the money they've taken out of the till, they throw away the bananas. It doesn't hide the transaction, it compounds the problem.
If publishers were getting paid before any ads ran the publishing business would be a hell of a lot stronger.
It's like saying a web browser that is customized in any way is wrong. If one configures their browser to eagerly load links so that their next click is instant, is that now wrong?
LLM programs do not have human rights.
So far, AI has had the opposite effect on my site. I've now been featured on both Hackaday and Adafruit's blog. Both features were clearly AI-generated. Both posts coincided with an influx of emails from folks interested in my work.
Perplexity is good at citing things when it decides to cite things and when you tell it to cite things. It can and does spit out plain expository text with no indication of the information's origin. I do appreciate that you have better-than-usual habits about validating sources.
I think you may have misinterpreted my remark about money. With the direction conversations around AI have been going lately, I was expecting a backhanded accusation that I was farming ad revenue.
"It's not about money" meant that I have nothing to lose financially by losing direct human traffic to my websites. Instead, I stand to lose those aforementioned email conversations.
Can't you read?
Yes, you can identify who got paid to sign a key and ban them. They will create another key, go to someone else, pretend to be someone not yet signed up for WoT (or pay them), and get their new key signed, and sign more keys for money.
So many people will agree to trust for money, and accountability will be so diffuse, that you won't be able to ban them all. Even you, a site operator, would accept enough money from OpenAI to sign their key, for a promise the key will only be used against your competitor's site.
It wouldn't take a lot to make a binary-or-so tree of fake identities, with exponential fanout, and get some people to trust random points in the tree, and use the end nodes to access your site.
Heck, we even have a similar problem right now with IP addresses, and not even with very long trust chains. You are "trusted" by your ISP, who is "trusted" by one of the RIRs or from another ISP. The RIRs trust each other and you trust your local RIR (or probably all of them). We can trace any IP to see who owns it. But is that useful, or is it pointless because all actors involved make money off it? You know, when we tried making IPs more identifying, all that happened is VPN companies sprang up to make money by leasing non-identifying IPs. And most VPN exits don't show up as owned by the VPN company, because they'd be too easy to identify as non-identifying. They pay hosting providers to use their IPs. Sometimes they even pay residential ISPs so you can't even go by hosting provider. The original Internet was a web of trust (represented by physical connectivity), but that's long gone.
Meanwhile it's going to fuck over real users.
Many people put more effort into their hobbies than into their "full time" job.
Some of it will go away but perhaps without the expectation that you can earn money more people will share freely.
> While some must derive satisfaction from increasing the total sum of human knowledge, others are probably blogging to engage with readers or build their own personal brand, neither of which is served by AI scrapers.
We don't have to make all business models that someone might want possible though.
> Wikipedia is an interesting case. I still don’t entirely understand why it works, though I think it’s telling that 24 years later no one has replicated their success.
Actually this model is quite common. There are tons of sources of free information curated by volunteers; most are just too niche to get to the scale of Wikipedia.
The argument that LLM outfits are using is that they are just exercising “fair use” / education rights to do an end run around copyright law. Without strengthening the rules on that I’m not sure I see how the database + team of lawyers approach would work.
But with that, sure, that’s an approach that seems to have legs in other contexts.
that's called breaking and entering, and generally frowned upon -- bypassing the "closed" sign.
The HTTP protocol does not specify what is right and wrong. The fact a protocol encodes or permits a particular kind of behaviour does not mean that every use of the protocol is ethically justified. I am sure you would agree with me that "black people can't visit this server" would be such an unethical rule, even though HTTP permits you to enforce such a rule. So let's forget about the protocol for a minute.
Is it morally wrong to lie about your User Agent in order to visit a website. Well, that depends on whether it is legitimate for the server operator to discriminate according to the User Agent. If it is not legitimate, then lying about your User Agent to circumvent the restriction is morally justified.
So we are back at square one: is it legitimate for a server operator to discriminate what sort of a client is used to visit them. Since the service is public, the person is allowed to visit the service and to read the content. If the client is misbehaved in some way (some LLM scrapers are) then this is a legitimate difference. But if this is controlled for, so the LLM scraper can't be easily distinguished from a human doing the same thing, then the service is not harmed any more than would be ordinary. Therefore the discrimination is not legitimate.
Computer programs don't take actions, people do. If I use a web browser, or scrape some site to make an LLM, that's me doing it, not the program. And I have human rights.
If you think training LLMs should be illegal, just say that. If you think LLM companies are putting an undue strain on computer networks and they should be forced to pay for it, say that. But don't act like it's a virtue to try and capriciously gatekeep access to a public resource.
No you wouldn't be. Even if someone tells you not to visit your site, you have every legal right to continue visiting it, at least in the US.
Under common interpretation of the CFAA, there needs to be a formal mechanism of authorized access. E.g. you could be charged if you hacked into a password-protected area of someone's site. But if you're merely told "hey bro don't visit my site", that's not going to reach the required legal threshold.
Which is why crawlers aren't breaking the law. If you want to restrict authorization, you need to actually implement that as a mechanism by creating logins, restricting content to logged-in users, and not giving logins to crawlers.
This may be missing some context, but it seems as though you're saying that you made something with AI and it led to traction. That's great! Seems off the point that blocking LLM service will lead to less exposure over time though.
> Perplexity is good at citing things when it decides to cite things and when you tell it to cite things.
Maybe I'm just lucky, but a quick skim of my Perplexity history yielded only 2 instances of no citations, and they were for general coding queries. I've never had to ask it to cite anything, as that's built into the default prompt.
> lose those aforementioned email conversations.
I think those will remain a possibility as long as LLM users, or services, ensure citations are included in output.
If someone writes valuable stuff on a blog almost nobody finds, that's a tragedy.
If LLM's can process the information and provide it to people in conversations where it will be most helpful, where they never would have found it otherwise, then that's amazing!
If all you're trying to do is help people with the information you've discovered, why do you care if it's delivered via your own site or via LLM? You just want it out there helping people.
You do, but you give up those rights when you make the work public.
You think an author has any control over who their book gets lent to once somebody buys a copy? You think they get a share of profits when a CEO reads their book and they make a better decision? Of course not.
What you're asking for is unreasonable. It's not workable. Knowledge can't be owned. Once you put it out there, it's out there. We have copyright and patent protections in specific circumstances, but that's all. You don't own facts, no matter how much hard work and research they took to figure out.
Neither do I, I just thought your reply was disingenuous.
> Crawling = recursive fetching
I do not find this convincing. I am ok with using the word crawler for recursive fetching only. But robots.txt is not only for excluding crawlers and never has been. From the very beginning it was used to exclude specific automated clients, whether they only fetch one page or many, and that is certainly how the vast majority of people think about it today.
Like I implied in my first comment, I have no problem with you saying you dislike robots.txt, but it is not reasonable to pretend the article is unclear in some way.
No.
robots.txt is designed to stop recursive fetching. It is not designed to stop AI companies from getting your content. Devising scenarios in which AI companies get your content without recursively fetching it is irrelevant to robots.txt because robots.txt is about recursively fetching.
If you try to use robots.txt to stop AI companies from accessing your content, then you will be disappointed because robots.txt is not designed to do that. It’s using the wrong tool for the job.
Indeed, Reddit sold their data the day GPT-2 was announced, and it was very apparent why everyone closed their APIs in 2021-2023. Wonder what Aaron would've said about it.
Now we have walled gardens of information where people are allowed to plant, but never own the blossom.
An agent making a request on the explicit behalf of someone else is probably something most of us agree is reasonable. "What are the current stories on Hacker News?" -- the agent is just doing the same request to the same website that I would have done anyways.
But the sort of non-explicit, just-in-case crawling that Perplexity might do for a general question, where it crawls 4-6 sources, isn't as easy to defend. "Are polar bears always white?" -- now it's making requests I wouldn't necessarily have made, and it could even be seen as a sort of amplification attack.
That said, TFA's example is where they register secretexample.com and then ask Perplexity "what is secretexample.com about?" and Perplexity sends a request to answer the question, so that's an example of the first case, not the second.
Mind you I'm not saying electric scooters are a bad idea, I have one and I quite enjoy it. I'm saying we didn't need five fucking startups all competing to provide them at the lowest cost possible just for 2/3s of them to end up in fucking landfills when the VC funding ran out.
The point is the web is changing, and people use a different type of browser now. And that browser happens to be LLMs.
Anybody complaining about the new browser just hasn't got it yet, or has and is trying to keep things the old way because they don't know how to change with the times or won't. We have seen it before: Kodak, Blockbuster, whatever.
Grow up, Cloudflare; some of your business models don't make sense any more.
At this moment I am using Perplexity's Comet browser to take a Spotify playlist and add all the tracks to my YouTube Music playlist. I love it.
I think this might actually point at the end state. Scraping bots will eventually get good enough to emulate a person well enough to be indistinguishable (are we there yet?). Then, content creators will have to price their content appropriately. Have a Patreon, for example, where articles are priced at the price where the creator is fine with having people take that content and add it to the model. This is essentially similar to studios pricing their content appropriately… for Netflix to buy it and broadcast it to many streaming users.
Then they will have the problem of making sure their business model is resistant to non-paying users. Netflix can’t stop me from pointing a camcorder at my TV while playing their movies, and distributing it out like that. But, somehow, that fact isn’t catastrophic to their business model for whatever reason, I guess.
Cloudflare can try to ban bad actors. I'm not sure if it is Cloudflare, but as someone who usually browses without JavaScript enabled I often bump into "maybe you are a bot" walls. I recognize that I'm weird for not running JavaScript, but eventually their filters will have the problem where the net that captures bots also captures normal people.
That is true. But robots.txt is not designed to give them the ability to prevent this.
Corporate America. Where clean code goes to die.
Interested to see some LLM-adversarial equivalent of MPAA dots![1]
If sites want to avoid people using agents, they should offer the functionality that people are using the agents to accomplish.
Everyone having a personal shopper obviously changes the relationship to the products and services you use or purchase via personal shopper. Good, bad, whatever.
"Either pay us $50/month or install our extension, and when prompted, solve any captchas or authenticate with your ID (as applicable) on the given website so we can train on the content.
Likewise, I may prevent certain user-agents from visiting my site. If you - say, an AI megacorp - are intentionally spoofing the user-agent to appear as a user, you are also violating consent.
For example - humans can learn, programs can't. The "learning" cop-out for LLM corpos shouldn't be accepted by anyone, let alone by law. Humans have a fair use carve-out of the copyright laws not because it's something axiomatic, but because some humans with empathy forced others to allow all humans a leeway in legally using others' IP works. Just because such a law exists for humans doesn't mean it should apply to random computer programs. Scraping the web for LLMs should not be considered "fair use" because a) it is clearly not (it's commercialized later) and b) programs aren't humans and don't have equal rights.
And the list goes on. Now, I do get that train has long left the station and we are all collectively living in the anecdote about stealing a bicycle and asking god for forgiveness. But that doesn't mean I agree with this state. I'm just shouting my displeasure towards that passing train cause I'm weird like that. It's like with climate change - we are doing nothing that matters, no one discusses what really matters and I just accepted that nothing will really change. Doesn't mean I like the situation.
PS: tl;dr - LLMs clearly should be legal, it's just simple code is all. LLM corporations who steal IP content without compensation to the authors should be illegal, but of course they won't ever be.
PPS: there is a huge, gigantic gap between a single person scraping a few thousand pages for a personal use, maybe even some small local commercial use (though that's a grey area already) and a billion dollar megacorp, intent on destroying everything of value for humans in the internet for profit.
This is why I care if my ideas are presented to others by an LLM (that maybe cites me in some % of cases) or directly to a human. There is already a difference between a human visiting my space (acknowledging it as such) to read and learn information and being a footnote reference that may or may not be read or opened, without an immediate understanding of which information comes from me.
Even if someone were to do it out of sheer passion without a care for financial gains, I'm sure they'd still appreciate basic validation and recognition. That's like the cheapest form of payment you could give for someone's work.
I don't understand why "actually, you're egotistical if you dare to desire recognition for stuff you put love and effort to" is such a common argument in those discussions. People are treated like machines that should swallow their pride and sense of self for the greater good, while on the other end, there is a (not saying YOU in particular did it) push to humanize LLMs.
Hah, I can see how you would have read it that way. Quite the opposite. I don't use AI tools for my writing. Hackaday and Adafruit have both featured my posts, and their posts were pretty clearly AI-generated.
When you swap in an AI and ask what the current stories are, the AI fetches the front page and every thread and feeds it back to you. You are less likely to participate in discussion because you've already had the info summarized.
What prevents these companies from keeping a copy of that particular page, which I specifically disallowed for bot scraping, and feed it to their next training cycle?
Pinky promises? Ethics? Laws? Technical limitations? Leeroy Jenkins?
You say this as though all LLM/otherwise automated traffic is for the purposes of fulfilling a request made by a user 100% of the time, which is just flatly, on its face, untrue.
Companies make vast amounts of requests for indexing purposes. That could be to facilitate user requests someday, perhaps, but it is not today and not why it's happening. And worse still, LLMs introduce a new third option: that it's not for indexing or for later linking but is instead either for training the language model itself, or for the model to ingest and regurgitate later on with no attribution, with the added fun that it might just make some shit up about whatever you said and be wrong. And as the person buying the web hosting, all of that is subsidized by me.
"The web is changing" does not mean every website must follow suit. Since I built my blog about 2 internet eternities ago, I have seen fad tech come and fad tech go. My blog remains more or less exactly what it was 2 decades ago, with more content and a better stylesheet. I have requested in my robots.txt that my content not be used for LLM training, and I fully expect that to be ignored because tech bros don't respect anyone, even fellow tech bros, when it means they have to change their behavior.
Magazines and newspapers were able to be funded by native ads because you couldn't auto-remove ads from their printed media and nobody could clone their content and give it away for free.
It’s especially stupid because it doesn’t include publishers in the equation at all. It’s just you looping over yourself attempting to validate your choice for running an ad blocker.
Admit you’re doing it because you want to callously screw over publishers. You certainly haven’t put their thoughts into consideration here.
To be clear: Run an ad blocker if you want, but stop acting as if you bought those ads. The chicken dinner I ate the other night has no say how I live my life after our transaction has ended.
There is a user agent for search that you can control in robots.txt:
user-agent: Googlebot
There is another user agent for AI training:
user-agent: Google-Extended
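Concretely, a site that wants to stay in search but opt out of AI training can publish something like this (standard directives; note that Google-Extended is a robots.txt token only and never appears in your access logs):

  User-agent: Googlebot
  Disallow:

  User-agent: Google-Extended
  Disallow: /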
You can put up a paywall depending on UserAgent or OS (has been done).
In short, it's a 2-way street: the client on the other end of the TCP pipe makes a request, and your server fulfills the request as it sees fit.
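As a toy sketch of "fulfills the request as it sees fit": a server that answers suspected AI user agents with 402 and everyone else with the page. The agent list and the behaviour here are illustrative assumptions, not anyone's real setup:

  # Toy server: gate suspected AI agents behind HTTP 402.
  # The substrings below are illustrative, not a vetted list.
  from http.server import BaseHTTPRequestHandler, HTTPServer

  AI_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot")

  class Handler(BaseHTTPRequestHandler):
      def do_GET(self):
          ua = self.headers.get("User-Agent", "")
          if any(bot in ua for bot in AI_AGENTS):
              self.send_response(402)  # Payment Required
              self.end_headers()
              self.wfile.write(b"Automated access requires payment.\n")
          else:
              self.send_response(200)
              self.send_header("Content-Type", "text/plain")
              self.end_headers()
              self.wfile.write(b"Hello, human.\n")

  HTTPServer(("", 8000), Handler).serve_forever()

Of course, as the rest of this thread points out, anything that lies about its user agent sails straight past this.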
Excellent. Personal shoppers are 'adblock for IRL'.
>You owe the companies nothing. You especially don't owe them any courtesy. They have re-arranged the world to put themselves in front of you. They never asked for your permission, don't even start asking for theirs.
If a store's business at least partially relies on obscurity of information that can be defeated through automated means (e.g. storefronts tend to push visitors towards products they don't want, and buyer agents fight that by looking for what the buyer actually asked for), then playing this cat-and-mouse game of blocking agents, finding workarounds, and repeating the cycle only creates perverse technological contraptions that neither party is really interested in - but both are circumstantially forced to invest in.
It's a clear road to disaster. By comparison, I am honestly surprised by how great Hacker News is, where most people are sharing for the love of the craft. And for that Hacker News holds a special place in my heart. (Slightly exaggerating to give it a thematic ending, I suppose.)
And those ads don't spy. They tend to be a jpg that functions as a link. That's why I mentioned spying.
Publishing on a personal blog is not the path.
LLMs aren't taking away from your "prestige" or recognition. Any more than a podcaster referencing an idea of yours without mentioning you is. Or anyone else in casual conversation.
Am I supposed to spend money on Amazon.com when I visit the website just because Amazon wants me to?
HTTP/1.1 402 Payment Required
WWW-price: 0.0000001 BTC, 0.000001 ETH, 0.00001 DOGE
> You are less likely to participate in discussion

But you (or the AI on your behalf) paid instead. Many sites would probably like that better.
Do you still see authentic human traffic on your domains, is it easy to discern?
I feel like I missed the bus on running a blog pre-AI.
And yet people install ad blockers and defend their freedom to not participate in this because they don't want to be annoyed by ads.
They claim that since they are free to not buy an advertised product, why should they be forced to see ads for it. But Foo News claims that they are also free to not waste bandwidth serving their free website to people who declare (by using an ad blocker or the modern alternative: an AI summarizer) that they won't participate in the funding of the service.
What prevents anyone else? robots.txt is a request, not an access policy.
If you give them a URL that does not appear in Google, ask them to visit that URL specifically, and then notice the content from that URL in the training data, it's proof that they're doing this, which would be quite damaging to them.
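Setting that trap is cheap: embed a nonce that exists nowhere else on the web. A sketch, where the file name and wording are arbitrary choices:

  # Generate a canary page with a token published nowhere else.
  # If the token later surfaces in a model's output, the page was
  # fetched and ingested despite never being linked or indexed.
  import secrets

  token = "canary-" + secrets.token_hex(16)
  with open("unlisted-page.html", "w") as f:
      f.write(f"<html><body><p>Verification token: {token}</p></body></html>")
  print("Record this and search model outputs for it later:", token)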
When ads were far less invasive, I had a lot more tolerance.
Now they want my data, they want to play audio, video, hijack the content, page etc.
Advertising scum can not be trusted to forever take more and more and more.
Are website owners obligated to serve content to AI agents and/or LLM scrapers?
Is it? It's damning, but is it damaging at all?
I'm now getting the impression that anyone's data being available for training, if some bot can get to it, is just how things are now rather than an unsettled point of contention. There's too much money invested in this thing for any other outcome, and with the present decline of the rule of law…
And yes, a podcaster talking about someone's idea without referencing it is an unethical behavior.
What a bleak view of the world.
Fundamentally it's not true that the moment I publish something on the internet, I lose control of who can consume my intellectual property. Licensing, for example, is a way we regulate the way that code or prose can be consumed even if public.
Also, expressing my consent is not in any way a means to control others; it is a way to control my ideas, my writing, my [whatever], and people are not automatically entitled to it just because it's published on the internet.
So overall I understand your position, but I so much disagree with it.
Websites are not "public resources"; site operators just mostly choose to allow the general public to access them. There's no legal requirement that they do so.
If you want anti-discrimination laws that apply to businesses to also cover bots, that is well outside of current law. A site operator can absolutely morally and legally decide they do not allow non-human visitors, just like a store can prohibit pets.
If most people stop discussing things on HN, and the discussion is indeed one of the major reasons it’s kept running, then HN stops being worth running.
There are so many links I click on these days that are such trash I'd be demanding refunds constantly.
Both my blog homepage and posts see mostly human traffic. Sometimes bots crawl the site and they appear as spikes in the analytics.
Looks like my homepage, which doesn't have anything but links, is pretty popular with crawlers. My digital garden doesn't get much interest from them. All in all, human traffic on my sites is pretty much alive.
I don't believe in missing the bus in anything actually, because I don't write these for others, first. Both my blog (more meta) and digital garden (more technical) are written for myself primarily, and left open. I post links to both when it's appropriate, but they are not made to be popular. If people read it and learn something or solve one of their problems, that's enough for me.
This is why my software is GPLv3, Digital Garden is GFDL and blog is CC BY-NC-SA 2.0. This is why everything is running with absolutely minimum analytics and without any ads whatsoever.
Lastly, this is why I don't want AI crawlers on my site or my data in the models. This thing is made by a human for humans, absolutely for free. It's not OK for somebody to take something designed to be free, sell it, and make money off it.
you could go proper insanomode, too. remaking The Internet is trivial if you don't care about existing web standards -- replacing HTTP with your own TCP implementation, getting off html/js/css, etc. being greenfield, you can control the protocol, server, and client implementation, and put it in whatever language you want. I made a stateful Internet implementation in Python earlier for proof-of-concept, but I want to port it and expand on it in rust soon (just for fun; I don't do serious biznos). you'll very likely have 100% human traffic then, even if you're the only person curious and trusting enough to run your client.
I think this is a pretty different scenario. Here the user and the news website are talking directly to each other, but then the user is making a choice around what to do with the content the news website sends to them. With AI agents, there is a company inserting themselves between the user and the news website and acting as a middleman.
It seems reasonable to me that the news website might say they only want to deal with users and not middlemen.
Does information no longer want to be free? Maybe the internet, just like social media, was a social experiment in the end, albeit a successful one. Thanks, GenAI.
I also want to keep this distinction on the sites I own. I also use licenses to signal that this site is not good to use for AI training, because it's CC BY-NC-SA-2.0.
So, I license my content appropriately (no derivatives, non-commercial, shareable under the same license with attribution) and add technical countermeasures on top, because companies don't respect these licenses (because monies) and circumvent these mechanisms (because monies), and I'm the one who has to suck it up and shut up (because their monies)?
Makes no sense whatsoever.
If I buy stuff at a grocery store, I can’t get a random bagger fired just because I feel like it. At some point the transaction ends and they ultimately continue to operate with or without your input.
That is why AI "summarization" becomes a necessary intermediate layer. You'd see neither trash nor ads, and you'd pay instead of being exposed to the ads. AI saves the Internet :)
What I want to stop is excessive crawling and scraping of my server. Once they have the file they can do what they want with it. Another comment (44786237) mentions that robots.txt is only for restricting recursive access; I agree, and that is what should be blocked. They also should not access the same file several times in quick succession, since it should be unnecessary to do so, just as much as they should not access all of the files. (If someone wants to make a mirror of the files, there may be other ways, e.g. an archive file available for downloading many at once, possibly one the site operator made along with their own index; if it is a git repository, then it can be cloned.)
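For the pacing half of that, robots.txt grew a de facto directive; it was never standardized, some crawlers honor it (Bingbot) and others ignore it outright (Googlebot), so treat it as a request rather than a control:

  User-agent: *
  Crawl-delay: 10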
I never saw people bother with scissors but I've seen people pulling the ads out of the newspaper countless times.
If you don't have the funds to sue an AI corp, I'd probably think of a plan B. Maybe poison the data for unauthenticated users. Or embrace the inevitability. Or see the bright side of getting embedded in models as if you're leaving your mark.
I’m ok with this. I support the media I truly want to see, and that media offers alternatives that are not ads.
For instance, I pay for YouTube premium. That said, many will not pay.
Licensing is much much more limited than you seem to be thinking of it. For instance, you said explicitly you want a way to control your ideas. The only thing this can mean is a way to control who gets to use your ideas, or what they get to use them for. So if I express a political idea in a novel way or tell a funny joke or something I should be able to dictate who gets to repeat it, or in this case with LLMs who gets to summarise and describe it.
This kind of control is antithetical to the spirit of the internet and would be frankly evil if people were actually able to assert it. Luckily in most cases it's impossible, nobody can actually stop me from describing a movie to my friends or from reposting a meme. Just copying and reposting what you wrote verbatim is something we can probably agree is wrong, but that isn't what's up for questioning here. The idea I was actually replying to in the first place was that you can decide somebody can't read your ideas - even if they're public - just because you don't like them or you don't like what they will do with them. It is hard to think of a more egregious kind of 1984-style censorship, really.
There is a place for regulation of LLM companies, they are doing a lot of harm that I wish governments would effectively rein in. It would not be hard if the political will existed. But this idea of saying I should be able to "control my ideas" is way, way worse.
> I made a stateful Internet implementation in Python earlier for proof-of-concept
Is there a repo or some other form of public access? I'd like to see this.

Absolutely, I'm in agreement here. I want to run a JS-free blog, just plain old static HTML. I plan to use GoAccess to parse the access logs but that's it. I think I would find it encouraging to see real human traffic.
> I don't write these for others, first. Both my blog (more meta) and digital garden (more technical) are written for myself primarily, and left open.
That is a great way to view it, thank you.
What if my executive assistant reads the news website and gives me a digest?
Would the website owners prefer that I do my reading directly?
People hate obnoxious ads because the money that pays for them is essentially a bribe to artificially elevate content above its deserved ranking. It feels like you're being manipulated into an unfavorable trade.
Big Tech has hidden behind ToS for years. Now, it seems as though it only works for them, never against them. It seems as though this would be easy to orchestrate and prove, forcing these companies into a legal nightmare or risking insolvency under the load of cases filed against them.
Why couldn't something like this be used to flip the table? A conciliation brigading, of sorts.
That is unfortunately not a distinction that is currently legally enforceable. Until that changes all other "solutions" are pointless and only cause more harm.
> People who think like that made tools like Anubis, and it works.
It works to get real humans like myself to stop visiting your site while scrapers will have people whose entire job is to work around such "protections". Just like traditional DRM inconveniences honest customers and not pirates. And to be clear, what you are advocating for is DRM.
> I also want to keep this distinction on the sites I own. I also use licenses to signal that this site is not good to use for AI training, because it's CC BY-NC-SA-2.0.
If AI crawlers cared about that we wouldn't be talking about this issue. A license can only give more permissions than there are without one.
Your license is probably not relevant. I can go to the cinema and watch a movie, then come on this website and describe the whole plot. That isn't copyright infringement. Even if I told it to the whole world, it wouldn't be copyright infringement. Probably the movie seller would prefer it if I didn't tell anyone. Why should I care?
I actually agree that AI companies are generally bad and should be stopped - because they use an exorbitant amount of bandwidth and harm the services for other users. At least they should be heavily taxed. I don't even begrudge people for using Anubis, at least in some cases. But it is wrong-headed (and actually wrong in fact) to try to say someone may or may not use my content for some purpose because it hurts my feelings or it messes with my ad revenue. We have laws against copyright infringement, and to prevent service disruption. We should not have laws that say, yes you can read my site but no you can't use it to train an LLM, or to build a search index. That would be unethical. Call for a windfall tax if they piss you off so much.
You can block IPs at the host level, but there are pretty easy ways around that with proxy networks.
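Per-IP limits at least cap the damage any single address can do. A minimal nginx sketch; the zone name, rate, and burst values here are arbitrary:

  # Inside the http {} block:
  limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

  server {
      listen 80;
      location / {
          limit_req zone=perip burst=10 nodelay;
          proxy_pass http://127.0.0.1:8080;  # your actual backend
      }
  }

Residential proxy pools dilute this, but it still blunts single-host floods.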
If I am buying Apple products, am I contributing to their ad budget? If so, where does that money end up? Is it likely that some of it will end up as ad revenue on some website? What difference does it make whether or not I block ads? Or the other way around, if I am visiting websites and look at Apple ads but do not buy Apple products, am I contributing to the ad revenue of the websites?
I remember that Samsung was at one time offering to play non-skippable full-screen ads on their newest 8K OLED TVs, and their argument was precisely that these ads would reach those rich people who normally pay extra to avoid getting spammed with ads. Or going with your executive assistant example, there are situations where it makes sense to bribe them to get access to you and/or your data. E.g. an "evil maid attack".
You're welcome. I'm glad it helped.
> I want to run a JS-free blog, just plain old static HTML.
If you want to start fast until you find a template you want to work with, I can recommend Mataroa [0]. The blog has almost no JS (it binds a couple of keys for navigation, that's it), and it's $10/year. When you feel ready with a self-hosted solution, you can move off it. It's all Markdown at the end of the day.
> I plan to use GoAccess to parse the access logs but that's it.
That's the only thing I use, too. Nothing else.
If you want to look at what I do, how I do, and reach out to me, the rabbit hole starts from my profile, here.
Wish you all the best, and may you find bliss and joy you never dreamed of!
It is? Are we talking about the same YouTube? I get absolutely useless recommendations, I get un-hooked within a couple videos, and I even keep getting recommendations for the same videos I've literally watched yesterday. Who in the world gets hooked by this??
So here the consent is indeed about what can be done with the data.
In general, it's absolutely the norm that public websites (i.e., unauthenticated) restrict even who can access the data. The simplest example that comes to mind is geoblocking. I have all the rights to say that my website is not made available to anybody in the US, for example. Would you still call that website "public"? Would bypassing the block via a VPN be a violation of my consent? This is mostly a moral discussion I suppose.
But anyway, it's not what's happening here. LLMs access content for the sole purpose of doing something with that content, either training or providing the service to their customers. They are not humans, they are not consumers, they don't simply fetch the content and present it to the users (a much more neutral action, like curl or the browser does). It's impossible to distinguish, in the case of LLMs the act of accessing and the act of using, so the difference you make doesn't apply in my opinion.
the server ("lodge") passes JSON to the client from what are called .branch files. the client receives the JSON, parses it, then builds the UI and state representation from it, which is then stored in that client's memory (self.current_doc and self.page_state in the python client).
branches can invoke waterwheel (.ww) files hosted on the lodge. waterwheel files on the lodge contain scripts which define how patches (as JSON) are to be sent to the client. the client updates its state based on the JSON patch it receives. sample .branch and .ww from python implementation (in pastebin so to not make everyone have to scroll through this): https://pastebin.com/A0DEZDmR
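for illustration only, a patch in that flow could be as small as this (the shape here is made up and simplified; the pastebin has the real format):

  {"op": "replace", "path": "/page_state/title", "value": "new title"}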
This is a false analogy. A correct one would be going to 1000 movies and creating the 1001st movie with scenes cropped from those 1000 movies, assembled as a new movie - and that is copyright infringement. I don't think any of the studios would applaud and support you for your creativity.
> But it is wrong-headed (and actually wrong in fact) to try to say someone may or may not use my content for some purpose because it hurts my feelings or it messes with my ad revenue.
Why does it have to be always about money? Personally it's not. I just don't want my work to be abused and sold to people to benefit a third party without my consent and will (and all my work is licensed appropriately for that).
> We should not have laws that say, yes you can read my site but no you can't use it to train an LLM, or to build a search index.
This goes both ways. If big corporations can scrape my material without asking me and resell it as an output of a model, I can equally distill their models further and sell it as my own. If companies can scrape my pages to sell my content as theirs, I can scrape theirs and unpaywall them.
But that will be copyright infringement, just because they have more money. What angers me is "all is fair game because you're a small fish, and this is a capitalist marketplace" mentality.
If companies can paywall their content to humans that don't pay, I can paywall AI companies and demand money or push them out of my lawn, just because I feel like that. The inverse is very unethical, but very capitalist, yes.
It's not always about money.
P.S.: Oh, try to claim that you can train a model with medical data without any clearance because it'd be unethical to have laws limiting this. It'll be fun. Believe me.
If we talk about Anubis, it's pretty invisible. You wait a couple of seconds on your first visit, and don't get challenged again for a couple of weeks, at least. With more tuning, some of the sites using Anubis work perfectly well without you ever seeing Anubis' wall while it still stops AI crawlers.
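Those couple of seconds are the browser grinding a small proof-of-work. The idea, in a minimal Python sketch (the scheme and difficulty here are illustrative, not Anubis' actual protocol):

  # Find a nonce whose SHA-256(challenge + nonce) starts with N zero
  # hex digits. The server verifies with a single hash; the client
  # pays with many. Cheap for one human visit, costly at crawler scale.
  import hashlib

  def solve(challenge: str, difficulty: int = 4) -> int:
      prefix = "0" * difficulty
      nonce = 0
      while True:
          digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
          if digest.startswith(prefix):
              return nonce
          nonce += 1

  print("nonce:", solve("example-challenge"))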
> And to be clear, what you are advocating for is DRM.
Yes. It's pretty ironic that someone like me who believes in open access prefers a DRM solution to keep companies abusing the small fish, but life is an interesting phenomenon, and these things happen.
> Until that changes all other "solutions" are pointless and only cause more harm.
As an addendum to the above paragraph, I'm not happy that I have to insert draconian measures between the user and the information I want to share, but I need a way to signal to these faceless things that I'm not having it. What do you propose? Taking my sites offline? Burning myself in front of one of the HQs?
> If AI crawlers cared about that we wouldn't be talking about this issue. A license can only give more permissions than there are without one.
AI crawlers default to "Public Domain" when they find no licenses. Some of my lamest source code repositories made it into "The Stack" because I forgot to add COPYING.md. A fork of a GPLv2 tool that I wrote some patches for also got into "The Stack", because COPYING.md was not in the root folder of the repository. I'd rather add licenses (ones I can accept) to things than leave them as-is, because AI companies eagerly grab anything without a license.
All licenses I use mandate attribution and continuation of the license, at least, and my blog doesn't allow any derivations of what I have written. So you can't ingest it into a model to be derived and remixed with something else.
Also, advertising does other things than tell you to buy something, and it doesn’t always take the form of banner ads. Apple, for example, does a ton of brand awareness advertising. Affiliate marketing often targets direct transactions. Maybe your goal is to simply start a relationship that might someday lead to a really big purchase.
Often, in the era of SaaS, people advertise to existing customers. Apple does this—they have a TV service and a music service and a cloud service.
There are plenty of reasons for them to advertise after you bought the original product.
But your original point was that customers bought the ads. Maybe they didn’t! Maybe they were given funding by a VC firm and the company decided it wanted to build an audience. Maybe they want to advocate for a political issue.
I think the biggest problem with your argument is that it has tunnel vision and sees advertising as this one dimensional thing, when in reality it takes many forms. Plenty of those forms are bad, but it is not as simple as “I bought a product, now I never want to see an Apple ad ever again.” Many businesses (Amazon, eBay) make most of their money off of customers they’ve already advertised to that they advertise to again and again.
I think you are describing something much more like Stable Diffusion. This article is about Perplexity, which is much closer to "watch a movie and tell me the plot" than to "take these 1000 movies and make a collage". The copyright points are different - Stable Diffusion is on much shakier ground than Perplexity.
> Why does it have to be always about money?
Before I mentioned money I said "because it hurts my feelings". I'm sorry I can't give a more charitable interpretation, but I really do see this kind of objection as "I don't want you to have access to this web page because I don't like LLMs". This is not a principled objection, it is just "I don't like you, go away". I don't think this is a good principle to build the web on.
Obviously you can make your website private if you want, and that would be a shame. But you can't have this kind of pick-and-choose "public when you feel like it" option. By the way, I didn't mention it before, but I am OK with people using Anubis and the like as a compromise while the situation remains unjust. But the justification is very important.
> If companies can scrape my pages to sell my content as theirs, I can scrape theirs and unpaywall them.
This is probably not a gambit you want to make. You literally can do this, and they would probably like it if you did. You don't want to do that, because the output of LLMs is usually not that good.
In fact, LLM companies should probably be taxed, and the taxes used to fund real human AI-free creations. This will probably not happen, but I am used to disappointment.
> P.S.: Oh, try to claim that you can train a model with medical data
Medical data is not public, for good reasons.
> The simplest example that comes to mind is geoblocking.
Do you think it is alright to geoblock people, for arbitrary reasons? It is one thing when GDPR imposes a legal obligation on you for serving content in a particular way. Note that that actually doesn't prevent you from seeing the content, it just prevents you from being served by that server. The distinction is important - circumventing a geoblock is something I think should be legally protected.
> They are not humans, they are not consumers, they don't simply fetch the content and present it to the users
They simply fetch the content, run it through software, and present it to the user. As far as you, the service owner, are concerned, they are simply fetching the content for the user. It is none of your business what the user and the AI company go on to do with "your content".
I've successfully used conciliation court against large corporations in the past which is why I question it here.
And while this should be able to be handled via legislation it won't be. Beyond that a workaround could force that to happen.
It's not invisible; the sites using it don't work perfectly well for all users, and it doesn't stop AI crawlers.
No, they are not like browsers. The browser accesses my content in a transparent way. An LLM reuses the information and acts as an opaque intermediary which - maybe - will at most add a reference to my content.
> I never said that an LLM does anything of its own volition
It doesn't matter why it does what it does, it matters what it does. Your previous comment stressed the idea that it's possible to regulate _what can be done_ with my intellectual property (licensing), but not who can access it, once made it public. What I am saying is that this is exactly the case for LLMs, who _use_ my intellectual property, they are not a tool to _access_ it (like a browser).
> Do you think it is alright to geoblock people, for arbitrary reasons?
Yes. Why wouldn't it be? And if you believe it's not, where do you draw the line? Once you share a picture with your partner, everyone has the right to see it? Or if you share it with your group of friends? Or if you share it on a private social media profile (where you have acquaintances)? When does the audience turn from "a restricted group" to "everyone"? Or why would it be different with my blog? If I want my blog accessible only from my country, I can absolutely do that and there is nothing wrong with it at all. Nobody is entitled to my intellectual property. Obviously I am playing devil's advocate, but this was to say that the fact that something is public, doesn't mean it's unrestricted. And don't get me started on "the spirit of the internet". I can't imagine something breaking that spirit more than LLMs acting as interface between people and the other people on the internet. That spirit is gone, and belongs to a time when the internet was tiny. When OpenAI and company will respect the "spirit of the internet", maybe I will think about doing the same.
> As far as you, the service owner, are concerned, they are simply fetching the content for the user. It is none of your business what the user and the AI company go on to do with "your content".
No, as far as I am concerned the program can take my information, summarize, change, distort, misinterpret it and then present it back to its user. This can happen with or without the user ever knowing that the information came from me. Considering this equal to the user accessing the information is something I simply will not concede, and it is a fundamental disagreement between us from which many other disagreements stem.
Sorry, I had never heard that term before. You would still have to show standing though. How would you try to prove that their violating your TOS cost you money?
In fact, you did the opposite.
Again, I can't copy and distribute a game Microsoft rents to me. But if I do, I can be held accountable for a ridiculous amount of money. If it's my work of art, the terms can dictate who needs to pay and who doesn't. If an LLM is consuming my work of art and now distributing it within their user base, how is that not the same?
We can even go one step further: if anyone is screwing over websites, it is the ad industry, by not paying for blocked ads. I buy an iPhone and Apple takes some additional money from me to spend on advertising. I did not ask for that, but I am fine with it. Now I expect Apple to spend the money they took from me on ads in order to support websites. But if the guy Apple wants to show the ad to - the ad that I paid for - does not want to see it and blocks it, then I want Apple to respect that and still pay the website. I know, not going to happen, but do not put the blame on people blocking ads.