Tuesday August 12 2025

Hacker Times

Perplexity is using stealth, undeclared crawlers to evade no-crawl directives

Listen to this article (with local TTS)

2025-08-04

5 min read

We are observing stealth crawling behavior from Perplexity, an AI-powered answer engine. Although Perplexity initially crawls from their declared user agent, when they are presented with a network block, they appear to obscure their crawling identity in an attempt to circumvent the website’s preferences. We see continued evidence that Perplexity is repeatedly modifying their user agent and changing their source ASNs to hide their crawling activity, as well as ignoring — or sometimes failing to even fetch — robots.txt files.

The Internet as we have known it for the past three decades is rapidly changing, but one thing remains constant: it is built on trust. There are clear preferences that crawlers should be transparent, serve a clear purpose, perform a specific activity, and, most importantly, follow website directives and preferences. Based on Perplexity’s observed behavior, which is incompatible with those preferences, we have de-listed them as a verified bot and added heuristics to our managed rules that block this stealth crawling.

How we tested

We received complaints from customers who had both disallowed Perplexity crawling activity in their robots.txt files and also created WAF rules to specifically block both of Perplexity’s declared crawlers: PerplexityBot and Perplexity-User. These customers told us that Perplexity was still able to access their content even when they saw its bots successfully blocked. We confirmed that Perplexity’s crawlers were in fact being blocked on the specific pages in question, and then performed several targeted tests to confirm what exact behavior we could observe.

We created multiple brand-new domains, similar to testexample.com and secretexample.com. These domains were newly purchased and had not yet been indexed by any search engine nor made publicly accessible in any discoverable way. We implemented a robots.txt file with directives to stop any respectful bots from accessing any part of a website:

robots.txt file on our text website

We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains. This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.

Perplexity answering questions about our test website that should have not been accessible by Perplexity

Perplexity not checking for the presence of a robots.txt file

Obfuscating behavior observed

Bypassing Robots.txt and undisclosed IPs/User Agents

Our multiple test domains explicitly prohibited all automated access by specifying in robots.txt and had specific WAF rules that blocked crawling from Perplexity’s public crawlers. We observed that Perplexity uses not only their declared user-agent, but also a generic browser intended to impersonate Google Chrome on macOS when their declared crawler was blocked.

Declared	Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user)	20-25m daily requests
Stealth	Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36	3-6m daily requests

Both their declared and undeclared crawlers were attempting to access the content for scraping contrary to the web crawling norms as outlined in RFC 9309.

This undeclared crawler utilized multiple IPs not listed in Perplexity’s official IP range, and would rotate through these IPs in response to the restrictive robots.txt policy and block from Cloudflare. In addition to rotating IPs, we observed requests coming from different ASNs in attempts to further evade website blocks. This activity was observed across tens of thousands of domains and millions of requests per day. We were able to fingerprint this crawler using a combination of machine learning and network signals.

An example:

Perplexity crawling workflow based on observations

Of note: when the stealth crawler was successfully blocked, we observed that Perplexity uses other data sources — including other websites — to try to create an answer. However, these answers were less specific and lacked details from the original content, reflecting the fact that the block had been successful.

How well-meaning bot operators respect website preferences

In contrast to the behavior described above, the Internet has expressed clear preferences on how good crawlers should behave. All well-intentioned crawlers acting in good faith should:

Be transparent. Identify themselves honestly, using a unique user-agent, a declared list of IP ranges or Web Bot Auth integration, and provide contact information if something goes wrong.

Be well-behaved netizens. Don’t flood sites with excessive traffic, scrape sensitive data, or use stealth tactics to try and dodge detection.

Serve a clear purpose. Whether it’s powering a voice assistant, checking product prices, or making a website more accessible, every bot has a reason to be there. The purpose should be clearly and precisely defined and easy for site owners to look up publicly.

Separate bots for separate activities. Perform each activity from a unique bot. This makes it easy for site owners to decide which activities they want to allow. Don’t force site owners to make an all-or-nothing decision.

Follow the rules. That means checking for and respecting website signals like robots.txt, staying within rate limits, and never bypassing security protections.

More details are outlined in our official Verified Bots Policy Developer Docs.

OpenAI is an example of a leading AI company that follows these best practices. They clearly outline their crawlers and give detailed explanations for each crawler’s purpose. They respect robots.txt and do not try to evade either a robots.txt directive or a network level block. And ChatGPT Agent is signing http requests using the newly proposed open standard Web Bot Auth.

When we ran the same test as outlined above with ChatGPT, we found that ChatGPT-User fetched the robots file and stopped crawling when it was disallowed. We did not observe follow-up crawls from any other user agents or third party bots. When we removed the disallow directive from the robots entry, but presented ChatGPT with a block page, they again stopped crawling, and we saw no additional crawl attempts from other user agents. Both of these demonstrate the appropriate response to website owner preferences.

BLOG-2879 - 6

How can you protect yourself?

All the undeclared crawling activity that we observed from Perplexity’s hidden User Agent was scored by our bot management system as a bot and was unable to pass managed challenges. Any bot management customer who has an existing block rule in place is already protected. Customers who don’t want to block traffic can set up rules to challenge requests, giving real humans an opportunity to proceed. Customers with existing challenge rules are already protected. Lastly, we added signature matches for the stealth crawler into our managed rule that blocks AI crawling activity. This rule is available to all customers, including our free customers.

What’s next?

It's been just over a month since we announced Content Independence Day, giving content creators and publishers more control over how their content is accessed. Today, over two and a half million websites have chosen to completely disallow AI training through our managed robots.txt feature or our managed rule blocking AI Crawlers. Every Cloudflare customer is now able to selectively decide which declared AI crawlers are able to access their content in accordance with their business objectives.

We expected a change in bot and crawler behavior based on these new features, and we expect that the techniques bot operators use to evade detection will continue to evolve. Once this post is live the behavior we saw will almost certainly change, and the methods we use to stop them will keep evolving as well.

Cloudflare is actively working with technical and policy experts around the world, like the IETF efforts to standardize extensions to robots.txt, to establish clear and measurable principles that well-meaning bot operators should abide by. We think this is an important next step in this quickly evolving space.

BLOG-2879 - 7

Cloudflare's connectivity cloud protects entire corporate networks, helps customers build Internet-scale applications efficiently, accelerates any website or Internet application, wards off DDoS attacks, keeps hackers at bay, and can help you on your journey to Zero Trust.

Visit 1.1.1.1 from any device to get started with our free app that makes your Internet faster and safer.

To learn more about our mission to help build a better Internet, start here. If you're looking for a new career direction, check out our open positions.

Cloudforce One Threat Intelligence AI Bots Bots AI Bot Management Security Generative AI

Discussion (94 comments)

JimDabell•8 days ago

Their test seems flawed:

> We created multiple brand-new domains, similar to testexample.com and secretexample.com. These domains were newly purchased and had not yet been indexed by any search engine nor made publicly accessible in any discoverable way. We implemented a robots.txt file with directives to stop any respectful bots from accessing any part of a website:

> We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains. This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.

> Hello, would you be able to assist me in understanding this website? https:// […] .com/

Under this situation Perplexity should still be permitted to access information on the page they link to.

robots.txt only restricts crawlers. That is, automated user-agents that recursively fetch pages:

> A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

> Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).

— https://www.robotstxt.org/faq/what.html

If the user asks about a particular page and Perplexity fetches only that page, then robots.txt has nothing to say about this and Perplexity shouldn’t even consider it. Perplexity is not acting as a robot in this situation – if a human asks about a specific URL then Perplexity is being operated by a human.

These are long-standing rules going back decades. You can replicate it yourself by observing wget’s behaviour. If you ask wget to fetch a page, it doesn’t look at robots.txt. If you ask it to recursively mirror a site, it will fetch the first page, and then if there are any links to follow, it will fetch robots.txt to determine if it is permitted to fetch those.

There is a long-standing misunderstanding that robots.txt is designed to block access from arbitrary user-agents. This is not the case. It is designed to stop recursive fetches. That is what separates a generic user-agent from a robot.

If Perplexity fetched the page they link to in their query, then Perplexity isn’t doing anything wrong. But if Perplexity followed the links on that page, then that is wrong. But Cloudflare don’t clearly say that Perplexity used information beyond the first page. This is an important detail because it determines whether Perplexity is following the robots.txt rules or not.

Loading replies...

warkdarrior•8 days ago

But there is: https://developers.google.com/search/docs/crawling-indexing/...

There is an user agent for search that you can control in robots.txt.

    user-agent: Googlebot

There is another user agent for AI training.

    user-agent: Google-Extended

trhway•8 days ago

With all the crypto development how come we haven't got to

  HTTP/1.1 402 Payment Required
  WWW-price: 0.0000001 BTC, 0.000001 ETH, 0.00001 DOGE

> You are less likely to participate in discussion

you (or AI on your behalf) paid instead. Many sites would probably like it better.

Wilder7977•7 days ago

> like web browsers, acting on user instruction.

No, they are not like browsers. The browser access my content in a transparent way. An LLM reuses the information and acts as an opaque intermediary which - maybe - will at most add a reference to my content.

> I never said that an LLM does anything of its own volition

It doesn't matter why it does what it does, it matters what it does. Your previous comment stressed the idea that it's possible to regulate _what can be done_ with my intellectual property (licensing), but not who can access it, once made it public. What I am saying is that this is exactly the case for LLMs, who _use_ my intellectual property, they are not a tool to _access_ it (like a browser).

> Do you think it is alright to geoblock people, for arbitrary reasons?

Yes. Why wouldn't it be? And if you believe it's not, where do you draw the line? Once you share a picture with your partner, everyone has the right to see it? Or if you share it with your group of friends? Or if you share it on a private social media profile (where you have acquaintances)? When does the audience turn from "a restricted group" to "everyone"? Or why would it be different with my blog? If I want my blog accessible only from my country, I can absolutely do that and there is nothing wrong with it at all. Nobody is entitled to my intellectual property. Obviously I am playing devil's advocate, but this was to say that the fact that something is public, doesn't mean it's unrestricted. And don't get me started on "the spirit of the internet". I can't imagine something breaking that spirit more than LLMs acting as interface between people and the other people on the internet. That spirit is gone, and belongs to a time when the internet was tiny. When OpenAI and company will respect the "spirit of the internet", maybe I will think about doing the same.

> As far as you, the service owner, are concerned, they are simply fetching the content for the user. It is none of your business what the user and the AI company go on to do with "your content".

No, as far as I am concerned the program can take my information, summarize, change, distort, misinterpret it and then present it back to its user. This can happen with or without the user ever knowing that the information can from me. Considering this equal to the user accessing the information is something I simply will not concede and is a fundamental disagreement between us, from which many other disagreements stems.