Mar 10, 2026
You can now crawl an entire website with a single API call using Browser Rendering's new /crawl endpoint, available in open beta. Submit a starting URL, and pages are automatically discovered, rendered in a headless browser, and returned in multiple formats, including HTML, Markdown, and structured JSON. This is great for training models, building RAG pipelines, and researching or monitoring content across a site.
Crawl jobs run asynchronously. You submit a URL, receive a job ID, and check back for results as pages are processed.
```
# Initiate a crawl
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{ "url": "https://blog.cloudflare.com/" }'

# Check results
curl -X GET 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}' \
  -H 'Authorization: Bearer <apiToken>'
```
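The submit-then-poll workflow can be wrapped in a small helper. This is a minimal sketch: the `fetch_status` callable stands in for the GET request above, and the status strings (`"running"` vs. anything else) are an assumption about the response shape, not a documented contract.

```python
import time

def poll_until_done(fetch_status, interval=2.0, timeout=60.0, sleep=time.sleep):
    """Poll an async crawl job until it leaves the (assumed) 'running' state.

    fetch_status: callable returning the job's status string, e.g. a thin
    wrapper around GET .../browser-rendering/crawl/{job_id}.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status != "running":
            return status
        sleep(interval)
    raise TimeoutError("crawl job did not finish in time")
```

Injecting `fetch_status` and `sleep` keeps the helper testable and lets you add jittered backoff later without touching the loop.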
Key features:
- modifiedSince and maxAge to skip pages that haven't changed or were recently fetched, saving time and cost on repeated crawls
- render: false to fetch static HTML without spinning up a browser, for faster crawling of static sites
- Respect for robots.txt directives, including crawl-delay

Available on both the Workers Free and Paid plans.
To get started, refer to the crawl endpoint documentation. If you are setting up your own site to be crawled, review the robots.txt and sitemaps best practices.
Obviously there are good reasons NOT to, but I am surprised they haven't started offering it (as an "on-by-default" option, naturally) yet.
From the behaviour of our peers, this seems to be the real headline news.
Is it possible to ignore robots.txt in the case the crawl was triggered by a human?
Sounds pretty useless for any serious AI company
And they can pull it off because of their reach over the internet with the free DNS.
If more sites provided explicit machine-readable entry points for crawlers, indexing could become a lot less wasteful. Right now crawlers spend a lot of effort rediscovering the same structure over and over.
It also raises interesting questions about whether sites will eventually provide different views for humans vs. automated agents in a more formalized way.
Workers-originated requests include a CF-Worker header identifying the workers subdomain, which distinguishes them from regular CDN proxying. You can match on this in a WAF rule or origin middleware.
The trickier issue: rendered requests originate from Cloudflare ASN 13335 with a low bot score, so if you rely on CF bot scores for content protection, requests through their own crawl product will bypass that check. The practical defense is application-layer rate limiting and behavioral analysis rather than network-level scores -- which is better practice regardless.
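Matching on that header in origin middleware can be as simple as the following sketch; the dict-based header interface is a simplification (real frameworks expose case-insensitive header objects), and the return values are made up for illustration.

```python
def classify_request(headers):
    """Flag requests that originated from a Cloudflare Worker.

    Per the comment above, Workers-originated requests carry a CF-Worker
    header naming the workers.dev subdomain; regular CDN proxying does not
    add it. 'headers' is a plain dict standing in for your framework's
    request headers.
    """
    worker = headers.get("CF-Worker")
    if worker:
        return f"worker:{worker}"
    return "regular"
```

The same check translates directly into a WAF rule matching on the `CF-Worker` request header.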
The structural conflict is real but similar to search engines offering webmaster tools while running the index. The incentives are misaligned, but the individual products have independent utility. The harder question is whether the combination makes it meaningfully harder to build effective bot protection on top of their platform.
```
write a custom crawler that will crawl every page on a site (internal
links to the original domain only), scroll down to mimic a human, and
save the output as a WebP screenshot, HTML, Markdown, and structured
JSON. Make it designed to run locally in a terminal on a Linux machine
using headless Google Chrome, and take advantage of multiple cores to
run multiple pages simultaneously, while keeping in mind that it might
have to throttle if the server gets hit too fast from the same IP.
```
Might use available open-source software such as Python, Playwright, BeautifulSoup4, Pillow, aiofiles, and Trafilatura.
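The link-discovery core of such a crawler can be sketched with the stdlib alone. This is a simplified version under stated assumptions: it skips rendering, screenshots, throttling, and Markdown conversion, and takes an injectable `fetch` function where a headless-Chrome wrapper (e.g. Playwright) would go.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl restricted to the start URL's domain.

    fetch(url) -> HTML string; inject an HTTP client or a headless
    browser wrapper here.
    """
    domain = urlparse(start_url).netloc
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            # Internal links to the original domain only, per the prompt.
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

Multi-core fan-out and per-IP throttling from the prompt would wrap `fetch` (e.g. with asyncio semaphores), leaving the traversal logic unchanged.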
All of a sudden, about 1/3 of all traffic to our website is being routed via EWR (New York), me included, even though all our users and our origin servers are in Brazil.
We pay for the Pro plan but support has been of no help: after 20 days of 'debugging' and asking for MTRs and traceroutes, they told us to contact Claro (which is the same as telling me to contact Verizon) because 'it's their fault'.
First, the Cloudflare Crawl endpoint does not require the target site to use Cloudflare. It spins up a headless Chrome instance (via the Browser Rendering API) that fetches and renders any publicly accessible URL. You could crawl a site hosted on Hetzner or a bare VPS with the same call.
Second on pricing: Browser Rendering is only available on the Workers Paid plan ($5/month). It is not part of the free tier. Usage is billed per invocation beyond the included quota - the exact limits are in the Cloudflare docs under Browser Rendering pricing, but for archival use cases with moderate crawl rates you are very unlikely to run into meaningful costs.
The practical gotcha for forum archival is pagination and authentication-gated content. If the forum requires a login to see older posts, a headless browser session with saved cookies would help, but that is more complex to orchestrate than a single-shot fetch.
I'll need to test it out, especially with the labyrinth.
I had the idea after recently buying https://mirror.forum (which I discussed on Discord and the ArchiveTeam IRC servers) that I wanted to preserve/mirror forums, especially tech-related ones [think TinyCoreLinux], since Archive.org is really, really great, but I would prefer some other efforts within this space as well.
I didn't want to scrape/crawl it myself because I felt like it would feel like yet another scraping effort for AI and strain resources of developers.
And even when you want to crawl, the issue is that you can't crawl Cloudflare-protected sites, and sometimes for good reason.
So, to check my understanding: can I use Cloudflare Crawl to essentially crawl the whole website of a forum, and does this only work for forums that use Cloudflare?
Also, what is the pricing of this? Is it just a standard Cloudflare Worker, so I would get the free 100k requests and then 1 million for a few cents (IIRC) for crawling? Considering that Cloudflare is very scalable, it might even make more sense than buying a group of cheap VPSes.
Another point: I was previously thinking the best way would be for maintainers of these forums to give me a backup archive of the forum periodically, as my heart believes that to be the cleanest way. But after discussing it on Linux Discord servers and with archivers in that community and in general, I couldn't find anyone who maintains such tech forums who could subscribe to the idea of sharing the forum's public data as a quick backup for preservation purposes. So if anyone knows of or maintains any such forums, feel free to message here in this thread about that too.
And once that is well setup, and they have their walled garden, then they can present their own API to scrape websites. All well done to be used by your LLM. But as you know, they are the gate keeper so that the Mafia boss decide what will be the "intermediary" fee that is proper for itself to let you do what you were doing without intermediary before.
Doing it on demand still utilizes their cached version, so it saves a trip to the origin, but doesn’t require doubling the cache size. They can still cache the results if the same site is scraped multiple times, but this saves having to cache things that are never going to be requested.
Cache footprint management is a huge factor in the cost and performance for a CDN, you want to get the most out of your storage and you want to serve as many pages from cache as possible.
I know from my experience working for a CDN that we were doing all sorts of things to try to maximize the hit rate for our cache. In fact, one of the easiest and most effective techniques for increasing cache hit rate is to do the OPPOSITE of what you are suggesting: instead of pre-caching content, you do 'second hit caching', where you only store a copy in the cache if a piece of content is requested a second time.

The idea is that a lot of content is requested only once by one user, and then never again, so it is a waste to store it in the cache. If you wait until it is requested a second time before you cache it, you avoid those single-use pages going into your cache, and don't hurt overall performance that much, because the content that is most useful to cache is requested a lot, and you only have to make one extra origin request.
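The technique can be sketched in a few lines. This is a hedged toy model, not a CDN implementation: eviction, TTLs, and concurrency are omitted, and `fetch_origin` stands in for the origin request.

```python
class SecondHitCache:
    """Only store an object once it has been requested twice.

    'seen' tracks first hits; a second request for the same key promotes
    the fetched object into the cache, so single-use pages never occupy
    cache space.
    """
    def __init__(self, fetch_origin):
        self.fetch_origin = fetch_origin
        self.seen = set()
        self.cache = {}

    def get(self, key):
        if key in self.cache:
            return self.cache[key], "hit"
        value = self.fetch_origin(key)
        if key in self.seen:
            self.cache[key] = value   # second request: now worth caching
        else:
            self.seen.add(key)        # first request: remember, don't store
        return value, "miss"
```

The cost of the strategy is visible in the interface: exactly one extra origin request per cached object, traded for never storing one-hit wonders.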
> Cloudflare's network now supports real-time content conversion at the source, for enabled zones using content negotiation headers. Now when AI systems request pages from any website that uses Cloudflare and has Markdown for Agents enabled, they can express the preference for text/markdown in the request. Our network will automatically and efficiently convert the HTML to markdown, when possible, on the fly.
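The negotiation described in that quote can be sketched as follows. This is a simplification for illustration: real Accept headers carry q-values and multiple ranges, and `to_markdown` stands in for whatever converter the edge runs.

```python
def negotiate(accept_header, html, to_markdown):
    """Serve Markdown when the client expresses a text/markdown preference.

    Naive substring match on the Accept header; a production version
    would parse media ranges and q-values properly.
    """
    if "text/markdown" in accept_header:
        return "text/markdown", to_markdown(html)
    return "text/html", html
```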
You could try to gate this behind access controls but at that point you have reinvented a clunky bespoke CDN API that no site owner asked for, plus a fresh legal mess. Static file caches work because they only ever respond to the original request, not because they claim to own or index your content.
It is a short path from "helpful pre-scraped JSON" to handing an entire site to an AI scraper-for-hire with zero friction. The incentives do not line up unless you think every domain on Cloudflare wants their content wholesale exported by default.
It's entirely possible that they're doing this under the hood for cases where they can clearly identify the content they have cached is public.
Isn't this solving a slightly, but very significantly different problem?
You could serve the very same data in two different ways: one to present to the users and one to hand over to scrapers. Of course, some sites would be too difficult or costly to transform into a common underlying cache format, but people who WANT their sites accessible to scrapers could easily help the process along a bit or serve their site in the necessary format in the first place.
But the key is:
A tool using a "pre-scraped" version of a site very likely has very different requirements for how a CDN caches that site. And this could easily be made customizable by those using this endpoint.
Want a free version? OK, give us the list of all the sites you want, then come back in 10min and grab everything in one go; the data will be kept ready for 60s. Got an API token? 10 free near-real-time requests for you, and they'll recharge at a rate of 2 per hour. Want to play nice? Ask the CDN to have the requested content ready in 3 hours. Got deep pockets? Pay for just as many real-real-time requests as you need.
What makes this so different is that unless customers are willing to hand over a lot of money, you don't need to cache anything to serve requests at all. Potentially not even later, if you've got enough capacity to serve the data for scheduled requests from the storage network directly.
You just generate an immediate promise response to the request telling them to come back later. And depending on what you put into that promise, you've got quite a lot of control over the schedule yourself.
- Got a "within 10min" request but your storage network has plenty of capacity in 30s? Just tell them to come back in 30s.
- A customer is pushing new data into your network around 10am, and many bots are interested in getting their hands on it as soon as possible, making requests from 10:00 to 10:05? Just bundle their requests.
- Expected data still not around at 10:05? Unless the bots set an "immediate" flag (or whatever) indicating that they want whatever state the site is in right now, just reply with a second promise when they come back. And a third if necessary... and so on.
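The promise-and-bundle idea above can be sketched as a tiny scheduler. Everything here is hypothetical (the class name, the single `ready_time` oracle, integer timestamps); it only illustrates how a "come back at T" response bundles requesters and lets the CDN control the schedule.

```python
class PromiseScheduler:
    """Answer crawl requests with a promise instead of content."""

    def __init__(self, ready_time):
        # ready_time(url) -> earliest time the content can be served;
        # stands in for cache state, origin push schedules, etc.
        self.ready_time = ready_time
        self.pending = {}  # url -> requesters sharing one eventual fetch

    def request(self, url, requester, now):
        """Return (retry_at, bundle_size): when to come back, and how
        many requesters are bundled onto the same URL so far."""
        retry_at = max(now, self.ready_time(url))
        self.pending.setdefault(url, []).append(requester)
        return retry_at, len(self.pending[url])
```

A second promise for still-missing data (the 10:05 case above) falls out naturally: the client comes back, `ready_time` has moved, and it gets a new `retry_at`.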
They also use their dominant position to apply political pressure when they don’t like how a country chooses to run things.
So yeah, we’ve created another mega corp monster that will hurt for years to come.
But semantic HTML is exactly that explicit machine-readable entry point. I am firmly entrenched in the opinion that HTML and the DOM are only for machines to read; they just happen to be also somewhat understandable to some humans. Take an average webpage and have a look at all the characters (bytes) in there: often two thirds won't ever be shown to humans.
Point being: we don't need to invent something new. We just need to realize we already have it and use it correctly. Other than this requiring better understanding of web tech, it has no downsides. The low hanging fruit being the frameworks out there that should really do a better job of leveraging semantics in their output.
This raises an interesting question about whether it would exacerbate supply-chain injection attacks: show the innocuous page to the human, another to the bot.
Like, there's a difference between dozens of drunk teenagers thrashing the city streets in an illegal street race and a taxi driver.
I've found myself falling pretty hard on the side of making APIs work for humans and expecting LLM providers to optimize around that. I don't need an MCP for a CLI tool, for example, I just need a good man page or `--help` documentation.
If your delay is 1s and you publish fewer than 60 updates a minute on average, I can still get 100%. Most crawls are not that latency-sensitive, certainly not the AI ones.
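The back-of-envelope model behind that claim, made explicit (this is just the commenter's arithmetic, not a crawl-scheduling result):

```python
def full_coverage_possible(crawl_delay_s, updates_per_minute):
    """A crawler honoring crawl-delay can make 60 / crawl_delay_s
    requests per minute; 100% coverage is possible whenever the
    site's average update rate stays below that budget."""
    requests_per_minute = 60 / crawl_delay_s
    return updates_per_minute <= requests_per_minute
```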
HFT bots, now that is an entirely different ballgame.
The fact that 30%+ of the web relies on their caching services, routability services, and DDoS protection services is the main pull.
Their DNS is really only for data collection and to serve as a front of "good will".
> The /crawl endpoint respects the directives of robots.txt files, including crawl-delay. All URLs that /crawl is directed not to crawl are listed in the response with "status": "disallowed".
You don't need any scraping countermeasures for crawlers like those.
Makes you think, right?
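The same robots.txt behavior the /crawl docs describe can be reproduced in a hand-rolled crawler with the stdlib parser. A minimal sketch (the rules and the "ExampleCrawler" agent name are made up):

```python
from urllib import robotparser

# A robots.txt equivalent to the policy quoted above:
# a crawl-delay plus an explicitly disallowed path.
rules = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())
```

A well-behaved crawler checks `rp.can_fetch()` before each request (marking refusals the way /crawl marks them "disallowed") and sleeps `rp.crawl_delay()` seconds between fetches.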
That said, I'm not a fan of letting users forge whatever user agents they please. Instead, AIUI, to opt out of getting crawled I have to look for the existence of certain request headers[1].
[1]: https://developers.cloudflare.com/browser-rendering/referenc...
You'll still be hand-rolling it if you want to disrespect crawling requirements though.
The post says it's available for both free and paid plans. According to the Browser Rendering pricing page, the free plan includes 10 minutes/day of browsing time.
- Crawl jobs per day: 5
- Maximum pages per crawl: 100 pages
[0] https://developers.cloudflare.com/browser-rendering/limits/#...

Further down they also mention that the requests come from CF's ASN and are branded with identifying headers, so third-party filters could easily block them too if they're so inclined. Seems reasonable enough.
Refer to Will Browser Rendering bypass Cloudflare's Bot Protection? for instructions on creating a WAF skip rule.
And "Will Browser Rendering bypass Cloudflare's Bot Protection?" is a hash link to the FAQ page that, surprisingly, has nothing available for this link entry. Is it because it was removed (or hidden), or because it is not yet available until everyone forgets the "we are not evil, we are here to protect the internet"?
(Which, on Akamai, are by default ignored!)
A lot of known crawlers will get a crawler-optimized version of the page
We're creating an internet that is becoming self-reinforcing for those who already have power and harder for anyone else. As crawling becomes difficult and expensive, only those with previously collected datasets get to play. I certainly understand individual sites wanting to limit access, but it seems unlikely that they're limiting access to the big players - and maybe even helping them since others won't be able to compete as well.
You feel better paying someone to do the same thing?
And forget about crawling. If you have a less reputable IP (basically every IP in third world countries are less reputable, for instance), you can be CAPTCHA'ed to no end by Cloudflare even as a human user, on the default setting, so plenty of site owners with more reputable home/office IPs don't even know what they subject a subset of their users to.
[1] E.g. https://www.wired.com/robots.txt to pick an example high up on HN front page.
They certainly behave like they are. We constantly see crawlers trying to do cache busting for pages that haven't changed in days, if not weeks. It's hard to tell where the bots are coming from these days, as most have taken to just lying and saying that they are Chrome.
I'd agree that respecting robots.txt makes this a non-starter for the problematic scrapers. These are bots that will hammer a site into the ground; they don't respect robots.txt, especially if it tells them to go away.
All of this would be much less of a problem if the authors of the scrapers actually knew how to code, understood how the Internet works and had just the slightest bit of respect for others, but they don't so now all scrapers are labeled as hostile, meaning that only the very largest companies, like Google, get special access.
30% of the web might use their caching services. 'Relies on' implies that it wouldn't work without them, which I doubt is the case.
It might be the case for the biggest 1% of that 30%. But not the whole lot.
I'm split between: Yes! At last something to get CF protected sites! And: Uh! Now the internet is successfully centralized.
They are not super helpful fixing it either.
It's hard to see how this isn't extorting folks by offering a working solution that, oh, cloudflare doesn't block. As long as you pay Cloudflare.
Perhaps I'm overly cynical, but I'd be quite surprised if cloudflare subjected their own headless browsing to the same rules the rest of the internet gets.
Also, I am genuinely open to feedback (a lot of it), so just let me know if you know of any other alternative for the particular thing that I wish to create, and I would love to have a discussion about that too! I genuinely wish there can be other ways, and part of the reason I wrote that comment was hoping that someone who manages forums, or knows people who do, could comment back so we could have a meaningful discussion!
I am also happy for you to suggest any good use cases for the domain in general, if anything useful can be made with it. In fact, I am happy to transfer this domain to you if it is something useful to you or anyone here (just donate some money, preferably $50-100, to any great charity, dated after this comment is made, and mail me the details, and I am absolutely willing to transfer the domain; or if you currently work at a charity and it could help the charity in any meaningful manner!).
I had actually asked archive team if I could donate the domain to them if it would help archive.org in any meaningful way and they essentially politely declined.
I just bought this domain because someone on HN wrote mirror.org when they wanted to show someone else a mirror, and I saw the price of the .org domain being so high ($150k or similar). I have a habit of finding random nice TLDs, and I found mirror.forum, so I bought it.
And I was just thinking about what could be a decent idea now that I have bought it, and had thought of that. Obviously I have my flaws (many, actually), but I genuinely don't wish any harm to anybody, especially those people who are passionate about running independent forums on this centralized web. I'd rather have this domain expire than have its activation mean harm to anybody.
looking forward to discussion with ya.
Scrapers seem to be exceedingly careless in using public resources. The problem is often not even DDoS (as in overwhelming bandwidth usage) but rather DoS through excessive hits on expensive routes.
Do you have a source for this? Not saying you're wrong, I'd just like to know more
You don't need to ask anything; I can tell you exactly: because they have no regard for anything but their own profit.
Let me give you an example of this mom and pop shop known as anthropic.
You see, they have this thing called claudebot, and at least initially it scraped by iterating through IPs.
Now you have these things called shared hosting servers, typically running 1000-10000 domains of actual low volume websites on 1-50 or so IPs.
Guess what happens when it is your network's turn to bend over? The whole hosting company's infrastructure goes down as each server has hundreds of claudebots crawling hundreds of vhosts at the same time.
This happened for months. It's the reason they are banned in WAFs by half the hosting industry.
Last time Cloudflare went down, their dashboard was also unavailable, so you couldn't turn off their proxy service anyway.
https://robindev.substack.com/p/cloudflare-took-down-our-web...
HN Discussion:
Given that malicious bots are allegedly spoofing real user agents, "another user agent you have to add to your list" seems like the least of your problems.
The docs are pretty equivocal though:
>If you use Cloudflare products that control or restrict bot traffic such as Bot Management, Web Application Firewall (WAF), or Turnstile, the same rules will apply to the Browser Rendering crawler.
It's not just robots.txt. Most (all?) restrictions that apply to outside bots apply to cloudflare's bot as well, at least that's what they're claiming. If they're being this explicit about it, I'm willing to give them the benefit of the doubt until there's evidence to the contrary, rather than being a cynic and assuming the worst.
To me the current behavior of those scrapers tells me that "they don't plan", period.
Looks like they hired a bunch of excavators and are digging 2 meters deep across whole fields, looking for nuggets of gold, and piling the dirt into a huge mountain.
Once they realize the field was bereft of any gold but full of silver? Or that the gold was actually 2.5 meters deep?
They have to go through everything again.
If I need to treat cloudflare bots the same as malicious bots, that undermines their claim.