So I started searching for what these residential proxy networks actually are.
https://datadome.co/bot-management-protection/how-proxy-prov...
I was hit with a pretty substantial botnet "distributed scraping" attack yesterday.
- About 400,000 different IP addresses over about 3 hours
- Mostly residential IP addresses
- Valid and unique user agents and referrers
- Each IP address would make only a few requests with a long delay in between requests
It would hit the server hard until the server became slow to respond, then it would back off for about 30 seconds, then hit hard again. I was able to block most of the requests with a combination of user agent and referrer patterns, though some legit users may be blocked.
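Roughly this shape of filter, as a rough sketch (the patterns and the Express-style middleware are placeholders, not what I actually deployed):

// Hypothetical middleware: reject requests whose User-Agent or Referer
// matches a known-bad pattern. Patterns below are placeholders.
const badUserAgents = [/HeadlessChrome/i, /python-requests/i];
const badReferrers = [/spam-example\.test/i];

function blockByPattern(req, res, next) {
  const ua = req.headers['user-agent'] || '';
  const ref = req.headers['referer'] || '';
  if (badUserAgents.some((re) => re.test(ua)) || badReferrers.some((re) => re.test(ref))) {
    res.statusCode = 403;
    return res.end('Forbidden');
  }
  next();
}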
The attack was annoying, but the even bigger problem is that the data on this website is under license - we have to pay for it, and it's not cheap. We are able to pay for it (barely) with advertising revenue and some subscriptions.
If everyone is getting this data from their "agent" and scrapers, that means no advertising revenue, and soon enough no more website to scrape: jobs lost, nowhere for scrapers to scrape the data from, nowhere for legit users to get the data for free, etc.
I'll implement Anubis at low difficulty for all my projects and leave a decent llms.txt referenced in my sitemap and robots.txt so LLMs can still get relevant data from my site while keeping bad bots out. I'm getting thousands of requests from China that have really increased costs; glad the fix seems rather easy.
In addition to pulling responses with huge amplification (40x, at least, for posting a single Facebook post to an empty audience), it's sending us traffic with fbclids in the mix. No idea why.
They're also sending tons of masked traffic from their ASN (and EC2), with a fully deceptive UserAgent.
The weirdest part though is that it's scraping mobile-app APIs associated with the site in high volume. We see a ton of other AI-training focused crawlers do this, but was surprised to see the sudden change in behavior on facebookexternalhit ... happened in the last week or so.
Everyone is nuts these days. Got DoSed by Amazonbot this month too. They refuse to tell me what happened, citing the competitive environment.
On Safari or Orion it is merely extremely slow to load.
I definitely wouldn't use any of this on a site that you don't want delisted for cryptojacking.
Very annoying. And you can't filter them because they look like legitimate traffic.
On a page with different options (such as color, size, etc.) they'll try all the combinations, eating all the resources.
Not even a 404, just not available at all.
It looks like it's computing SHA-256 hashes. Such an ASIC-friendly PoW has the downside that someone with ASICs would be able to either overwhelm the site or drive the difficulty up so high that CPUs can never get through.
If you have a logging stack, you can easily find crawler/bot patterns, then flag candidate IP subnets for blocking.
It's definitely whack-a-mole though. We are experimenting with blocking based on risk databases, which run between $2k and $10k a year depending on the provider. These map IP ranges to booleans like is_vpn, is_tor, etc., and also contain ASN information. Slightly suspicious crawling behavior or keyword flagging combined with a hit in that DB, and you have a high-confidence block.
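The combining logic itself is trivial; something like this hedged sketch (the lookup call, flags, weights, and threshold are stand-ins for whatever your provider and heuristics actually give you):

// Hypothetical: combine a risk-DB lookup with behaviour flags to decide a block.
async function shouldBlock(ip, behaviour, riskDb) {
  const info = await riskDb.lookup(ip); // e.g. { is_vpn, is_tor, is_proxy, asn }
  let score = 0;
  if (info.is_vpn || info.is_tor || info.is_proxy) score += 2;
  if (behaviour.suspiciousCrawling) score += 1; // e.g. exhaustive URL-space walking
  if (behaviour.keywordFlag) score += 1;        // e.g. scraper-ish request patterns
  return score >= 3; // block only when the DB hit and the behaviour both look bad
}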
All this stuff is now easy to homeroll with claude. Before it would have been a major PITA.
What is the point of these anti bot measures if organic HN traffic can nuke your site regardless? If this is about protecting information from being acquired by undesirable parties, then this site is currently operating in the most ideal way possible.
The information will eventually be ripped out. You cannot defeat an army with direct access to TSMC's wafer start budget and Microsoft's cloud infrastructure. I would find a different hill to die on. This is exactly like the cookie banners. No one is winning anything here. Publishing information to the public internet is a binary decision. If you need to control access, you do what Netflix and countless others have done. You can't have it both ways.
Maybe I’m a bot, I gave up waiting before the progress bar was even 1% done.
> "The idea is that at individual scales the additional load is ignorable, ..."
Three minutes, one pixel of progress bar, 2 CPUs at 100%, load average 4.3 ...
The site is not protected by Anubis, it's blocked by it.
Closed.
That’s how fast the landscape is changing.
And remember: while the report might have been released in 2024, it takes time to conduct research and publish. A good chunk of its data was likely from 2023 and earlier.
If a website takes that long to verify me, I'll bounce. That's it.
Good luck banning yourself from the future.
I love experimental data like this. So much better than the gut reactions that got spammed when Anubis was first introduced.
Is the theory here that OpenAI, Anthropic, Gemini, xAI, Qwen, Z.ai etc are all either running bad scrapers via domestic proxies in Indonesia, or are buying data from companies that run those scrapers?
I want to know for sure. Who is paying for this activity? What does the marketplace for scraped data look like?
We need a better solution.
> Here is a massive log file for some activity in the Data Export tar pit:
A bit of a privacy faux pas, no? Some visitors may be legitimate.
One of the mistakes people make is assuming that AI capability implies humanness. If you know exactly where to look, you can start to identify differences between improving frontier models and human cognition.
One concrete example from a forthcoming blog post of mine:
[begin]
In fact, CAPTCHAs can still be effective if you know where to look.
We ran 75 trials -- 388 total attempts -- benchmarking three frontier AI agents against reCAPTCHA v2 image challenges. We looked across two categories: static, where each image grid is an individual target, and cross-tile challenges, where an object spans multiple tiles.
On static challenges, the agents performed respectably. Claude Sonnet 4.5 solved 47%. Gemini 2.5 Pro: 56%. GPT-5: 23%.
On cross-tile challenges: Claude scored 0%. Gemini: 2%. GPT-5: 1%.
In contrast, humans find cross-tile challenges easier than static ones. If you spot one tile that matches the target, your visual system follows the object into adjacent tiles automatically.
Agents find them nearly impossible. They evaluate each tile independently, produce perfectly rectangular selections, and fail on partial occlusion and boundary-spanning objects. They process the grid as nine separate classification problems. Humans process it as one scene.
The challenges hardest for humans -- ambiguous static grids where the target is small or unclear -- are easiest for agents. The challenges easiest for humans -- follow the object across tiles -- are hardest for agents. The difficulty curves are inverted. Not because agents are dumb, but because the two systems solve the problem with fundamentally different architectures.
Faking an output means producing the right answer. Faking a process means reverse-engineering the computational dynamics of a biological brain and reproducing them in real time. The first problem can be reduced to a machine learning classifier. The second is an unsolved scientific problem.
The standard objection is that any test can be defeated with sufficient incentive. But fraudsters weren't the ones who built the visual neural networks that defeated text CAPTCHAs -- researchers were. And they aren't solving quantum computing to undermine cryptography. The cost of spoofing an iris scan is an engineering problem. The cost of reproducing human cognition is a scientific one. These are not the same category of difficulty.
[end]
I may be missing something of course
JA4 fingerprinting works decently for the residential proxies.
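For example, something like this sketch - assuming your TLS terminator computes the JA4 fingerprint and hands it to the app in a header (the header name and list entry are made up, not a curated blocklist):

// Hypothetical: deny requests whose JA4 fingerprint is on a blocklist.
const blockedJa4 = new Set([
  't13d1516h2_8daaf6152771_b0da82dd1658', // example fingerprint only
]);

function ja4Filter(req, res, next) {
  const fp = req.headers['x-ja4'];
  if (fp && blockedJa4.has(fp)) {
    res.statusCode = 403;
    return res.end();
  }
  next();
}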
But maybe (and likely for worse) LLMs will finally kill this model.
In that case a better solution would be to take the site down altogether.
It's even dumber than that, because by default anubis whitelists the curl user agent.
curl -H "User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/146.0.0.0 Safari/537.36" "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/?id=v7.0-rc5&id2=v7.0-rc4&dt=2"
<!doctype html><html lang="en"><head><title>Making sure you're not a bot!</title><link rel="stylesheet"
vs curl "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/?id=v7.0-rc5&id2=v7.0-rc4&dt=2"
<!DOCTYPE html>
<html lang='en'>
<head>
<title>kernel/git/torvalds/linux.git - Linux kernel source tree</title>

The click IDs are likely to make the traffic look more like a human who has clicked a link rather than a bot? That way it gets past simple filters that explicitly let such requests in before bothering to check that the source address of the request seems to be a DC rather than a residential IP.
> citing the competitive environment
All the companies are competing to be the biggest inconvenience to everyone else while scraping as much stuff as they can.
Why do you say that?
Digg's recent shutdown message talked about how bad and aggressive bots were. I'd love to see Kevin and Alex post in depth about lessons learned, Dead Internet, and call out social sites.
I think some of the folks running sites would rather have you go to the site and view the items “suggested based on your shopping history” (I consider these ads, the vendors might disagree), etc.
I’m more sympathetic to the people running sites than the LLM training scrapers, but these are two parties in a many-party game and neither one is perfectly aligned with users.
Your key finding is that humans process the grid as one visual scene — but that's a finding about sighted cognition.
Isn't this, like most things, a sensitivity/specificity tradeoff?
How many real humans should be blocked from your system to keep the bots out?
What is the Blackstone ratio of accessibility?
Just like fail2ban is not very useful against a DDOS attack where each unique IP only makes a few requests with a large (hour+) delay in between requests. There is no clear "fail" in these requests, and the fail2ban database becomes huge and far too slow.
- 400,000 Unique IP addresses
- 1 to 3 requests per hour per IP address - with delays of over 60 minutes between each request.
- Legit request URLs, legit UA & referrer
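What does at least surface this shape of traffic is counting above the per-IP level, e.g. per /24 or per ASN; a naive sketch (IPv4-only, threshold up to you), though widely spread residential IPs blunt even this:

// Count requests per /24 instead of per IP, so 400,000 IPs making 1-3
// requests each still show up as a smaller number of very hot prefixes.
const prefixCounts = new Map();

function notePrefix(ip) {
  const prefix = ip.split('.').slice(0, 3).join('.') + '.0/24'; // naive IPv4 prefixing
  const count = (prefixCounts.get(prefix) || 0) + 1;
  prefixCounts.set(prefix, count);
  return count; // e.g. challenge or rate-limit the whole prefix past some threshold
}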
Maybe Anubis would help, but it's also a risk for various reasons.
Unfortunately, what I think will happen - and indeed already is happening - is that the AI companies themselves will replace much of the WWW. Sites like the one I am talking about will cease to exist. AI companies, once they can no longer scrape (steal) the data, will end up licensing it themselves and replacing us as the distributor to end users. Perhaps as a subscription add-on, or also with an ad-based model.
Which to some may be fine. Personally, I don't want a few centralized AI companies replacing the hundreds of thousands of independent websites online. Way too much centralized power there.
The idea is to scare off bots and not normal humans.
Should have maybe prioritized differently...
The root sources of the traffic from residential proxies get murky very quickly.
It's easy to follow the chain partway for some traffic, eg "Why are we receiving all this traffic from Digital Ocean? ... oh, it's their hero client Firecrawl, using a deceptive UserAgent" ... but it still leaves the obvious question about who the Firecrawl client is.
Res proxy traffic is insane these days. There are also plenty of grey-market snowshoe IPs available for the right price, from a handful of ASNs. I regularly see unified crawling missions by unknown agents using 1000+ "clean" IP addresses an hour.
I bet a lot of companies want to provide search results to AI agents.
It also drives home that Anubis needs a time estimate, at least for sites that don't use it as a "can you run JavaScript" wall but as the actual proof-of-work mechanism it purports to be.
It shows a difficulty of "8" with "794 kilohashes per second", but what does that mean? I understand the 8 must be exponential (not literally that 8 hashes are expected to find one solution on average), but even as a power of 2, 2^8=256 is one I happen to know by heart, so thousands of hashes per second would find an answer in a fraction of a second. Or if it's 8 bytes instead of bits, then you'd expect to find a solution after something like 8 million hashes, which at ~800k is about ten seconds. There is no way to figure out how long the expected wait is, even if you understand all the text on the page (which most people wouldn't) and know some shortcuts for the mental math (how many people know small powers of 2 by heart?).
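For what it's worth, here is the back-of-the-envelope calculation the page could show, under the readings of "difficulty 8" I can think of (my reconstruction, not necessarily how Anubis actually defines difficulty):

// Expected solve time for a leading-zeroes PoW at the reported hash rate.
const hashRate = 794e3; // hashes per second, from the Anubis page

function expectedSeconds(difficulty, bitsPerUnit) {
  return Math.pow(2, difficulty * bitsPerUnit) / hashRate; // mean attempts / rate
}

console.log(expectedSeconds(8, 1)); // 8 zero bits:    2^8  hashes, well under a millisecond
console.log(expectedSeconds(8, 4)); // 8 zero nibbles: 2^32 hashes, ~5,400 s (about 90 minutes)
console.log(expectedSeconds(8, 8)); // 8 zero bytes:   2^64 hashes, hundreds of thousands of years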
I can't believe people are still using this as a generic anti-AI argument even though a decade ago people were insisting there was no way AI could have the capabilities that frontier LLMs have today. Moreover, it's unclear whether the gap even exists. Even if we take the claim that the grid pattern is some sort of fundamental constraint AI models can't surpass, it doesn't seem too hard to work around by infilling the grid pattern and presenting the 9 images to the LLM as one image.
Weird part #1 is that the traffic isn't for the most part shaped like crawler traffic. It's incredibly bursty, and heavily redundant, missing even the most obvious low hanging fruit optimizations.
Could be someone is using residential proxies to wrap AI agents' web traffic, but even so, there's a lot of pieces that don't really make sense, like why the traffic pattern is like being hit by a shotgun. It isn't just one request, but anywhere between 40 and 100 redundant requests.
A popular theory is that this is just sloppy coding and AI companies are too rich to care, but then again that doesn't really add up. This isn't just a minor inefficiency: if it is "just" bad coding, they stand to gain monumental efficiency improvements by fixing the issues, in the sense of getting the data much faster - a clear competitive edge.
Really weird.
My unsubstantiated guess is that the residential proxy/botnet is very unreliable, and that's why they fire so many requests. Makes sense if it's sold as a service.
People have a right to complete anonymity, and should be able to go across the majority of the Internet just as they can go across most of the country.
That’s what you are missing.
Don’t get me wrong, I am also in favour of a single government ID, but in terms of combatting identity fraud, accessing public resources like single-payer healthcare, and making it easier for a person to prove their identity to authorities or employers.
It should not be used as a pass card for fundamental rights that normally would have zero government involvement.
However, residential proxies do have a weakness: since they need to maintain 2 separate TCP connections, you can exploit RTT differences between layers 3 and 7 to detect whether the connection to your server is being terminated somewhere along the path. Solutions exist that can reliably detect and block residential proxies, for example: https://layer3intel.com/tripwire
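The idea in sketch form - how you obtain the two RTTs is deployment-specific (e.g. TCP_INFO on the socket vs. timing a small layer-7 exchange), so here they are just inputs:

// Flag connections where the application-layer round trip is much larger than
// the TCP-layer round trip: a proxy terminating TCP near the server while the
// real client sits far behind it produces exactly that gap.
function looksProxied(tcpRttMs, appRttMs, slackMs = 50) {
  return appRttMs - tcpRttMs > slackMs;
}

// looksProxied(12, 230) === true -> likely a relay between you and the real client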
It's certainly possible. However, the traffic is still coming from Facebook's network with a FB proxy PTR record in DNS. Seems much more likely to fool your typical site owner than a bad actor.
// Simplified Anubis-style PoW loop: keep hashing data + nonce until the
// first `requiredZeroBytes` bytes of the SHA-256 digest are zero.
for (let nonce = 0; ; nonce++) {
  const hashBuffer = await calculateSHA256(data + nonce);
  const hashArray = new Uint8Array(hashBuffer);
  let isValid = true;
  for (let i = 0; i < requiredZeroBytes; i++) {
    if (hashArray[i] !== 0) {
      isValid = false;
      break;
    }
  }
  if (isValid) break; // found a winning nonce; submit it to the server
}
It's less proof of work and more just annoying to users, and a feel-good measure for whoever added it to their site; I can't wait for it to go away. As a bonus, it's based on a misunderstanding of hashcash: because it only tests for whole zero bytes instead of comparing against a fractional target (as in Bitcoin, for example), the difficulty isn't granular enough to make sense. Only a couple of the lower settings are reasonably solvable in JavaScript, and the gap between "instantly solved" and "wait for 90 minutes" is two values apart.

By Jackie Glade • March 28, 2026
As you may know, on Glade Art we take anti-bot measures very seriously; it is one of our topmost priorities to protect our fellow users from having their art trained on. We also like to troll bots by trapping them in endless labyrinths of useless data, commonly referred to as "honeypots" or "digital tar pits." And so, after 6.8 million requests in the last 55 days at the time of writing, we have some substantial data, so stand by and let us share it with you. : )

1. Quick clarification.
For starters, these bots do not obey robots.txt. This is expected from unethical companies, but it doesn't make it any better. (A robots.txt file is a plain text file placed on a website which contains rules about where bots are allowed and disallowed to go. Good bots such as search engine crawlers obey these rules, while bad bots do not.) To avoid trapping good bots, we have our robots.txt set to disallow all bots from going into this site's tar pits.

2. Pages and contents.
The 2 traps on this site with the most bot activity are these:
gladeart(DOT)com(SLASH)data-export (over 6.8 million requests in the past 55 days)
gladeart(DOT)com(SLASH)gro (over 84k requests in the past 35 days)
(NOTE: Use a VPN on these pages if you don't want your IP shown in the logs, but it won't be significant amongst the millions of others anyway.)
As you can see when visiting the pages, GRO generates more book-like text, while Data Export's text is, well... whatever it's supposed to be. Data Export is by far more successful than GRO. It would be safe to assume that these companies are scraping for more number-rich data for better facts and such. Fake personal information such as emails or phone numbers also seems to attract scraping very well.

3. Characteristics of these bots.
The IPs of these bots actually do not come from datacenters or VPNs most of the time; the overwhelming majority come from residential and mobile networks. Nearly all of them reside in Asian countries, Indonesia in particular. By leveraging cheap compute from such countries while using residential IPs, they can appear as completely human traffic to many websites and scrape at massive scale. However, there is some good news: these bots do not execute JavaScript, at least not when scraping random sites across the entire web. Just imagine the compute costs if they had to run headless browsers while scraping millions of sites every hour! This makes PoW challenges extremely effective against them. Website traffic at these scales coming from bots that look like normal humans begs the question: how much of the internet's traffic comes from bots?

4. How much of the traffic on the internet comes from bots?
Reports in 2024 say that approximately 51% of all traffic on the internet comes from bots. Now this sounds like a lot, and it is, but it is much worse than that, because these estimates rely heavily on where the IP addresses originate from: whether they come from datacenters or not. As we can see in our data, there is an extremely high number of bots that don't come from datacenters at all. They can certainly be rigged to execute JavaScript on high-quality sites, and many sites don't even require JS, such as Wikipedia and Old Reddit. With this in mind, it wouldn't be unreasonable to assume that the amount of bot traffic on the internet is much higher, perhaps even over 70%.

5. Some experiments on these bots.
Of course we ran some experiments on these bots. Quick fact: Anubis is a program that adds a proof-of-work challenge to websites before users can access them. And so Anubis was enabled in the tar pit at difficulty 1 (lowest setting) when requests were pouring in 24/7. Before it was enabled, it was getting several hundred-thousand requests each day. As soon as Anubis became active in there, it decreased to about 11 requests after 24 hours, most just from curious humans. Was it a coincidence? No, it was not. It was tested on several other occasions, yielding very similar results. As this confirms, bots do not like PoW challenges, even ultra-easy ones. Of the few that do execute JS, extremely few will solve challenges; take the search engine crawler GoogleBot, for example.

6. Who are these bots from?
These bots are almost certainly scraping data for AI training; normal bad actors don't have the funding for millions of unique IPs thrown at a page. They probably belong to several different companies. Perhaps they sell their scraped data to AI companies, or they are AI companies themselves. We can't tell, but we can guess, since there aren't all that many large AI corporations out there.

7. How can you protect your sites from these bots?
If your site has a vast number of pages, then these bots could raise resource usage on your server as they crawl through everything. The best options in this case are Cloudflare or Anubis. Alternatively, you could add a simple JS requirement in your web server, Nginx for example (this won't be as effective, but it is often sufficient for most sites; see the sketch after this post). It is also recommended to add an hCaptcha to forms such as sign-ups. Overall, a correctly configured Anubis on your site eliminates nearly all bot traffic.

8. Server resource usage.
Our server usage for the tar pit endpoints is quite low. For example, when a global 1000-requests-per-minute rate limit was being reached in Data Export, the server's CPU usage was not noticeably higher than when idle (i5 4460). The RAM usage was also very low, much less than 500 MB. And since it's just text data being sent out, uploads were no more than 700 KiB/s.

9. Fun fact.
On average, the Data Export tar pit generates 9000 characters per request. Doing the math, the 6.8 million loads are equivalent to ~52 billion characters, or over 120,000 novels' worth of text generated and sent in total since Jan 29th, 2026.

10. Download a log file.
Here is a massive log file for some activity in the Data Export tar pit: https://mega.nz/file/69Rh3IpS#ThlagHz8e58jLvU-vWn9U9m9T_WegL4SE0H2mhZRcZY
Caution: this file decompresses to about 1.1 GB; standard text editors will struggle to open it. Note: it contains logs from Jan 29th to March 22nd, 2026. [This is for educational purposes only.]

Outro.
And so, with this information we can see just how bad the bot situation is right now on the internet. Look on the bright side though: trolling bots is fun! We recommend you add your own tar pits to your site as well; the more volume the better. Just be sure to disallow them in your robots.txt so that good bots don't get trapped. Bad bots often go into a page precisely because you disallowed it for them. Thank you for reading! : )
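On point 7, a minimal sketch of the "JS requirement" idea (cookie name and challenge page are placeholders; a real deployment would sign or rotate the token):

// Serve content only once a cookie set by client-side JS is present.
const CHECK_COOKIE = 'js_ok';
const CHALLENGE_PAGE =
  '<script>document.cookie = "js_ok=1; path=/"; location.reload();</script>';

function requireJs(req, res, next) {
  const cookies = req.headers.cookie || '';
  if (cookies.includes(CHECK_COOKIE + '=1')) return next();
  res.statusCode = 200;
  res.setHeader('Content-Type', 'text/html');
  res.end(CHALLENGE_PAGE);
}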
A less publicly-visible motive would be if they were building up accounts to use for paid-upvote schemes.
It's not an anti-AI argument; it's an open and unsolved question. Your optimism is appreciated, but dismissing it and assuming it is already solved is foolish and naive.
My website contains ~6000 unique data points in effectively infinite combinations on effectively infinite pages. Some of those combinations are useful for humans, but the AI-scrapers could gain a near-infinite efficiency improvement by just identifying as a bot and heeding my robots.txt and/or rel="nofollow" hints to access the ~500 top level pages which contain close to everything which is unique. They just don't care. All their efficiency attempts are directed solely toward bypassing blocks. (Today I saw them varying the numbers in their user agent strings: X15 rather than X11, Chrome/532 rather than Chrome/132, and so on...)
What a waste of time.
> Anubis uses a Proof-of-Work scheme in the vein of Hashcash
And if you look up Hashcash on Wikipedia you get https://en.wikipedia.org/wiki/Hashcash which explains how Hashcash works in a fairly straightforward manner (unlike most math pages).
I can substantiate this a bit. Verified traffic from Amazonbot is too dumb to do anything with 429s. They will happily slam your site with more traffic than you can handle, and will completely ignore the fact that over half the responses are useless rate limits.
They say they honor REP, but Amazonbot will still hit you pretty persistently even with a full disallow directive in robots.txt
Why? (Am not trolling. Genuinely interested)
I walk out my front door in the UK and I am not anonymous. Every transaction I make either identifies me through bank, railway or other id, or quite simply by my face standing in front of the coffee seller. My walk down the road is observed by neighbours and postmen.
Should my government arrest me without cause or trample on my free speech rights, I get that’s a problem but I am not sure why being anonymous helps. Having rights upheld by the courts helps, well trained police who respect the law helps.
I am honestly open to debate on this but I do find the “what if Hitler took over government where would we be” to be a problematic argument not a final answer.
this is being disproved in the article posted:
>And so Anubis was enabled in the tar pit at difficulty 1 (lowest setting) when requests were pouring in 24/7. Before it was enabled, it was getting several hundred-thousand requests each day. As soon as Anubis became active in there, it decreased to about 11 requests after 24 hours, most just from curious humans.
apparently it does more than annoy users and make the site owner feel good (well, I suppose effective bot blocking would make the site owner feel quite good)
Are these the government? Is the bank the government? Is the rail company the government?
No? Then you have answered your own question.
A silo of identification between you and a service provider that uses the provider’s own tooling is still anonymity from government authoritarianism.
The fact that nearly all of these silos are leaky IRL - with the government eager to punch howitzer-sized holes through them for even more access - is not the point. It is a citizen-hostile flaw that needs patching through loophole-proof legislation, not an ID system that would violently eradicate any remaining separation of government from capitalism.
Remember: when government and capitalism ride in the same cart, it is called corporatism, and it is the basis of fascism. Which is what is happening to America.
You're suggesting the same government that would violate your rights would then help prevent it? I don't follow. Historically, any such power structure (tiered or not) has been wiped away by authoritarians. They will not be helping in the worst case. Ideological capture (corruption) has already started eroding UK rights, and that took a much less overt effort. America has had a robust 3-branch system (executive, legislative, judicial) corrupted by a singular cult of personality. THAT was highly unlikely to happen, but here we are.
With this being said, I do predict that anonymity on the web is going to be phased out. It will result in all sorts of changes to cultural norms across western nations that largely will curtail rights. I dread it.
Shouldn't we try tracing IP addresses and fining organizations for letting the traffic through or originating the traffic first? Seems a lot simpler.
Normal and sane people understand this intuitively. If someone goes to a mechanic because their car is broken and the mechanic says "well, if you can tell that you car is broken, then you should be able to figure out how to fix it" - that mechanic would be universally hated and go out of business in months. Same thing for a customer complaining about a dish made for them in a restaurant, or a user pointing out a bug in a piece of software.
I don't think the person was claiming Anubis doesn't work; they were disputing that PoW is the reason it actually works.
The issue is we want "good" government and "good" corporate behaviour but not the bad. And knowing the difference, especially ahead of time, requires an engaged citizenry and lots of feedback mechanisms that are not overwritten by corruption and noise (i.e. primaries mattering more than elections is a feedback-mechanism failure, in my book).
>>> It will result in all sorts of changes to cultural norms across western nations
I quite agree - but I hope (and think) that the benefits can outweigh the downsides if done well. Those nations that do it well will, I believe, find a rocket-like boost to society and industry, perhaps akin to the post-1945 world. Those who don't will fall behind.
>difficulty 1 (lowest setting)
literally in the comment you're responding to
Edit: oh, I think you mean that in C the string comparison short-circuits. I would expect the same to be true in JavaScript too. It's true in most languages.
Maybe you are just worried about general language overhead, which is a fair point. Is the Anubis check even using multiple threads? For the C case, the real benefit wouldn't come from using C itself, but from being able to use the GPU.
The whole thing is kind of silly though. SHA256 is a terrible choice of hash for PoW. They should be using argon2 or something memory heavy.
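For illustration, a memory-hard variant is easy to express with Node's built-in scrypt (argon2 would need a third-party package; parameters here are arbitrary):

// Swap the SHA-256 attempt for scrypt: N=2^14, r=8 costs roughly 16 MB of RAM
// per attempt, which is what erodes the ASIC/GPU advantage.
const crypto = require('crypto');

function solves(data, nonce, requiredZeroBytes) {
  const hash = crypto.scryptSync(data + nonce, 'fixed-salt', 32, { N: 2 ** 14, r: 8, p: 1 });
  for (let i = 0; i < requiredZeroBytes; i++) {
    if (hash[i] !== 0) return false;
  }
  return true;
}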
Literally the grandparent of the comment chain you're responding to.
Modern Bitcoin miners do a double SHA-256 hash per nonce increment in just a little more than a single hash's worth of work. The input is 80 bytes, which is two 64-byte compression rounds in SHA-256, and only the data in the second round changes (the appended nonce), so you don't bother redoing the first compression round. With other quirks ("ASICBoost") you can end up doing multiple hashes at once due to partial collisions within the input too.