The future is dark, I mean... darknets. For people, by people. Where you can deal with bad actors. Wake up and start networking :)
How to block ASes? Just write a small script that queries all of their subnets once (even if the list changes, it doesn't change enough to matter) and add them to an nft set (nft will take care of aggregating these into continuous blocks). Then just make nft reject requests from this set.
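Something like this works (a minimal sketch, assuming RADB's whois service for the prefix lookup; the ASNs, table, and set names are placeholders):

    # Query each AS's announced IPv4 prefixes from RADB and load them
    # into an nftables interval set; auto-merge coalesces adjacent blocks.
    nft add table inet filter
    nft add chain inet filter input '{ type filter hook input priority 0; }'
    nft add set inet filter blocked_as '{ type ipv4_addr; flags interval; auto-merge; }'

    for asn in AS64496 AS64497; do   # placeholder ASNs
        whois -h whois.radb.net -- "-i origin $asn" \
            | awk '/^route:/ {print $2}' \
            | while read -r prefix; do
                nft add element inet filter blocked_as "{ $prefix }"
            done
    done

    # Reject everything originating from the set.
    nft add rule inet filter input ip saddr @blocked_as reject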
Also, spider traps and 42TB zip-of-death pages work well on poorly written scrapers that ignore robots.txt =3
1. Anubis is a miracle.
2. Because most scrapers suck, I require all requests to include a shibboleth cookie, and if they don’t, I set it and use JavaScript to tell them to reload the page. Real browsers don’t bat an eye at this. Most scrapers can’t manage it. (This wasn’t my idea; I link to the inspiration for it. I just included my Caddy-specific instructions for implementing it.)
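The shape of it in a Caddyfile is roughly this (my sketch, not the linked instructions; the cookie name and matcher details are arbitrary):

    # Sketch of the cookie check; "shibboleth" is an arbitrary cookie name.
    @noshibboleth not header_regexp Cookie shibboleth=
    handle @noshibboleth {
        header Content-Type "text/html; charset=utf-8"
        respond "<script>document.cookie = 'shibboleth=1; path=/'; location.reload();</script>" 200
    }
    # Real browsers set the cookie and reload; most scrapers never get past this.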
If there is a common text pool used across sites, maybe that will get the attention of bot developers and automatically force them to back down when they see such responses.
Make only the HEAD of each branch available. Anyone who wants more detail has to clone it and view it with their favourite git client.
For example https://mitxela.com/projects/web-git-sum (https://git.mitxela.com/)
Put it all behind an OAuth login using something like Keycloak and integrate that into something like GitLab, Forgejo, Gitea if you must.
However. To host git, all you need is a user and ssh. You don’t need a web ui. You don’t need port 443 or 80.
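For example (a sketch; the user name and repo path are placeholders):

    # A locked-down user whose login shell only accepts git commands.
    sudo adduser --disabled-password --shell "$(command -v git-shell)" git
    sudo -u git git init --bare /home/git/project.git   # placeholder repo name

    # Clients clone and push over plain ssh; no web server involved:
    git clone git@example.com:project.git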
I'm actually not sure how I would go about stopping AI crawlers that are reasonably well behaved considering they apparently don't identify themselves correctly and will ignore robots.txt.
I would assume any halfway competent LLM-driven scraper would see a mass of 404s and stop. If they're just collecting data to train LLMs, these seem like exceptionally poorly written and abusive scrapers, written the normal way but by more bad actors.
Are we seeing these scrapers using LLMs to bypass auth or run more sophisticated flows? I have not worked on bot detection the last few years, but it was very common for residential proxy based scrapers to hammer sites for years, so I'm wondering what's different.
as always: imho. (!)
idk ... i just put a http basic-auth in front of my gitweb instance years ago.
if i really ever want to put git-repositories into the open web again i either push them to some portal - github, gitlab, ... - or start thinking about how to solve this ;))
just my 0.02€
Make sure your caches are warm and responses take no more than 5ms to construct.
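With nginx, a one-minute "microcache" gets you most of the way there (a sketch; the cache path and backend address are placeholders):

    http {
        # Tiny one-minute cache: repeated bot hits are served from cache
        # instead of regenerating the page every time.
        proxy_cache_path /var/cache/nginx keys_zone=microcache:10m max_size=1g inactive=10m;

        server {
            location / {
                proxy_cache microcache;
                proxy_cache_valid 200 301 1m;      # cache good responses for 60s
                proxy_cache_use_stale updating;    # serve stale while refreshing
                proxy_pass http://127.0.0.1:8080;  # placeholder backend
            }
        }
    }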
Cloudflare will even do it for free.
Maybe this is worth trying out first, if you are currently having issues.
On "private" services where I or my friends are the only users, I block everything except my country.
This btw is nothing new. Way back when I still used wordpress, it was quite common to see your server logs filling up with bots trying to access endpoints for commonly compromised php thingies. Probably still a thing but I don't spend a lot of time looking at logs. If you run a public server, dealing with maliciously intended but relatively harmless requests like that is just what you have to do. Stuff like that is as old as running stuff on public ports is.
And the offending parties writing sloppy code that barely works is also nothing new.
AI certainly has added a wave of opportunistic bot and scraper traffic, but it doesn't actually change the basic threat model in any fundamental way. Previously, version control servers were relatively low-value things to scrape. But code just became interesting for LLMs to train on.
Anyway, having any kind of thing responding on any port just invites opportunistic attempts to poke around. Anything that can be abused for DoS purposes might get abused for exactly that. If you don't like that, don't run stuff on public servers, or protect them properly. Yes, this is annoying and not necessarily easy. Cloud-based services exist that take some of that pain away.
Logs filling up with 404, 401, or 400 responses should not kill your server. You might want to implement some logic that tells repeat offenders 429 (too many requests). A bit heavy-handed, but why not. But if you are going to run something that could be used to DoS your server, don't be surprised if somebody does exactly that.
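In nginx that 429 logic is only a few lines (a sketch; zone name and rates are arbitrary):

    http {
        # Per-IP limit that answers offenders with 429 instead of the default 503.
        limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

        server {
            limit_req zone=perip burst=10 nodelay;
            limit_req_status 429;
        }
    }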
5 years ago there were few people with an active interest in scraping Forgejo instances and personal blogs. Now there are a bajillion companies and individuals getting data to train a model or throw into RAG or whatever.
Having a better scraper means more data, which means a better model (handwavily) so it’s a competitive advantage. And writing a good, well-behaved distributed scraper is non-trivial.
I don’t think they mean scrapers necessarily driven by LLMs, but scrapers collecting data to train LLMs.
You said it yourself. If you're selling a cure, you might as well start a plague.
Watched it for a while, thinking eventually it'd end. It didn't; it seemed like ClaudeBot and GPTBot (which were the only two I saw, but could have been forged) went over the same URLs over and over again. They tried a bunch of search queries too at the same time.
The day after, I got tired of seeing it, so I added a robots.txt forbidding any indexing. Waited a few hours, saw that they were still doing the same thing, so threw up basic authentication with `wiki:wiki` as the username:password, wrote the credentials on the page where I linked it, and as expected they stopped trying after that.
They don't seem to try to bypass anything; whatever you put in front will basically defeat them, except blocking them by user-agent: then they just switch to a browser-like user-agent, which is why I went the "trivial basic authentication" path instead.
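In nginx terms, the whole "protection" is something like this (a sketch, not my exact config; the htpasswd path is arbitrary):

    # Credentials are public ("wiki:wiki"); the point is only that bots give up.
    # Created with: htpasswd -bc /etc/nginx/wiki.htpasswd wiki wiki
    location / {
        auth_basic           "wiki";
        auth_basic_user_file /etc/nginx/wiki.htpasswd;
    }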
Wasn't really an issue, just annoying when they try to masquerade as normal users. Had the same issue with a wiki instance: added rate limits, and eventually they seemingly backed off more than my limits were set to, so I guess they eventually got it. Just checked the logs, and it seems they've stopped trying completely.
It seems like people who are paying for their hosting by usage (which never made sense to me) are the ones hit hardest by this. I'm hosting my stuff on a VPS and don't understand what the big issue is; worst-case scenario, I'd add more aggressive caching and it wouldn't be an issue anymore.
Just a few years ago badly behaved scrapers were rare enough not to be worth worrying about. Today they are such a menace that hooking any dynamic site up to a pay-to-scale hosting platform like Vercel or Cloud Run can trigger terrifying bills on very short notice.
"It's for AI" feels like lazy reasoning for me... but what IS it for?
One guess: maybe there's enough of a market now for buying freshly updated scrapes of the web that it's worth a bunch of chancers running a scrape. But who are the customers?
You don't really need to guess, it's obvious from the access logs. I realize not everyone runs their own server, so here are a couple excerpts from mine to illustrate:
- "meta-externalagent/1.1 +https://developers.facebook.com/docs/sharing/webmasters/craw...)"
- "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
- "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36"
- "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)"
- [...] (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
And to give a sense of scale: my cgit instance received 37 212 377 requests over the last 60 days, >99% of which are bots. The access.log from nginx grew to 12 GiB in those 60 days. They scrape everything they can find, indiscriminately, including endpoints that have to do quite a bit of work, leading to a baseline 30-50% CPU utilization on that server right now.
Oh, and of course, almost nothing of what they are scraping actually changed in the last 60 days; it's literally just a pointless waste of compute and bandwidth. I'm actually surprised that the hosting companies haven't blocked all of them yet, since this has to increase their energy bills substantially.
Some bots also seem better behaved than others; OpenAI alone accounts for 26 million of those 37 million requests.
While throwing out all users who don't opt in to JavaScript, using NoScript or uBlock or something like it, may be acceptable collateral damage to you, it might be good to keep in mind that this plays right into Big Adtech's playbook. They spent over two decades normalizing the behavior of running a hundred or more programs of untrusted origin on every page load, and treating users who decline to run code in a document browser with suspicion. Not everyone would like to hand that power over to them on a silver platter with a neat little bow on top.
location ~ commit/* {
    return 404;
}

maybe something like https://ssheasy.com/ or similar could also be used? or maybe even a gotty/xterm instance which could automatically ssh in and present a TUI-like interface.
I feel as if this would be enough for all scrapers?
We used nginx config to prevent access to individual commits, while still leaving the "rest" of what gitea makes available read-only for non-auth'ed access unaffected.
Why? Data. Every bit of it might be valuable. And not to sound tin-foil-hatty, but we are getting closer to a post-quantum time (if we aren't there already).
I have no idea if it actually works as advertised though. I don't think I've heard from anyone trying it.
It's a race to the bottom. What's different is we're much closer to the bottom now.
- Caching helps, but is nowhere near a complete solution. Of the 4M requests I've observed 1.5M unique paths, which still overloads my server.
- Limiting request time might work, but is more likely to just cause issues for legitimate visitors. 5ms is not a lot for cgit, but with a higher limit you are unlikely to keep up with the flood of requests.
- IP ratelimiting is useless. I've observed 2M unique IPs, and the top one from the botnet only made 400 well-spaced-out requests.
- GeoIP blocking does wonders: just 5 countries (VN, US, BR, BD, IN) are responsible for 50% of all requests. Unfortunately, this also causes problems for legitimate users (see the sketch after this list).
- User-Agent blocking can catch some odd requests, but I haven't been able to make much use of it besides adding a few static rules. Maybe it could do more with TLS request fingerprinting, but that doesn't seem trivial to set up on nginx.
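For reference, here is what the GeoIP approach looks like (a sketch assuming the third-party ngx_http_geoip2 module and a MaxMind country database; the country list is an example subset of the heavy hitters above):

    # http context; requires the geoip2 module and a country database.
    geoip2 /var/lib/GeoIP/GeoLite2-Country.mmdb {
        $geoip2_country_code country iso_code;
    }

    # Map the client's country code to a block flag.
    map $geoip2_country_code $blocked_country {
        default 0;
        VN 1;
        BR 1;
        BD 1;
    }

    server {
        if ($blocked_country) { return 403; }
    }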
Let's not forget that scrapers can be quite stupid. For example, if you have phpBB installed, which by default puts the session ID in a query parameter if cookies are disabled, many scrapers will scrape every URL numerous times, each with a different session ID. A cache also doesn't help you here, since URLs are unique per visitor.
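One workaround is to normalize the cache key by stripping phpBB's sid parameter in nginx (a sketch; the regex is crude but consistent enough to deduplicate the cache):

    # http context: rebuild $args without the "sid" parameter so all
    # session variants of a URL share one cache entry.
    map $args $args_nosid {
        default                          $args;
        "~(.*)(?:^|&)sid=[0-9a-f]+(.*)"  "$1$2";
    }

    server {
        proxy_cache_key "$scheme$host$uri$args_nosid";
    }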
self-hosting was originally a "right" we had upon gaining access to the internet in the 90s; it was the main point of the hypertext transfer protocol.
The crawlers for the big famous names in AI are all less well behaved and more voracious than, say, Googlebot. Though this is all somewhat muddied by the companies that ran the former "good" crawlers all also being in the AI business, and sometimes trying to piggyback on people having allowed or whitelisted their search crawling User-Agent. Mostly this has settled a little now that they're separating Googlebot from GoogleOther, facebookexternalhit from meta-externalagent, etc. This was an earlier "wave" of increased crawling that was obviously attributable to AI development. In some cases it's still problematic, but this is generally more manageable.
The other stuff, the ones that are using every User-Agent under the sun and a zillion datacenter IPs and residential IPs and rotate their requests constantly so all your naive and formerly-ok rate-based blocking is useless... that stuff is definitely being tagged as "for AI" on the basis of circumstantial evidence. But from the timing of when it seemed to start, the amount of traffic and addresses, I don't have any problem guessing with pretty high confidence that this is AI. To your question of "who are the customers"... who's got all the money in the world sloshing around at their fingertips and could use a whole bunch of scraped pages about ~everything? Call it lazy reasoning if you'd like.
How much this traces back ultimately to the big familiar brand names vs. would-be upstarts, I don't know. But a lot of sites are blocking their crawlers that admit who they are, so would I be surprised to see that they're also paying some shady subcontractors for scrapes and don't particularly care about the methods? Not really.
Maybe it's time for me to go ahead and start it again with logging enabled, to see what actually shows up in the logs.
I will maybe test it all three ways: 1) with CF tunnels + AI block, 2) only CF tunnels, 3) directly on a static IP. Maybe you can try the experiment too and we can compare our findings. (Also saying this because I am lazy: I had misconfigured that CF tunnel, so when it quit I was too lazy to restart the VPS, given I just use it as a playground and just wanted to play around with self-hosting. But maybe I will do it again now.)
It used to be that you needed to implement some papers to do sentiment analysis; a reasonably high barrier to entry. Now anyone can do it, and the result is more people doing scraping (with less competent scrapers, too).
Well they are scraping web pages from a git forge, where they could just, you know, clone the repo(s) instead.
https://bandie91.github.io/dumb-http-git-browser-js-app/ui.h...
I added a robots.txt with explicit UAs for known scrapers (they seem to ignore wildcards), and after a few days the traffic died down completely and I've had no problem since.
Git frontends are basically a tarpit, so they are uniquely vulnerable to this, but I wonder if these folks actually tried a good robots.txt? I know it's wrong that the bots ignore wildcards, but naming them explicitly does seem to solve the issue.
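Concretely, something like this, naming each bot instead of relying on the wildcard (the UA list is an example; use whatever shows up in your logs):

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: Amazonbot
    Disallow: /

    # Keep the wildcard too, even if some crawlers ignore it.
    User-agent: *
    Disallow: /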
Maybe everyone is trying to take advantage of the situation before the law eventually catches up.
> ChatGPT-User is not used for crawling the web in an automatic fashion. Because these actions are initiated by a user, robots.txt rules may not apply.
So, not AI training in this case, nor any other large-batch scraping, but rather inference-time Retrieval Augmented Generation, with the "retrieval" happening over the web?
This has zero to do with Adtech for 99.99% of uses, either. Web devs like to write TypeScript and React because that's a very pleasant tech stack for writing web apps, and it's not worth the effort for them to support a deliberately hamstrung browser for < 0.1% of users (according to a recent Google report).
See also: feel free to disable PNG rendering, but I'm not going to lift a finger to convert everything to GIFs.
Imagine a task to enumerate every possible read-only command you could make against a Git repo, and then imagine a farm of scrapers running exactly one of them per IP address.
Ugh.
It seems to me to be just as likely that people are installing LLM chatbot apps that do the occasional bit of scraping work on the sly, covered by some agreed EULA.
http {
    # ... other http settings

    # One shared 10 MB zone keyed on client IP, 10 requests/second.
    limit_req_zone $binary_remote_addr zone=mylimit:10m rate=10r/s;
    # ...

    server {
        # ... other server settings
        location / {
            # Allow short bursts of 20 requests, reject the rest.
            limit_req zone=mylimit burst=20 nodelay;
            # ... proxy_pass or other location-specific settings
        }
    }
}
Rate limit read-only access at the very least. I know this is a hard problem for open source projects that have relied on web access like this for a while. Anubis? We should be able to achieve close to the same results with some configuration changes.
AWS / Azure / Cloudflare total centralization means no one will be able to self host anything, which is exactly the point of this post.
That Cloudflare is trying to monetise “protection from AI” is just another grift in the sense that they can’t help themselves as a corp.
So yes, they are definitely running scrapers that are this badly written.
Also, old scraper bots trying to disguise themselves as GPTBot seems wholly unproductive; they try to imitate users, not bots.
Yes, hence the "which were the only two I saw, but could have been forged".
> I'd love to see some of the web logs from this if you'd be willing to share!
Unfortunately not; I'm deleting any logs from the server after one hour, and also don't even log the full IP. I took a look now and none of the logs that still exist are from any user agent that looks like one of those bots.
I think the reason is that America and China are, for the most part, in an AI arms race combined with an AI bubble, and neither side wishes to lose any perceived advantage, no matter the cost to others.
Also, there is an immense lobbying effort against senators who propose stricter AI regulation.
https://www.youtube.com/watch?v=DUfSl2fZ_E8 [What OpenAI doesn't want you to know]
It's actually a great watch. Highly recommended, because a lot of the talk about regulation feels like smoke and mirrors to me.
But the sheer volume makes it unlikely that's the only reason. It's not like everybody constantly has questions about the same tiny website.
I think this hits the crux of the trend fairly well.
And is why I have so many workarounds to shitty JS in my user files.
Because I can't see your CSS, either.
I can't provide evidence as it's close to impossible to separate the AI bots using residential proxies from actual users, and their IPs are considered personal data. But as the other reply shows, it's easy enough to find people selling this service.
As for what you can do on your own, it really depends on your network. OpenWRT routers can run tcpdump, so you can check for suspicious connections or DNS requests, but it gets really hard to tell if you have lots of cloud-tethered devices at home. IoT, browser extensions, and smartphone applications are the usual suspects.
Your router may have the ability to log requests, but many don't, and even if yours does, if you're concerned the device may be compromised, how can you trust the logs?
BUT, with all that said, these attacks are typically not very sophisticated. Most of the time they're searching for routers at 192.168.1.1 with admin/admin as the login credentials. If you have anything else set, you're probably safe from 97% of attackers (this number is entirely made up, but seriously, that percentage is high). You can also check for security advisories on your model of router. If you find anything that allows remote access, assume you're compromised.
---
As a final note, it's more likely these days that the devices running these bots are IoT devices and web browsers with malicious javascript running.
Aside from the obvious smoke tests (are settings changing without your knowledge? Does your router expose access logs you can check?), I'm not sure there's any general purpose way to check, but 2 things you can do are:
1. search for your router's model number to see if it's known to be vulnerable, and replace it with a brand-new reputable one if so (and don't buy it from Amazon).
2. There are vendors out there selling "residential proxy IP databases" (e.g., [1]); no idea how good they are, but if you have a stable public IP address you could check whether you're on one.
I suspect that some of these folks are not interested in a proper solution. Being able to vaguely claim that the AI boogeyman is oppressing us has turned into quite the pastime.
your PNG/GIF thing is nonsense (false equivalence, at least) and seems like a deliberate attempt to insult
> I'm marginally sympathetic
you say that as if they've done some harm to you or anyone else. outside of these three words, you actually seem to see anyone doing this as completely invalid and that the correct course of action is to act like they don't exist.
Search for: "residential proxy" ai data scraping.
Start reading through thousands of articles.
But I think what OP is implying is insecure hardware being infected by malware, with access to that hardware sold as a service to disreputable actors. For that, buy a good-quality router and keep it up to date.
It's painful to have your site offline because a scraper has channeled itself 17,000 layers deep through tag links (which are set to nofollow, and ignored in robots.txt, but the scraper doesn't care). And it's especially annoying when that happens on a daily basis.
Not everyone wants to put their site behind Cloudflare.
Because this is something which is happening continuously, and I have observed so many HN posts like these (Anubis, iirc, was created by its author out of such frustration too): git servers being scraped to the point that it's effectively a DDoS.
Herein lies the problem. And if you block them, you risk blocking actual customers.
Ok, it is over. End of an era for me. No more self-hosted git. I had a public git server running since 2011, and a public cvs server before that. AI scrapers have hammered the poor, little server to death by flooding the cgit frontend with tons of pointless² requests. Actually a few months ago already.
Now I finally decided not to try to rebuild the server, be it with or without the cgit web frontend. I don't feel like taking up the fight with the scrapers in my spare time; I leave that to people who are in a better position to do so. Most repositories had mirrors on one or two of the large gitforges already. Those are the primary repositories now. Go look at gitlab and github.
Last week I've fixed all (I hope) dangling links to the cgit repositories to point to the forges instead.
Now I'm down to one self-hosted service, which is the webserver hosting mainly this blog and a few more little things. In 2018 I migrated the blog from wordpress to jekyll, so it is all static pages. Taking this out by AI scrapers overloading the machine should be next to impossible, and so far this has held up.
Nevertheless, AI scrapers already managed to trigger one outage. Apparently millions of 404 answers were not enough to convince the bots that there is no cgit service (any more). Apache had no problems delivering those, but the logs filled up the disk so fast that logrotate didn't manage to keep things under control with the default configuration. Fixed config. Knock on wood.
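The fix amounts to rotating by size rather than once per day, something along these lines (a sketch, not my exact config; paths assume a Debian-style Apache setup):

    /var/log/apache2/*.log {
        # Rotate as soon as a log reaches 500M, not just once per day,
        # so a request flood can't fill the disk between cron runs.
        size 500M
        rotate 4
        compress
        missingok
        notifempty
        postrotate
            systemctl reload apache2
        endscript
    }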
¹ Title inspired by the 2025 edition of Security Nightmares. Fun watching if you speak German.
² Most inefficient way to get the complete repo. Just clone it, ok?
FWIW, you're literally in a comment thread where GP (me!) says "don't understand what the big issue is"...
> you say that as if they've done some harm to you or anyone else.
I was literally responding to someone referring to themselves as "collateral damage" and saying I'm playing into "Big Adtech's playbook". I explained why they're wrong.
> the correct course of action is to act like they don't exist.
Unless someone is making a site that explicitly targets users unwilling or unable to execute JavaScript, like an alternative browser that disables it by default or such, mathematically, yes, that's the correct course of action.
Thanks for the info, wish I didn't know :-(
2026-01-28 21'460
2026-01-29 27'770
2026-01-30 53'886
2026-01-31 100'114 #
2026-02-01 132'460 #
2026-02-02 73'933
2026-02-03 540'176 #####
2026-02-04 999'464 #########
2026-02-05 134'144 #
2026-02-06 1'432'538 ##############
2026-02-07 3'864'825 ######################################
2026-02-08 3'732'272 #####################################
2026-02-09 2'088'240 ####################
2026-02-10 573'111 #####
2026-02-11 1'804'222 ##################

1. The residential proxies
2. Scrapers, on behalf of or as an agent of the data buyer
3. Data buyer (ai training)
Scrapers are buying from residential proxies, giving the data buyer a bit of a shield/deniability.
The scrapers don't want to get outright blocked if they can avoid it, otherwise they have nothing to sell.
Thoughts on having an ssh server with https://github.com/charmbracelet/soft-serve instead?
I never removed anything, but I'll keep this in mind for the future.