Then all they know is the main domain, and you can somewhat hide in obscurity.
Of course I get some bot traffic including the OpenAI bot – although I just trust the user agent there, I have not confirmed that the origin IP address of the requests actually belongs to OpenAI.
That's just the internet. I like the bots, especially the wp-admin or .env ones. It's fun watching them doing their thing like little ants.
There's some security downside there if my web servers get hacked and my certs exfiltrated, but for a lot of stuff that tradeoff seems reasonable. I wouldn't recommend this approach if you were a bank or a government security agency or a drug cartel.
They would be far from first. Any time I create a wildcard cert in LE I immediately see a ton of sub-domain enumeration in my DNS query logs. Just for fun, I create a bunch of wildcard certs for domains I don't even use, just to keep their bots busy; not used as in not even parked domains. This has been going on about as long as the CT logs have existed.
It has long been common for scrapers to adopt the header patterns of search engine crawlers to hide in logs and bypass simple filters. The logical next step is for smaller AI players to present themselves as the largest players in the space.
Some search engines provide a list of their scraper IP ranges specifically so you can verify if scraper activity is really them or an imitator.
EDIT: Thanks to the comment below for looking this up and confirming this IP matches OpenAI’s range.
>useragent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; robots.txt;
- X happened
- Person P says "Ah, X happened."
- Person Q interprets this in a particular way
and says "Stop saying X is BAD!"
- Person R, who already knows about X...
(and indifferent to what others notice
or might know or be interested in)
...says "(yawn)".
- Person S narrowly looks at Person R and says
"Oh, so you think Repugnant-X is ok?"
What a train wreck. Such failure modes are incredibly common. And preventable.* What a waste of the collective hours of attention and thinking we are spending here that we could be using somewhere else. See also: the difference between positive and normative; charitable interpretations; not jumping to conclusions; not yucking someone else's yum
* So preventable that I am questioning the wisdom of spending time with any communication technology that doesn't actively address these failures. There is no point in blaming individuals when such failures are a near statistical certainty.
I just don't understand how people with no clue whatsoever about what's going on feel so confident to express outrage over something they don't even understand! I don't mind someone not knowing something. Everybody has to learn things somewhere for the first time. But couldn't they just keep their outrage to themselves and take some time to educate themselves, to find out whether that outrage is actually well placed?
Some of the comments in the OP are also misinformed or illogical. But there's one guy there correcting them so that's good. I mean I'd say that https://en.wikipedia.org/wiki/Certificate_Transparency or literally any other post about CT is going to be far more informative than this OP!
privacy doesn't exist in this world
(the site may occasionally fail to load)
The (presumably) unintended, unexpected purpose of the logs is to provide public notification of a website coming online for scrapers, search engines, and script kiddies to attack it: I could register https://verylongrandomdomainnameyoucantguess7184058382940052... and unwisely expect it to be unguessable, but as it turns out OpenAI is going to scrape it seconds after the certificate is issued.
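You can watch this happen yourself. A sketch using crt.sh's public JSON endpoint (the URL shape and the `name_value` field are assumed from crt.sh's output; the endpoint itself is often slow or rate-limited, so the parsing below runs on a sample response):

```python
import json
from urllib.parse import quote

def crtsh_url(domain):
    # "%." is crt.sh's wildcard prefix (match the domain and all subdomains);
    # quote() encodes the '%' as '%25' for the query string.
    return "https://crt.sh/?q=" + quote("%." + domain) + "&output=json"

print(crtsh_url("benjojo.uk"))
# https://crt.sh/?q=%25.benjojo.uk&output=json

# The response is a list of logged (pre)certificates; name_value holds
# newline-separated names. Parsed here from a hypothetical sample:
sample = json.loads('[{"name_value": "autoconfig.benjojo.uk\\nbenjojo.uk"}]')
names = sorted({n for e in sample for n in e["name_value"].split("\n")})
print(names)
```

Fetch the generated URL with any HTTP client to see every name ever logged for a domain — which is exactly the list a scraper gets for free.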
For example:
curl https://tuscolo2026h1.skylight.geomys.org/tile/names/000 | gunzip
(It doesn't deduplicate if the same domain name appears in multiple certificates, but it's still a substantial reduction in bandwidth compared to serving the entire (pre)certificate.)

We think we're so different from animals: https://en.wikipedia.org/wiki/Mimicry
This actually is a well-behaved crawler user agent because it identifies itself at the end.
Common Crawl's CCBot has published IP ranges. We aren't a search engine (although there are search engines using our data) and we like to describe our crawler as a crawler, not a "scraper".
The point of putting up a public web site is so the public can view it (including OpenAI/google/etc).
If I don’t want people viewing it, then I don’t make it public.
Saying that things are stolen when they aren’t clouds the issue.
P.S. In the hopes of making this more than just a sarcastic comment, the question of "How do people bootstrap knowledge?" is kind of interesting. [1]
> To tackle a hard problem, it is often wise to reuse and recombine existing knowledge. Such an ability to bootstrap enables us to grow rich mental concepts despite limited cognitive resources. Here we present a computational model of conceptual bootstrapping. This model uses a dynamic conceptual repertoire that can cache and later reuse elements of earlier insights in principled ways, modelling learning as a series of compositional generalizations. This model predicts systematically different learned concepts when the same evidence is processed in different orders, without any extra assumptions about previous beliefs or background knowledge. Across four behavioural experiments (total n = 570), we demonstrate strong curriculum-order and conceptual garden-pathing effects that closely resemble our model predictions and differ from those of alternative accounts. Taken together, this work offers a computational account of how past experiences shape future conceptual discoveries and showcases the importance of curriculum design in human inductive concept inferences.
https://openai.com/searchbot.json
I don't know if imitating a major crawler is really worth it, it may work against very naive filters, but it's easy to definitively check whether you're faking so it's just handing ammo to more advanced filters which do check.
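The definitive check is forward-confirmed reverse DNS (the method Google documents for verifying Googlebot): resolve the IP to a hostname, resolve that hostname back, and require the original IP among the results. A sketch with injectable resolvers so the logic can be shown offline (the stub records below are hypothetical):

```python
import socket

def forward_confirmed(ip, ptr_lookup=None, addr_lookup=None):
    """Forward-confirmed reverse DNS: IP -> PTR hostname -> A records,
    and the original IP must appear in those A records. A real filter
    would also require the hostname to end in e.g. .googlebot.com."""
    ptr_lookup = ptr_lookup or (lambda ip: socket.gethostbyaddr(ip)[0])
    addr_lookup = addr_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = ptr_lookup(ip)
        return ip in addr_lookup(host)
    except OSError:  # covers socket.herror / socket.gaierror
        return False

# Hypothetical stub records: anyone can set a lying PTR, but they can't
# make the forward lookup of googlebot.com point back at their IP.
fake_ptr = lambda ip: "crawl-1-2-3-4.googlebot.com"
honest_addrs = lambda host: ["1.2.3.4"]
lying_addrs = lambda host: ["9.9.9.9"]
print(forward_confirmed("1.2.3.4", fake_ptr, honest_addrs))  # True
print(forward_confirmed("1.2.3.4", fake_ptr, lying_addrs))   # False
```

This is exactly the ammo the faked user agent hands to any filter willing to do two DNS lookups.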
$ curl -I https://www.cloudflare.com
HTTP/2 200
$ curl -I -H "User-Agent: Googlebot" https://www.cloudflare.com
HTTP/2 403

The whole purpose of this data is to be consumed by third parties.
Certificate transparency logs are intended to be consumed by others. That is indeed what is happening. Not interesting.
https://www.merklemap.com/search?query=ycombinator.com&page=...
Entries are indexed by subdomain instead of by certificate (click an entry to see all certificates for that subdomain).
Also, you can search for any substring (that was quite the journey to implement so it's fast enough across almost 5B entries):
"I minted a new TLS cert and it seems that OpenAI is scraping CT logs for what I assume are things to scrape from, based on the near instant response from this:"
The reason presented by the blog post is "for what I assume are things to scrape from"
Putting aside the "assume" part (see below^1), is this also the reason that the other "systems" are "scraping" CT logs
After OpenAI "scrapes" then what does OpenAI do with the data (readers can guess)
But what about all the other "systems", i.e., parties that may use CT logs. If the logs are public then that's potentially a lot of different parties
Imagine in an age before the internet, telephone subscriber X sets up a new telephone line, the number is listed in a local telephone directory ("the phone book") and X immediately receives a phone call from telephone subscriber Z^2
X then writes an op-ed that suggests Z is using the phone book "for who to call"
This is only interesting if X explains why Z was calling or if the reader can guess why Z was calling
Anyone can use the phone book, anyone can use ICANN DNS, anyone can use CT logs, etc.
Why does someone use these public resources. Online commenter: "To look up names and numbers"
Correct. But that alone is not very interesting. Why are they looking up the names and numbers
1.
We can make assumptions about why someone is using a public resource, i.e., what they will use the data for. But that's all they are: assumptions
With the telephone, X could ask "Why are you calling?"
With the internet, that's not possible.^3 This leads to speculation and assumptions. Online commenters love to speculate, and often to make conclusions without evidence
No one knows _everything_ that OpenAI does with the data it collects except OpenAI employees. The public only knows about what OpenAI chooses to share
Similarly no one knows what OpenAI will do with the data in the future
One could speculate that it's naive to think that, in the long term, data collected by "AI" companies will only be used for "AI"
2. The telephone service also had the notion of "unlisted numbers", but that's another tangent for discussion
3. Hence for example people who do port scans of the IPv4 address space will try to prevent the public from accessing them by restricting access to "researchers", etc. Getting access always involves contacting the people with the scans and explaining what the requester will do with the data. In other words, removing speculation
> "ipv4Prefix": "74.7.175.128/25"
Makes me want to reconfigure my servers to just drop such traffic. If you can't be arsed to send a well-formed UA, I have doubts that you'll obey other conventions like robots.txt.
Also, may I know which DNS provider you went with? The GitHub issue pages for some of the DNS provider plugins seem to suggest that some are more frequently maintained than others.
EDIT: that's the flip side of supporting HTTPS that's not well-known among developers - by acquiring a legitimate certificate for your service to enable HTTPS, you also announce to the entire world, through a public log, that your service exists.
To me, this is evidence that SQL databases with high traffic can be made directly accessible on the public internet
crt.sh seems to be more accessible at certain times of the day. I remember when it had no such accessibility issues.
https://letsencrypt.org/2025/12/02/from-90-to-45#making-auto...
I run a web server and so see a lot of scrapers; OpenAI is one of the ones that appear to respect the limits you set. A lot of (if not most) others don't even meet that ethical standard, so I wouldn't say "OpenAI scrapes everything they can access. Everything" without qualification. That doesn't seem to be true, at least not until someone puts a file behind a robots deny rule and finds that ChatGPT (or another of OpenAI's products) has knowledge of it.
It’s been going on forever (remember how companies were reading files off your computer aka cookies in 1999?)
This seems like a total non-issue, and I expect that any public files are scraped by OpenAI and tons of others. If I don't want something scraped, I don't make it public.
It's the only website I know of where queries can just randomly fail for no reason, and they don't even have an automatic retry mechanism. Even the worst enterprise nightmares I've seen weren't this user unfriendly.
I'm also pretty content accepting the unpleasant parts of reality without spin or optimism. Sometimes the better choice is still crappy, after all ;) I think Oliver Burkeman makes a fun and thoughtful case in "The Antidote: Happiness for People Who Can't Stand Positive Thinking" https://www.goodreads.com/book/show/13721709-the-antidote
Looking at the README, is the idea that the certificates get generated on the DNS server itself? Not by the ACME client on each machine that needs a certificate? That seems like a confusing design choice to me. How do you get the certificate back to the web server that actually needs it? Or is the idea that you'd have a single server which acts as both the DNS server and the web server?
If you didn't want others to access your service, maybe consider putting it in a private space.
But what GP probably meant is that OAI definitely uses this log to get a list of new websites in order to scrape them later. This is a pretty standard way to use CT logs - you get a list of domains to scrape instead of relying solely on hyperlinks.
Oh, I read this as indicating OpenAI may make a move into the security space.
It's certainly news to me, and presumably some others, that this exists.
For example for "lillybank.com", I'll generate:
llllybank.com
liliybank.com
...
and countless others. Hundreds of thousands of entries. They are then null-routed from my unbound DNS resolver.
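A sketch of how such variant lists get generated (a tiny rule set — doubled letters, dropped letters, transpositions, and a small lookalike map; real blocklists use far larger rule sets, and the confusable pairs here are illustrative):

```python
def typo_variants(domain):
    """Generate simple typosquat candidates for the registrable label."""
    confusable = {"i": "l", "l": "i", "o": "0"}  # illustrative subset
    label, _, tld = domain.partition(".")
    out = set()
    for i, c in enumerate(label):
        out.add(label[:i] + c + label[i:])                      # double a letter
        out.add(label[:i] + label[i + 1:])                      # drop a letter
        if c in confusable:
            out.add(label[:i] + confusable[c] + label[i + 1:])  # lookalike swap
        if i + 1 < len(label):
            out.add(label[:i] + label[i + 1] + c + label[i + 2:])  # transpose
    out.discard(label)  # never emit the genuine domain itself
    return sorted(v + "." + tld for v in out)

variants = typo_variants("lillybank.com")
print("llllybank.com" in variants, "liliybank.com" in variants)  # True True
```

Even this toy generator reproduces both examples above; stacking rules (and applying them to every label) is how the count reaches hundreds of thousands.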
My browsers are forced into "corporate" settings where they cannot use DoH/DoT: it's all, between my browsers and my unbound resolver, in the clear.
All DNS UDP traffic that contains any Unicode domain name is blocked by the firewall. No DNS over TCP is allowed (and, no, I don't care).
I also block entire countries' TLD as well as entire countries' IP blocks.
I've been running a setup like that (plus many killfiles, and DNS resolvers known to block all known porn and malware sites, etc.) for years now. The Internet keeps working fine.
Substring doesn't seem like what I'd want in a subdomain search.
(Which is why I hate it that it's so hard to test things locally without having to get a domain and a certificate. I don't want to buy domain names and announce them publicly for the sake of some random script that needs to offer a HTTP endpoint.)
Modern security is introducing a lot of unexpected couplings into software systems, including coupling to political, social and physical reality, which is surprising if you think in terms of programs you write, which most likely shouldn't have any such relationships.
My favorite example of such unexpected coupling, whose failures are still regularly experienced by users, is wall clock time. If your program touches anything related to certificates, even indirectly, suddenly it's coupled to the actual real-world clock, and your users had better make sure their system time is in sync with the rest of the world, or things will stop working.
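The coupling is easy to see in miniature (hypothetical timestamps; real TLS stacks perform this same window check during every handshake):

```python
from datetime import datetime, timedelta, timezone

def cert_time_valid(not_before, not_after, now):
    """A certificate is only trusted inside its validity window."""
    return not_before <= now <= not_after

# Hypothetical 90-day (Let's Encrypt-style) certificate lifetime:
issued = datetime(2025, 12, 12, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)

# A correctly set clock, a month in: fine.
print(cert_time_valid(issued, expires, issued + timedelta(days=30)))   # True

# A machine whose clock is a year behind: every "valid" cert now fails.
print(cert_time_valid(issued, expires, issued - timedelta(days=365)))  # False
```

Nothing in the program changed; only the machine's opinion of "now" did — and that is the coupling to physical reality.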
Well, if you want only subdomains search for *.ycombinator.com.
https://www.merklemap.com/search?query=*.ycombinator.com&pag...
If certificate transparency is new to you, I feel like there are significantly more interesting articles and conversations that could/should have been submitted instead of "A public log intended for consumption exists, and a company is consuming that log". This post would do literally nothing to enlighten you about CT logs.
If the fact that OpenAI is scraping certificate transparency logs is new and interesting to you, I'd love to know why it is interesting. Perhaps I'm missing something.
Way more interesting reads for people unfamiliar with what certificate transparency is, in my opinion, than this "OpenAI read my CT log" post:
https://googlechrome.github.io/CertificateTransparency/log_s...
That’s not the intended use for CT logs.
if this is the article that introduces someone to the concept of certificate transparency, then there's nothing wrong with that. graciously, you followed through with links to what you consider more interesting; a lot of commenters don't, and just leave a snarky comment for someone who's one of the lucky 10,000 for the day.
Love it
Yes. What does it have to do with HTTPS?
> You hopefully also know that you can create your own certificate authority or self signed certificates and add them to your CA store.
Sorta, kinda. Does it actually work with third-party apps? Does it work with mobile systems? If not, then it's not a valid solution, because it doesn't allow me to run my stuff in my own networks without interfacing with the global Internet and social and political systems backing its cryptographic infrastructure.
https://github.com/Barre/ZeroFS#9p-recommended-for-better-pe...
benjojo posted 12 Dec 2025 20:46 +0000
lol.
I minted a new TLS cert and it seems that OpenAI is scraping CT logs for what I assume are things to scrape from, based on the near instant response from this:
Dec 12 20:43:04 xxxx xxx[719]:
l=debug
m="http request"
pkg=http
httpaccess=
handler=(nomatch)
method=get
url=/robots.txt
host=autoconfig.benjojo.uk
duration="162.176µs"
statuscode=404
proto=http/2.0
remoteaddr=74.7.175.182:38242
tlsinfo=tls1.3
useragent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; robots.txt; +https://openai.com/searchbot"
referrr=
size=19
cid=19b14416d95
wolf480pl@mstdn.io replied 12 Dec 2025 20:57 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/Gxy2qrCkn1Y327Y6D3
@benjojo
wp-login.php bots have been doing that for years so I'd be surprised if OpenAI didn't
benjojo replied 12 Dec 2025 21:10 +0000
in reply to: https://mstdn.io/users/wolf480pl/statuses/115708595554461422
@wolf480pl yeah and I guess it's a non terrible way of "seeding" a "search engine"
wolf480pl@mstdn.io replied 13 Dec 2025 12:59 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/NgH2Xwlp4KhCTwHjRL
@benjojo
what if CT logs contained hash(domain, nonce) instead of containing the domain in plain, and the nonce was part of the CT inclusion proof?
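(As a hypothetical sketch only — this is not how CT works — the proposed scheme would look something like:)

```python
import hashlib
import secrets

def blinded_entry(domain):
    """Log hash(nonce || domain) instead of the plain domain; the nonce
    would travel only inside the inclusion proof, per the proposal."""
    nonce = secrets.token_bytes(16)
    digest = hashlib.sha256(nonce + domain.encode()).hexdigest()
    return digest, nonce

def verify(digest, domain, nonce):
    # The domain owner, holding the nonce, can still locate and check
    # their own entry...
    return hashlib.sha256(nonce + domain.encode()).hexdigest() == digest

digest, nonce = blinded_entry("autoconfig.benjojo.uk")
print(verify(digest, "autoconfig.benjojo.uk", nonce))  # True
# ...but a scraper reading only the log sees an opaque digest, not a domain.
```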
benjojo replied 13 Dec 2025 14:53 +0000
in reply to: https://mstdn.io/users/wolf480pl/statuses/115712376924287199
@wolf480pl the point of certificate transparency logs is so that outside observers can do the double-checking of the CAs certificate and policy in full, if you mess with any part of this, the entire system becomes deeply exploitable and difficult to end to end verify
wolf480pl@mstdn.io replied 13 Dec 2025 15:55 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/lPLWBh3YCbFJBH4Dt6
@benjojo
oh, duh I need to be able to find who's issuing certs for my domain
and I'm guessing some people look at all certs issued by CAs and verify certain criteria that may require knowing the domains...
it's kinda sad that it provides domain enumeration, but I guess adding zero-knowledge proofs to the mix would've been too complex
benjojo replied 13 Dec 2025 18:00 +0000
in reply to: https://mstdn.io/users/wolf480pl/statuses/115713071072619432
@wolf480pl tbh domains are not really that secret, and if you depended on that then something was very wrong.
You can work around a lot of this stuff by "just" using wildcard certs instead
wolf480pl@mstdn.io replied 13 Dec 2025 18:07 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/pyX28McwZyTh14hy55
@benjojo
but then why bother with NSEC3...
benjojo replied 13 Dec 2025 23:29 +0000
in reply to: https://mstdn.io/users/wolf480pl/statuses/115713588719701003
@wolf480pl tbh I would argue why bother with DNSSEC (outside of extremely marginal situations), but NSEC3 even more
jamesog@mastodon.soc.. replied 12 Dec 2025 21:09 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/Gxy2qrCkn1Y327Y6D3
@benjojo It's interesting to watch web server logs to see what things pick up new CT entries the quickest