Then all they know is the main domain, and you can somewhat hide in obscurity.
Of course I get some bot traffic including the OpenAI bot – although I just trust the user agent there, I have not confirmed that the origin IP address of the requests actually belongs to OpenAI.
That's just the internet. I like the bots, especially the wp-admin or .env ones. It's fun watching them doing their thing like little ants.
There's some security downside there if my web servers get hacked and my certs exfiltrated, but for a lot of stuff that tradeoff seems reasonable. I wouldn't recommend this approach if you were a bank or a government security agency or a drug cartel.
They would be far from first. Any time I create a wildcard cert in LE I immediately see a ton of sub-domain enumeration in my DNS query logs. Just for fun, I create a bunch of wildcard certs for domains I don't even use, just to keep their bots busy; not used as in not even parked domains. This has been going on about as long as the CT logs have existed.
It has long been common for scrapers to adopt the header patterns of search engine crawlers to hide in logs and bypass simple filters. The logical next step is for smaller AI players to present themselves as the largest players in the space.
Some search engines provide a list of their scraper IP ranges specifically so you can verify if scraper activity is really them or an imitator.
EDIT: Thanks to the comment below for looking this up and confirming this IP matches OpenAI’s range.
>useragent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; robots.txt;
- X happened
- Person P says "Ah, X happened."
- Person Q interprets this in a particular way
and says "Stop saying X is BAD!"
- Person R, who already knows about X...
(and indifferent to what others notice
or might know or be interested in)
...says "(yawn)".
- Person S narrowly looks at Person R and says
"Oh, so you think Repugnant-X is ok?"
What a train wreck. Such failure modes are incredibly common. And preventable.* What a waste of the collective hours of attention and thinking we are spending here that we could be using somewhere else. See also: the difference between positive and normative; charitable interpretations; not jumping to conclusions; not yucking someone else's yum
* So preventable that I am questioning the wisdom of spending time with any communication technology that doesn't actively address these failures. There is no point in blaming individuals when such failures are a near statistical certainty.
I just don't understand how people with no clue whatsoever about what's going on feel so confident to express outrage over something they don't even understand! I don't mind someone not knowing something. Everybody has to learn things somewhere for the first time. But couldn't they just keep their outrage to themselves and take some time to educate themselves, to find out whether that outrage is actually well placed?
Some of the comments in the OP are also misinformed or illogical. But there's one guy there correcting them so that's good. I mean I'd say that https://en.wikipedia.org/wiki/Certificate_Transparency or literally any other post about CT is going to be far more informative than this OP!
privacy doesn't exist in this world
(the site may occasionally fail to load)
The (presumably) unintended, unexpected purpose of the logs is to provide public notification of a website coming online for scrapers, search engines, and script kiddies to attack it: I could register https://verylongrandomdomainnameyoucantguess7184058382940052... and unwisely expect it to be unguessable, but as it turns out OpenAI is going to scrape it seconds after the certificate is issued.
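You can watch this happen yourself. A sketch using crt.sh's public JSON endpoint (the URL shape and the `name_value` field are assumed from crt.sh's output; the endpoint itself is often slow or rate-limited, so the parsing below runs on a sample response):

```python
import json
from urllib.parse import quote

def crtsh_url(domain):
    # "%." is crt.sh's wildcard prefix (match the domain and all subdomains);
    # quote() encodes the '%' as '%25' for the query string.
    return "https://crt.sh/?q=" + quote("%." + domain) + "&output=json"

print(crtsh_url("benjojo.uk"))
# https://crt.sh/?q=%25.benjojo.uk&output=json

# The response is a list of logged (pre)certificates; name_value holds
# newline-separated names. Parsed here from a hypothetical sample:
sample = json.loads('[{"name_value": "autoconfig.benjojo.uk\\nbenjojo.uk"}]')
names = sorted({n for e in sample for n in e["name_value"].split("\n")})
print(names)
```

Fetch the generated URL with any HTTP client to see every name ever logged for a domain — which is exactly the list a scraper gets for free.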
For example:
curl https://tuscolo2026h1.skylight.geomys.org/tile/names/000 | gunzip
(It doesn't deduplicate if the same domain name appears in multiple certificates, but it's still a substantial reduction in bandwidth compared to serving the entire (pre)certificate.)

We think we're so different from animals: https://en.wikipedia.org/wiki/Mimicry
This actually is a well-behaved crawler user agent because it identifies itself at the end.
Common Crawl's CCBot has published IP ranges. We aren't a search engine (although there are search engines using our data) and we like to describe our crawler as a crawler, not a "scraper".
The point of putting up a public web site is so the public can view it (including OpenAI/google/etc).
If I don’t want people viewing it, then I don’t make it public.
Saying that things are stolen when they aren’t clouds the issue.
P.S. In the hopes of making this more than just a sarcastic comment, the question of "How do people bootstrap knowledge?" is kind of interesting. [1]
> To tackle a hard problem, it is often wise to reuse and recombine existing knowledge. Such an ability to bootstrap enables us to grow rich mental concepts despite limited cognitive resources. Here we present a computational model of conceptual bootstrapping. This model uses a dynamic conceptual repertoire that can cache and later reuse elements of earlier insights in principled ways, modelling learning as a series of compositional generalizations. This model predicts systematically different learned concepts when the same evidence is processed in different orders, without any extra assumptions about previous beliefs or background knowledge. Across four behavioural experiments (total n = 570), we demonstrate strong curriculum-order and conceptual garden-pathing effects that closely resemble our model predictions and differ from those of alternative accounts. Taken together, this work offers a computational account of how past experiences shape future conceptual discoveries and showcases the importance of curriculum design in human inductive concept inferences.
https://openai.com/searchbot.json
I don't know if imitating a major crawler is really worth it, it may work against very naive filters, but it's easy to definitively check whether you're faking so it's just handing ammo to more advanced filters which do check.
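The definitive check is forward-confirmed reverse DNS (the method Google documents for verifying Googlebot): resolve the IP to a hostname, resolve that hostname back, and require the original IP among the results. A sketch with injectable resolvers so the logic can be shown offline (the stub records below are hypothetical):

```python
import socket

def forward_confirmed(ip, ptr_lookup=None, addr_lookup=None):
    """Forward-confirmed reverse DNS: IP -> PTR hostname -> A records,
    and the original IP must appear in those A records. A real filter
    would also require the hostname to end in e.g. .googlebot.com."""
    ptr_lookup = ptr_lookup or (lambda ip: socket.gethostbyaddr(ip)[0])
    addr_lookup = addr_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = ptr_lookup(ip)
        return ip in addr_lookup(host)
    except OSError:  # covers socket.herror / socket.gaierror
        return False

# Hypothetical stub records: anyone can set a lying PTR, but they can't
# make the forward lookup of googlebot.com point back at their IP.
fake_ptr = lambda ip: "crawl-1-2-3-4.googlebot.com"
honest_addrs = lambda host: ["1.2.3.4"]
lying_addrs = lambda host: ["9.9.9.9"]
print(forward_confirmed("1.2.3.4", fake_ptr, honest_addrs))  # True
print(forward_confirmed("1.2.3.4", fake_ptr, lying_addrs))   # False
```

This is exactly the ammo the faked user agent hands to any filter willing to do two DNS lookups.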
$ curl -I https://www.cloudflare.com
HTTP/2 200
$ curl -I -H "User-Agent: Googlebot" https://www.cloudflare.com
HTTP/2 403

The whole purpose of this data is to be consumed by third parties.
Certificate transparency logs are intended to be consumed by others. That is indeed what is happening. Not interesting.
https://www.merklemap.com/search?query=ycombinator.com&page=...
Entries are indexed by subdomain instead of by certificate (click an entry to see all certificates for that subdomain).
Also, you can search for any substring (that was quite the journey to implement so it's fast enough across almost 5B entries):
"I minted a new TLS cert and it seems that OpenAI is scraping CT logs for what I assume are things to scrape from, based on the near instant response from this:"
The reason presented by the blog post is "for what I assume are things to scrape from"
Putting aside the "assume" part (see below^1), is this also the reason that the other "systems" are "scraping" CT logs
After OpenAI "scrapes" then what does OpenAI do with the data (readers can guess)
But what about all the other "systems", i.e., parties that may use CT logs. If the logs are public then that's potentially a lot of different parties
Imagine in an age before the internet, telephone subscriber X sets up a new telephone line, the number is listed in a local telephone directory ("the phone book") and X immediately receives a phone call from telephone subscriber Z^2
X then writes an op-ed that suggests Z is using the phone book "for who to call"
This is only interesting if X explains why Z was calling or if the reader can guess why Z was calling
Anyone can use the phone book, anyone can use ICANN DNS, anyone can use CT logs, etc.
Why does someone use these public resources. Online commenter: "To look up names and numbers"
Correct. But that alone is not very interesting. Why are they looking up the names and numbers
1.
We can make assumptions about why someone is using a public resource, i.e., what they will use the data for. But that's all they are: assumptions
With the telephone, X could ask "Why are you calling?"
With the internet, that's not possible.^3 This leads to speculation and assumptions. Online commenters love to speculate, and often to make conclusions without evidence
No one knows _everything_ that OpenAI does with the data it collects except OpenAI employees. The public only knows about what OpenAI chooses to share
Similarly no one knows what OpenAI will do with the data in the future
One could speculate that it's naive to think that, in the long term, data collected by "AI" companies will only be used for "AI"
2. The telephone service also had the notion of "unlisted numbers", but that's another tangent for discussion
3. Hence for example people who do port scans of the IPv4 address space will try to prevent the public from accessing them by restricting access to "researchers", etc. Getting access always involves contacting the people with the scans and explaining what the requester will do with the data. In other words, removing speculation
> "ipv4Prefix": "74.7.175.128/25"
Makes me want to reconfigure my servers to just drop such traffic. If you can't be arsed to send a well-formed UA, I have doubts that you'll obey other conventions like robots.txt.
Also, may I know which DNS provider you went with? The GitHub issue pages for some of the DNS provider plugins seem to suggest that some are more frequently maintained than others.
EDIT: that's the flip side of supporting HTTPS that's not well-known among developers - by acquiring a legitimate certificate for your service to enable HTTPS, you also announce to the entire world, through a public log, that your service exists.
To me, this is evidence that SQL databases with high traffic can be made directly accessible on the public internet
crt.sh seems to be more accessible at certain times of the day. I remember when it had no such accessibility issues.
https://letsencrypt.org/2025/12/02/from-90-to-45#making-auto...
I run a web server and so see a lot of scrapers; OpenAI is one of the ones that appear to respect the limits you set. A lot of (if not most) others don't even meet that ethical standard, so I wouldn't say "OpenAI scrapes everything they can access. Everything" without qualification. That doesn't seem to be true, at least not until someone puts a file behind a robots deny rule and finds that ChatGPT (or another of OpenAI's products) has knowledge of it.
It’s been going on forever (remember how companies were reading files off your computer aka cookies in 1999?)
This seems like a total non-issue, and I expect that any public files are scraped by OpenAI and tons of others. If I don't want something scraped, I don't make it public.
It's the only website I know of where queries can just randomly fail for no reason, and they don't even have an automatic retry mechanism. Even the worst enterprise nightmares I've seen weren't this user unfriendly.
I'm also pretty content accepting the unpleasant parts of reality without spin or optimism. Sometimes the better choice is still crappy, after all ;) I think Oliver Burkeman makes a fun and thoughtful case in "The Antidote: Happiness for People Who Can't Stand Positive Thinking" https://www.goodreads.com/book/show/13721709-the-antidote
Looking at the README, is the idea that the certificates get generated on the DNS server itself? Not by the ACME client on each machine that needs a certificate? That seems like a confusing design choice to me. How do you get the certificate back to the web server that actually needs it? Or is the idea that you'd have a single server which acts as both the DNS server and the web server?
If you didn't want others to access your service, maybe consider putting it in a private space.
But what GP probably meant is that OAI definitely uses this log to get a list of new websites in order to scrape them later. This is a pretty standard way to use CT logs - you get a list of domains to scrape instead of relying solely on hyperlinks.
Oh, I read this as indicating OpenAI may make a move into the security space.
It's certainly news to me, and presumably some others, that this exists.
For example for "lillybank.com", I'll generate:
llllybank.com
liliybank.com
...
and countless others. Hundreds of thousands of entries. They are then null-routed from my unbound DNS resolver.
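A sketch of how such variant lists get generated (a tiny rule set — doubled letters, dropped letters, transpositions, and a small lookalike map; real blocklists use far larger rule sets, and the confusable pairs here are illustrative):

```python
def typo_variants(domain):
    """Generate simple typosquat candidates for the registrable label."""
    confusable = {"i": "l", "l": "i", "o": "0"}  # illustrative subset
    label, _, tld = domain.partition(".")
    out = set()
    for i, c in enumerate(label):
        out.add(label[:i] + c + label[i:])                      # double a letter
        out.add(label[:i] + label[i + 1:])                      # drop a letter
        if c in confusable:
            out.add(label[:i] + confusable[c] + label[i + 1:])  # lookalike swap
        if i + 1 < len(label):
            out.add(label[:i] + label[i + 1] + c + label[i + 2:])  # transpose
    out.discard(label)  # never emit the genuine domain itself
    return sorted(v + "." + tld for v in out)

variants = typo_variants("lillybank.com")
print("llllybank.com" in variants, "liliybank.com" in variants)  # True True
```

Even this toy generator reproduces both examples above; stacking rules (and applying them to every label) is how the count reaches hundreds of thousands.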
My browsers are forced into "corporate" settings where they cannot use DoH/DoT: it's all, between my browsers and my unbound resolver, in the clear.
All DNS UDP traffic that contains any Unicode domain name is blocked by the firewall. No DNS over TCP is allowed (and, no, I don't care).
I also block entire countries' TLD as well as entire countries' IP blocks.
I've been running a setup like that (plus many killfiles, and DNS resolvers known to block all known porn and malware sites, etc.) for years now. The Internet keeps working fine.
Substring doesn't seem like what I'd want in a subdomain search.
(Which is why I hate it that it's so hard to test things locally without having to get a domain and a certificate. I don't want to buy domain names and announce them publicly for the sake of some random script that needs to offer a HTTP endpoint.)
Modern security is introducing a lot of unexpected couplings into software systems, including coupling to political, social and physical reality, which is surprising if you think in terms of programs you write, which most likely shouldn't have any such relationships.
My favorite example of such unexpected coupling, whose failures are still regularly experienced by users, is wall clock time. If your program touches anything related to certificates, even indirectly, suddenly it's coupled to the actual real-world clock, and your users had better make sure their system time is in sync with the rest of the world, or things will stop working.
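The coupling is easy to see in miniature (hypothetical timestamps; real TLS stacks perform this same window check during every handshake):

```python
from datetime import datetime, timedelta, timezone

def cert_time_valid(not_before, not_after, now):
    """A certificate is only trusted inside its validity window."""
    return not_before <= now <= not_after

# Hypothetical 90-day (Let's Encrypt-style) certificate lifetime:
issued = datetime(2025, 12, 12, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)

# A correctly set clock, a month in: fine.
print(cert_time_valid(issued, expires, issued + timedelta(days=30)))   # True

# A machine whose clock is a year behind: every "valid" cert now fails.
print(cert_time_valid(issued, expires, issued - timedelta(days=365)))  # False
```

Nothing in the program changed; only the machine's opinion of "now" did — and that is the coupling to physical reality.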
Well, if you want only subdomains search for *.ycombinator.com.
https://www.merklemap.com/search?query=*.ycombinator.com&pag...
If certificate transparency is new to you, I feel like there are significantly more interesting articles and conversations that could/should have been submitted instead of "A public log intended for consumption exists, and a company is consuming that log". This post would do literally nothing to enlighten you about CT logs.
If the fact that OpenAI is scraping certificate transparency logs is new and interesting to you, I'd love to know why it is interesting. Perhaps I'm missing something.
Way more interesting reads for people unfamiliar with what certificate transparency is, in my opinion, than this "OpenAI read my CT log" post:
https://googlechrome.github.io/CertificateTransparency/log_s...
That’s not the intended use for CT logs.
if this is the article that introduces someone to the concept of certificate transparency, then there's nothing wrong with that. graciously, you followed through with links to what you consider more interesting; a lot of commenters don't, and just leave a snarky comment for someone who's one of the lucky 10,000 for the day.
Love it
Yes. What does it have to do with HTTPS?
> You hopefully also know that you can create your own certificate authority or self signed certificates and add them to your CA store.
Sorta, kinda. Does it actually work with third-party apps? Does it work with mobile systems? If not, then it's not a valid solution, because it doesn't allow me to run my stuff in my own networks without interfacing with the global Internet and social and political systems backing its cryptographic infrastructure.
https://github.com/Barre/ZeroFS#9p-recommended-for-better-pe...
benjojo posted 12 Dec 2025 20:46 +0000
lol.
I minted a new TLS cert and it seems that OpenAI is scraping CT logs for what I assume are things to scrape from, based on the near instant response from this:
Dec 12 20:43:04 xxxx xxx[719]:
l=debug
m="http request"
pkg=http
httpaccess=
handler=(nomatch)
method=get
url=/robots.txt
host=autoconfig.benjojo.uk
duration="162.176µs"
statuscode=404
proto=http/2.0
remoteaddr=74.7.175.182:38242
tlsinfo=tls1.3
useragent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; robots.txt; +https://openai.com/searchbot"
referrr=
size=19
cid=19b14416d95
wolf480pl@mstdn.io replied 12 Dec 2025 20:57 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/Gxy2qrCkn1Y327Y6D3
@benjojo
wp-login.php bots have been doing that for years so I'd be surprised if OpenAI didn't
benjojo replied 12 Dec 2025 21:10 +0000
in reply to: https://mstdn.io/users/wolf480pl/statuses/115708595554461422
@wolf480pl yeah and I guess it's a non terrible way of "seeding" a "search engine"
wolf480pl@mstdn.io replied 13 Dec 2025 12:59 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/NgH2Xwlp4KhCTwHjRL
@benjojo
what if CT logs contained hash(domain, nonce) instead of containing the domain in plain, and the nonce was part of the CT inclusion proof?
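(As a hypothetical sketch only — this is not how CT works — the proposed scheme would look something like:)

```python
import hashlib
import secrets

def blinded_entry(domain):
    """Log hash(nonce || domain) instead of the plain domain; the nonce
    would travel only inside the inclusion proof, per the proposal."""
    nonce = secrets.token_bytes(16)
    digest = hashlib.sha256(nonce + domain.encode()).hexdigest()
    return digest, nonce

def verify(digest, domain, nonce):
    # The domain owner, holding the nonce, can still locate and check
    # their own entry...
    return hashlib.sha256(nonce + domain.encode()).hexdigest() == digest

digest, nonce = blinded_entry("autoconfig.benjojo.uk")
print(verify(digest, "autoconfig.benjojo.uk", nonce))  # True
# ...but a scraper reading only the log sees an opaque digest, not a domain.
```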
benjojo replied 13 Dec 2025 14:53 +0000
in reply to: https://mstdn.io/users/wolf480pl/statuses/115712376924287199
@wolf480pl the point of certificate transparency logs is so that outside observers can do the double-checking of the CAs certificate and policy in full, if you mess with any part of this, the entire system becomes deeply exploitable and difficult to end to end verify
wolf480pl@mstdn.io replied 13 Dec 2025 15:55 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/lPLWBh3YCbFJBH4Dt6
@benjojo
oh, duh I need to be able to find who's issuing certs for my domain
and I'm guessing some people look at all certs issued by CAs and verify certain criteria that may require knowing the domains...
it's kinda sad that it provides domain enumeration, but I guess adding zero-knowledge proofs to the mix would've been too complex
benjojo replied 13 Dec 2025 18:00 +0000
in reply to: https://mstdn.io/users/wolf480pl/statuses/115713071072619432
@wolf480pl tbh domains are not really that secret, and if you depended on that then something was very wrong.
You can work around a lot of this stuff by "just" using wildcard certs instead
wolf480pl@mstdn.io replied 13 Dec 2025 18:07 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/pyX28McwZyTh14hy55
@benjojo
but then why bother with NSEC3...
benjojo replied 13 Dec 2025 23:29 +0000
in reply to: https://mstdn.io/users/wolf480pl/statuses/115713588719701003
@wolf480pl tbh I would argue why bother with DNSSEC (outside of extremely marginal situations), but NSEC3 even more
jamesog@mastodon.soc.. replied 12 Dec 2025 21:09 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/Gxy2qrCkn1Y327Y6D3
@benjojo It's interesting to watch web server logs to see what things pick up new CT entries the quickest