Docker Hub is down (again)

UPDATE: It's up. But the link that previously was a status page now redirects to something else.

Details on this incident: https://www.dockerstatus.com/pages/history/533c6539221ae15e3...

Unacceptable level of communication during critical downtime. I know no one who was able to access or use Docker, but it is still listed in the history as partial service degradation.

Is this why Home Assistant on my Raspberry Pi cannot install the Matter server update from 8.1.0 to 8.1.1 that it is telling me is available?

It gives this error:

> Error during service call to update.install: Error updating Matter Server: Can't install homeassistant/aarch64-addon-matter-server:8.1.1: 401 Client Error for http+docker://localhost/v1.51/images/create?tag=8.1.1&fromImage=homeassistant%2Faarch64-addon-matter-server&platform=linux%2Farm64: Unauthorized ("unauthorized: authentication required")

From the "localhost" in the URL I assumed it was an error with a local Docker instance but I have no idea how HA actually works under the hood. I used the install method where you use the Raspberry Pi Imager to make a bootable HA RPi image and that takes complete control of the RPi. There's a Linux in there, but I've got no login on it. It is a complete black box to me with all my interaction through their web interface or the mobile app. Presumably it has to get 8.1.1 of the Matter server from somewhere, and if that is failing maybe it makes the localhost Docker fail too?

Thank goodness we don’t basically have a monoculture…right guys?

Crazy that we're 1 hour in and even basic authentication is still down... and no updates?!

It's funny that the status page is all green, and says "All Systems Operational"

got an update: [Identified] We are continuing to work on implementing a fix. We will update as the status evolves.

....

I wonder why they can't roll back

We will never learn. I want GitHub to go down for days. :D

Yes, that's an HTTP connection to the Docker Engine API on localhost failing because of the same issue: the Docker Engine can't negotiate with Docker Hub to get the new image and is passing the error back through the local API to your updater process.
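If it helps to see what the updater was actually asking for, here's a rough sketch that just decodes the URL from your error message with the Python standard library. Nothing here talks to Docker; the variable names are mine, and my reading of the `http+docker` scheme (a label the Python Docker client puts on requests to the local engine) is an assumption, not something the error itself confirms.

```python
from urllib.parse import urlsplit, parse_qs

# URL copied from the error message in the question above.
# Assumption: "http+docker" marks a request to the local Docker Engine,
# not a request sent directly to Docker Hub.
url = (
    "http+docker://localhost/v1.51/images/create"
    "?tag=8.1.1&fromImage=homeassistant%2Faarch64-addon-matter-server"
    "&platform=linux%2Farm64"
)

parts = urlsplit(url)

# parse_qs percent-decodes the values and returns a list per key;
# each key appears once here, so keep the first element.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(parts.path)  # the Engine API endpoint being called
print(params)      # which image, tag, and platform were requested
```

So the local engine was asked to create (pull) `homeassistant/aarch64-addon-matter-server` at tag `8.1.1` for `linux/arm64`, and the "authentication required" part is what came back when it tried to fetch that from the registry.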

probably because docker hub is down

Yeah, they turned yellow after around 15-30 minutes of the incident

I wonder why it isn't automated

Lol, that would actually be funny, they can't restart it because it would require pulling the image from itself.

Status pages stopped being automated a long time ago because they are bad PR.

Often you’d have dozens if not hundreds of services on a status page. If you have a major networking outage, for example, then everything is technically down. Someone screenshots the sea of red that your automated status page is showing and tweets “lol everything is down at [insert company]”. Then you get a million imverysmart people posting about single points of failure or whatever.

As a result status pages, in every place I know, require a human to actually declare the outage there. Internal ones are usually automated, but if your service is down due to dependency on another service, you don’t mark yourself as down.

Also most places I know of have moved away from public status alerts anyway. You get a customized alert in your account or email if you happen to be impacted by a particular outage. The public ones are for the very very _very_ bad outages.

My guess/experience - because there are probably layers of management and executives who have an uptime # in their OKRs or whatever is fashionable these days.

The decision to post anything about outages comes from the executive chain in many orgs lest they miss out on bonus compensation for the year.

This is the same reason services like docker and aws will very rarely call an outage an 'outage' - it's always 'service degradation', even when dockerhub is completely useless as it is right now.

I am surprised that they are "working on a fix" for more than 2 hours now, given the scope of the problem.

My understanding is that it's also a legal CYA. If you have SLAs in place, outages might mean you owe money. So companies tend to err on the side of underreporting.

Also this comes just a couple of days after a similar incident affected all of Spain

Are you referring to the blocking of Cloudflare when La Liga matches are played? That affects sites that use Cloudflare, but it's not the fault of Docker Hub.

Hacker Times
