[Solved] Galaxy Docker Registry Downtime

nauzer · July 18, 2021, 6:24am

Our Galaxy applications have crashed with the following log entries on Galaxy.
2021-07-18 04:31:14+02:00 The container is being stopped because Galaxy is replacing the machine it’s running on. fjahz
2021-07-18 04:31:16+02:00 Application exited with signal: terminated app
2021-07-18 04:31:25+02:00 The app is unavailable, this is not a Galaxy issue, you should analyze your app, read more here Container environment | Galaxy Docs

New deploys are rejected with the following errors:
Talking to Galaxy servers at https://eu-west-1.galaxy-deploy.meteor.com
Error deploying application: 502 Bad Gateway: Registered endpoints failed to handle the request. If you’re the owner of this app, see this article for more information: Error Types | Galaxy Docs

Furthermore, accessing Galaxy itself is very unstable: page loads take forever and sometimes fail with the same error:
502 Bad Gateway: Registered endpoints failed to handle the request. If you’re the owner of this app, see this article for more information: Error Types | Galaxy Docs

Stop/start, scale up/down and new deploys ALL fail, without any log entries.
Trying to deploy as a new application now to see if we can redirect DNS.

nauzer · July 18, 2021, 6:46am

More findings:

Deploying a NEW application does not work. It builds fine locally and uploads to Galaxy. Galaxy successfully builds the Docker image and writes it to the registry, but afterwards it cannot spin up a container from it. No further logs are visible:

[Screenshot 2021-07-18 at 08.38.42]
Scaling up other applications by adding containers does NOT work. The container cannot be started.
Resizing the containers of applications that were still running does NOT work either. It causes the application to become unavailable as well.
I suspect the Galaxy Dashboard is a Meteor app itself, since it seems to be experiencing the same kind of problems.

nauzer · July 18, 2021, 6:54am

Conclusion so far:

It looks like Galaxy is having problems spinning up containers from any (NEW or EXISTING) Docker image. See errors above.

ALL affected applications in our organization that are now down have the following log entry:
dvyfm 2021-07-15 03:15:37+02:00 The container is being stopped because Galaxy is replacing the machine it’s running on.

So it seems that Galaxy initiates a container replacement, which then fails because the new container cannot be spun up, leaving the application in the UNAVAILABLE state.

I think this is quite an urgent matter and I hope someone will look at this soon!

nauzer · July 18, 2021, 8:03am

The issue above is in the eu-west-1 region.
Tried deploying to the us-east-1 region as well: same result…

rijk · July 18, 2021, 8:34am

Same! On eu-west-1. I tried stopping and restarting, but it can’t seem to start up new containers. It keeps looping and trying to start a new one.

nauzer · July 18, 2021, 9:24am

We’ve changed our DNS to point to a DigitalOcean Droplet (replaced the CNAME record with an A record).
Deployed with mup (GitHub - zodern/meteor-up: Production Quality Meteor Deployment to Anywhere) and it works fine.

@filipenevola Urgent issue.

Going to migrate all our applications to a self-hosted environment again now…

kevanstuart · July 18, 2021, 9:34am

Right now, ap-southeast-2 is working.

rijk · July 18, 2021, 10:03am

Judging from the uptime alerts this issue started about 7.5 hours ago. Not good for Galaxy’s uptime stats! Lucky it’s a Sunday…

nauzer · July 18, 2021, 11:20am

@rijk of course it can always be worse, but this one is quite severe if you ask me… Just wondering why they would stop a container before a new one has finished spinning up in this scenario.

Migrated everything to custom (mup) servers now and waiting for the DNS records to propagate.
Guess we need to work on a more solid way to switch to another hosting platform in the future, but I’m curious what the folks at Meteor/Galaxy are going to say about this one…
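
On the question of stopping a container before its replacement is up: the ordering being asked for is usually called "make before break". The sketch below is illustrative only and says nothing about how Galaxy actually schedules replacements; `start_new`, `is_healthy` and `stop` are hypothetical callbacks standing in for whatever the platform uses.

```python
from typing import Callable


def replace_container(
    old_id: str,
    start_new: Callable[[], str],
    is_healthy: Callable[[str], bool],
    stop: Callable[[str], None],
    max_checks: int = 3,
) -> str:
    """Return the id of the container that serves traffic afterwards."""
    try:
        new_id = start_new()
    except Exception:
        # The replacement could not even be created: keep the old one.
        return old_id

    for _ in range(max_checks):
        if is_healthy(new_id):
            # Only retire the old container once the new one is healthy.
            stop(old_id)
            return new_id

    # The replacement never became healthy: discard it, keep the old one.
    stop(new_id)
    return old_id
```

With this ordering a failed replacement leaves the old container running. It only helps while the old machine can stay up, which may not be the case when the host itself is being retired.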

rijk · July 18, 2021, 11:53am

https://status.meteor.com lists “All Systems Operational”, so their status reporting clearly doesn’t cover everything. This stopped working suddenly, without any change on my side; my last deploy was a month ago.

nauzer · July 18, 2021, 12:19pm

We have applications (running exactly the same bundle/codebase) that are up and running.

The issues seem to start whenever a new container needs to be spun up for any reason, e.g. when you resize a container, deploy a totally new application, or when Galaxy decides to stop your container (“The container is being stopped because Galaxy is replacing the machine it’s running on…”).

So advice:

Don’t scale
Don’t stop/start
Hope your code is solid, because replacing an unhealthy container will likely also fail if an error causes the container to crash.
Don’t deploy to existing applications (I haven’t tested deploying to an application that is still running, but I fear it will have the same result and, for obvious reasons, I’m not willing to test it).
Have a backup ready in the form of a mup deployment, or move to the ap-southeast-2 region (unvalidated, based on the report above by @kevanstuart). Moving regions would require you to delete the existing application if you need to use the primary domain, which means you lose any logging and version history.

rijk · July 18, 2021, 12:39pm

Yeah this is pretty bad. My app has been offline for 10+ hours now.

rijk · July 18, 2021, 12:42pm

By the way, for me it started at almost the exact same time.

2021-07-18 04:31:12+02:00 The container is being stopped because Galaxy is replacing the machine it's running on.
2021-07-18 04:31:14+02:00 Application exited with signal: terminated app
2021-07-18 04:31:25+02:00 The app is unavailable, this is not a Galaxy issue, you should analyze your app, read more here https://galaxy-guide.meteor.com/container-environment.html#unhealthy

filipenevola · July 18, 2021, 3:35pm

Hi, we are investigating. Please follow the status here: Galaxy Status - Investigating issues with new deploys

rijk · July 18, 2021, 5:01pm

Ok, app is finally back online after almost 15 hours.

“We have a self-signed certificate in our Docker Registry that needs to be renewed each 5 years and it expired yesterday night.”

Wow… I will be interested to hear what steps you will take to prevent this type of thing from happening in the future. This all feels really unprofessional, and I’m especially concerned by the fact that this was not picked up until 13 hours later (with the status page showing “All Systems Operational”). This does not give me faith in Galaxy.

filipenevola · July 18, 2021, 6:05pm

Hi @rijk we understand your frustration and we are frustrated as well.

We have more than 400 monitors in Datadog that call us immediately, and we always have engineers on call to act on any issues.

The problem here was that the Registry’s certificates were not included in the monitoring, which is why they started to fail without notifying us.
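
For illustration only (this is not Galaxy's actual monitoring), a certificate expiry check of this kind could look like the Python sketch below. The function names, the 30-day warning window and the `cafile` approach for a self-signed certificate are all assumptions made for the example.

```python
import socket
import ssl
from datetime import datetime, timezone
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[datetime] = None) -> float:
    """Days left before a certificate's notAfter date.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    e.g. "Jul 18 00:00:00 2021 GMT".
    """
    now = now or datetime.now(timezone.utc)
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch
    return (expires - now.timestamp()) / 86400


def should_alert(not_after: str, warn_days: float = 30,
                 now: Optional[datetime] = None) -> bool:
    """True once the certificate is within `warn_days` of expiry, or expired."""
    return days_until_expiry(not_after, now) <= warn_days


def fetch_not_after(host: str, port: int = 443,
                    cafile: Optional[str] = None) -> str:
    """Read notAfter from a live TLS endpoint.

    Assumption: for a self-signed certificate, the certificate itself is
    passed as `cafile` so that verification succeeds and getpeercert()
    returns the parsed fields.
    """
    context = ssl.create_default_context(cafile=cafile)
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `should_alert(fetch_not_after(host))` for each internal endpoint. Note that an already expired certificate makes the handshake in `fetch_not_after` raise an error, which should itself be treated as an alert.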

Also, most of the apps were not affected, and because of that the errors in task creation (ECS containers) and the image builder failures (new deploys) were not enough to trigger the other monitors that we have. It’s normal for tasks and images to fail sometimes, but we are going to reduce these limits as well so the monitors will be more sensitive next time.
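
As a rough sketch of what such a limit means (not the actual Datadog configuration; the class name, window size and threshold are hypothetical), a monitor can alert once the share of failed tasks in a recent window reaches a threshold. Lowering the threshold makes it more sensitive.

```python
from collections import deque


class FailureRateMonitor:
    """Alert when too many of the most recent task outcomes are failures."""

    def __init__(self, window: int = 20, threshold: float = 0.5) -> None:
        self.results = deque(maxlen=window)  # keeps only the last `window` outcomes
        self.threshold = threshold

    def record(self, succeeded: bool) -> bool:
        """Record one task outcome; return True if an alert should fire."""
        self.results.append(succeeded)
        if len(self.results) < self.results.maxlen:
            return False  # not enough data yet to judge the failure rate
        failures = self.results.count(False)
        return failures / len(self.results) >= self.threshold
```

With occasional failures the rate stays under the threshold and nothing fires; when every new task fails, as in this incident, the window fills with failures and the alert triggers.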

We are going to post more details about the issue and the mitigation plan for the future on the status page later: Galaxy Status - Investigating issues with new deploys

We work really hard to make our services reliable for all the apps and we are very proud of the results that we have achieved so far but unfortunately in some rare cases problems happen. There is no justification, we just need to be better, improve our monitors and think about more things that can go wrong.

Sorry!

filipenevola · July 18, 2021, 6:05pm

Just to be clear about support as well: we also have support levels where our customers can trigger a PagerDuty alert immediately. This is available for enterprise clients, for example, but enterprise clients usually run their apps with more than 3 containers (high availability), so in this case they were not affected.