[Solved] Container stopped and won’t restart

One of our applications got terminated a couple of minutes ago and doesn’t restart by itself.

@filipenevola, hope this is not an effect of the fixes you are implementing regarding the expired certificates. This is causing even more downtime for one of our customers, in addition to the 15 hours of downtime we’ve experienced today for another 5 of our customers.

Hi, in order to replace the certificates we need to replace the machines.

Galaxy does this slowly without causing downtime.

It first transfers the containers to the new machine, and after this Galaxy shuts down the old machines.

More details here [Solved] Galaxy Docker Registry Downtime - #16 by filipenevola

This container was down :no_mouth:

I’m not sure what you mean, but containers are not restarted, they are replaced. The old container will be killed, but another container will be running your app on another machine.

@filipenevola this application was down for 12 minutes as a result of a Galaxy-initiated container replacement. We had to stop/start to get it running again. The start signal (manual!) made it spin up again.
The behaviour was similar to the other applications that went down as part of today’s issues. Just to make it clear: this happened after the problems were reported resolved.

**It makes us wonder whether we need to stay up to monitor downtime on other applications/containers because of the restarts you’ve announced as part of the resolution.**

Logging (TZ CEST, GMT+2)
kme9n  2021-07-18 19:18:54+02:00  The container is being stopped because Galaxy is replacing the machine it’s running on.

kme9n  2021-07-18 19:18:55+02:00  Application exited with signal: terminated

app  2021-07-18 19:19:05+02:00  The app is unavailable, this is not a Galaxy issue, you should analyze your app, read more here Container environment | Galaxy Docs

1806z  2021-07-18 19:31:17+02:00  Application process starting, version 39
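
For reference, the gap between the "terminated" line and the "process starting" line in the log above can be checked with a few lines of Python. This is only a sketch: the two timestamps are copied from the log, and the helper name `downtime_between` is made up for illustration.

```python
from datetime import datetime, timedelta


def downtime_between(stopped_at: str, started_at: str) -> timedelta:
    """Return the time elapsed between two ISO 8601 timestamps (hypothetical helper)."""
    return datetime.fromisoformat(started_at) - datetime.fromisoformat(stopped_at)


# Timestamps copied from the log: old container terminated, new process starting.
downtime = downtime_between("2021-07-18 19:18:55+02:00", "2021-07-18 19:31:17+02:00")

minutes, seconds = divmod(int(downtime.total_seconds()), 60)
print(f"Downtime: {minutes} min {seconds} s")
```

That gives a bit over 12 minutes, which matches the downtime we reported.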

These replacements of machines happen every day without causing any issues.

The problem here is that some machines were still running with old certificates (old machines) so this could have affected you.

Galaxy also recovers and creates new containers again, but because we still had app machines with old certificates, this caused problems.

So the problem was not the app machine replacement but the old certificates.

I hope this makes my explanation clearer.

We understood that, yes, but the problem reported in this ticket, as opposed to the one reported in the other one, is:

The container replacement/spin up happened after the problem was reported resolved…

The machine replacement takes time, to avoid moving many containers from the same app too fast. The underlying issue was indeed solved when we reported it as resolved.

Meaning - any Galaxy-initiated container replacement that happens from the moment it was reported ‘fixed’ should not cause issues - correct?

Sorry for not letting go of this, but if we cannot be sure that a replacement occurring over the coming night won’t cause the same issue again, we need to know and have an alternative…

Yes, as I said above, Galaxy has performed these replacements every day, for many years, without any issues.

This issue was caused by the Registry and not by the mechanism executing the replacement.