Hi folks. Working with a few users who wrote in to support, I discovered a longstanding infrastructure problem that, under certain conditions, can make container startup very slow. (Specifically, it can take a very long time to pull your app’s Docker image to the app machine.) I just scheduled a maintenance window for Monday to roll out new app machines without this issue. I think the results will be pleasing.
While this issue isn’t new, the new unhealthy container replacement feature made it worse. Before that feature, container startup could be slow but would at least eventually finish. With the feature, Galaxy kills any container that stays unhealthy for too long, including during startup, so it could kill a container that was still pulling its image before the container ever started. I’ve raised the timeout for killing unhealthy, still-starting containers to a value high enough that this hopefully won’t happen to anyone before we fix the underlying issue on Monday. (The timeout for unhealthy containers that have successfully started is unaffected.)
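
As a rough illustration of how the two timeouts relate, here is a minimal Python sketch. It is not Galaxy's actual code: the names `Container`, `should_kill`, and both timeout constants are hypothetical, and the numeric values are placeholders chosen only to make the example concrete.

```python
from dataclasses import dataclass


@dataclass
class Container:
    started: bool             # True once the image is pulled and the app is running
    unhealthy_seconds: float  # how long the container has been continuously unhealthy


# Placeholder values for illustration only, not Galaxy's real settings.
STARTUP_UNHEALTHY_TIMEOUT = 3600.0  # generous limit while the image is still being pulled
RUNNING_UNHEALTHY_TIMEOUT = 300.0   # stricter limit once the container has started


def should_kill(container: Container) -> bool:
    """Return True if the container has been unhealthy longer than its limit."""
    if container.started:
        limit = RUNNING_UNHEALTHY_TIMEOUT
    else:
        limit = STARTUP_UNHEALTHY_TIMEOUT
    return container.unhealthy_seconds > limit


# A container that has spent 20 minutes pulling its image is left alone,
# while a started container that has been unhealthy for 10 minutes is replaced.
slow_pull = Container(started=False, unhealthy_seconds=20 * 60)
stuck_app = Container(started=True, unhealthy_seconds=10 * 60)
```

With these placeholder values, `should_kill(slow_pull)` is false and `should_kill(stuck_app)` is true. Raising only the startup limit gives slow image pulls room to finish without changing how quickly unhealthy running containers are replaced.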