Work-Arounds for Galaxy Meteor Cloud Planned Downtime

Galaxy/Meteor Cloud has historically had an amazing uptime record, so Meteor’s recent announcement of up to thirty minutes of downtime in each region comes as quite a shock to our user base, who are admittedly spoiled by years of uptime.

I wanted to start a new thread to discuss work-arounds to this and future maintenance downtime.

Is it as simple as rerouting your DNS to a Galaxy/Meteor Cloud region that is currently up, since not all regions go through maintenance downtime at the same time?

How long would an update like this take to propagate? Would we just need to update the DNS a few hours before any scheduled maintenance?

The three Galaxy/Meteor Cloud regions appear to go through maintenance in this order:

  • AWS ap-southeast-2 in Sydney, Australia
  • AWS eu-west-1 in Dublin, Ireland
  • AWS us-east-1 in Virginia, USA

So for example, if you use us-east-1 to host your app, the best move would be to get up and running on eu-west-1, since it’s the closest, then switch the DNS to it after eu-west-1’s maintenance completes. You’d have five hours for the DNS to update. Then you’d switch back after the maintenance is over in us-east-1, and finally take down the eu-west-1 deployment.
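
If you go this route, a small script can confirm the cutover actually propagated before the us-east-1 maintenance starts. The sketch below is just an illustration, nothing official: it polls DNS until the app’s hostname resolves to the same addresses as the target region’s ingress endpoint. The hostnames are placeholders, so substitute your own domain and whatever regional endpoint Galaxy gives you, and keep in mind it only shows what *your* resolver sees, not global propagation.

```python
# Rough sketch: poll DNS until the app's hostname points at the new region.
# APP_HOST and TARGET_INGRESS are placeholders, not real Galaxy hostnames.
import socket
import time

APP_HOST = "myapp.example.com"                            # your app's custom domain
TARGET_INGRESS = "eu-west-1.galaxy-ingress.example.com"   # assumed regional endpoint

def resolve_ips(hostname):
    """Return the set of IPv4 addresses the local resolver sees for hostname."""
    infos = socket.getaddrinfo(hostname, 443, socket.AF_INET, socket.SOCK_STREAM)
    return {info[4][0] for info in infos}

def wait_for_switch(poll_seconds=60):
    """Block until APP_HOST resolves to the target region's addresses."""
    while True:
        app_ips = resolve_ips(APP_HOST)
        target_ips = resolve_ips(TARGET_INGRESS)
        if app_ips and app_ips <= target_ips:
            print(f"{APP_HOST} now points at {TARGET_INGRESS}: {sorted(app_ips)}")
            return
        print(f"Still propagating: {sorted(app_ips)} vs {sorted(target_ips)}")
        time.sleep(poll_seconds)

if __name__ == "__main__":
    wait_for_switch()
```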

Does this sound like a plan? Does anyone else have any ideas for keeping it within Galaxy/Meteor Cloud rather than leaving the platform? Standing up outside of Galaxy/Meteor Cloud is always an option too.


Hi, speaking from the inside: we also have other options, which we have been analyzing over the past few weeks, and we are going to keep working on this to find a solution soon. The goal is always to require zero effort from our clients.

To give one example, we could create a new load balancer and then point the ingress at it, isolating the change from the apps altogether.

But if we also create a new cluster of machines for the apps, duplicating the app clusters could cause unexpected behavior in some apps, and that could be a problem.

We could also do this in two parts: first the proxy layer, and later the clusters.

I believe we will find a solution for this case, where we have essential changes in the different layers. All the other maintenance we do happens without any problems because it is applied in a rolling fashion.

We do maintenance many times each month without any notice to clients; check the past two years without any planned downtime :wink: So I believe we are going to figure this out soon as well.

Now, speaking from the outside: the DNS plan would work, but latency and DNS propagation times might not be ideal.

Well, we’d love a zero-downtime solution. Like Meteor, in the 7+ years of running our app we’ve never had a single planned downtime either, so we would love not to start now.

Planned downtime is so very Web 1.0.

By latency, do you mean the application latency of having US users connect to and use the app in Ireland? If so, that’s fine, especially for only thirty minutes; I’ll take slight latency over an outage. We host only on us-east-1 and never hear any complaints from our overseas users, of whom we have many, so that type of latency isn’t an issue.

Would DNS propagation times be an issue if there’s a five-hour window to switch to eu-west-1? That should be more than enough, right?

I agree! This is what we are going to try to avoid.

Thirty minutes was the window for the update; even in the worst case, we were expecting apps to be down for a maximum of 1 or 2 minutes, depending on how fast AWS applied the change. But we are going to find a way to avoid this as well.

And yes, I’m talking about latency for users, but since you already have global users accessing us-east-1, that’s probably not a problem in your case. Remember your database as well, though: it would be far away, or you would need to move it too.

DNS is a bit of a mystery: every provider says to wait 24 hours, but in reality it usually propagates very quickly. Still, four hours seems like a good window.
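
Propagation speed mostly comes down to the TTL served on the record. A rough sketch of what I mean, using the third-party dnspython package (the hostname is a placeholder): check the TTL currently being served, and if it is still hours long, lower it at your DNS provider a day before the window so resolvers pick up the switch quickly.

```python
# Rough sketch with dnspython (pip install dnspython): inspect the served TTL.
import dns.resolver

APP_HOST = "myapp.example.com"  # placeholder: your app's custom domain

answer = dns.resolver.resolve(APP_HOST, "A")
print(f"{APP_HOST} -> {[rr.address for rr in answer]} (TTL {answer.rrset.ttl}s)")
# A TTL of ~300 seconds makes a four- or five-hour switchover window far more than enough.
```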

We are going to try to avoid that as much as possible and keep our record of no downtime.

So this announcement was an opportunity for us to improve, communicate better, and even create a better plan for this change :wink: