How do you achieve zero downtime deployments?

Hi everyone,

I want to ask the Meteor community about how you guys handle your own (production) deployments and if/how you achieve zero downtime when deploying a new version.

Our own setup looks like this:

  • 2 identical Docker containers running our app, which sit behind
  • a Traefik proxy (in its own Docker container) that handles the routing and load balancing for these 2 containers

When we deploy a new version, we simply deploy container #1 first, wait ~10 seconds and then update container #2. The problem is that, due to sticky sessions, the Traefik router/load balancer takes quite some time to realize that e.g. container #1 is down and to re-route traffic to #2.
Also, directly after any container starts, Meteor needs up to ~30 seconds to be “really done with startup”, at which point the CPU load drops from 100% to a more acceptable level.
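
For reference, a simplified sketch of this kind of wiring (assuming Traefik v2 with the Docker provider; router/service names and the /healthz health-check path are placeholders, not our exact config):

services:
  app1:
    image: my-meteor-app:latest
    labels:
      - "traefik.http.routers.app.rule=Host(`example.com`)"
      # Sticky sessions via a cookie
      - "traefik.http.services.app.loadbalancer.sticky.cookie=true"
      # Active health check: a failing container is dropped from rotation
      # after roughly one interval instead of lingering behind the sticky cookie
      - "traefik.http.services.app.loadbalancer.healthcheck.path=/healthz"
      - "traefik.http.services.app.loadbalancer.healthcheck.interval=5s"
      - "traefik.http.services.app.loadbalancer.healthcheck.timeout=2s"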

My question is: do you have any input on what we could improve in this process? I would like to achieve a real zero-downtime deployment, where at some point the containers just “switch over” from the old to the new version and all the user sees is a short page reload. Right now, if a user is online while we are deploying a new version, they see a loading spinner for up to 30-40 seconds, which really is not nice!

Thanks everyone!
Cheers, Patrick


I use AWS EC2 instances directly, with an ELB in front of them.

The deployment script de-registers one target, deploys, re-registers it, waits for it to be healthy, then moves on to the next host. This works great as long as you don’t have breaking schema changes, since you could run a migration on the first server that comes up, and then, after that migration runs, your old servers keep writing incompatible data. There are ways around this, of course (e.g., writing idempotent migrations and backwards-compatible schema changes).
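
A minimal sketch of that per-host cycle with the AWS CLI, assuming an ALB/NLB target group (the ARN, instance ID, host and the deploy step itself are placeholders):

TG_ARN="arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/app/abc123"
INSTANCE_ID="i-0123456789abcdef0"
HOST="app1.internal"

# Take the instance out of rotation and wait for connection draining
aws elbv2 deregister-targets --target-group-arn "$TG_ARN" --targets "Id=$INSTANCE_ID"
aws elbv2 wait target-deregistered --target-group-arn "$TG_ARN" --targets "Id=$INSTANCE_ID"

# Deploy the new bundle and restart the app (placeholder step)
ssh "$HOST" 'sudo systemctl restart my-meteor-app'

# Re-register and wait until the health check passes before moving to the next host
aws elbv2 register-targets --target-group-arn "$TG_ARN" --targets "Id=$INSTANCE_ID"
aws elbv2 wait target-in-service --target-group-arn "$TG_ARN" --targets "Id=$INSTANCE_ID"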

Most of my projects have relatively few servers (the most is 8) so this works well. Any more than about 10 and I’d implement batches (take down 3 at a time for example).

Honestly, though - I’m looking at moving everything to k8s, where this gets handled for you.

In terms of Meteor startup: the server doesn’t respond to requests at all until all the startup hooks have fired, so if you have health checks, they also won’t respond until that point. As long as you follow a deployment procedure that rolls based on health, you’re good.
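
For illustration, a minimal health-check endpoint could look like this (the /healthz path is just an example, not a Meteor default); because webapp only starts serving once startup is complete, the load balancer won’t see it as healthy before the app is actually ready:

import { WebApp } from 'meteor/webapp';

// Answers only once the server is up and serving requests
WebApp.connectHandlers.use('/healthz', (req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('ok');
});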


I’m running Fargate on AWS. Similar to @znewsham, I just deploy a new version in Sourcetree (which I prefer over the git command line) and AWS takes care of everything else: it launches the new build, and at that moment the new and old versions are running in parallel. Once the new one is running fine, all new connections go to it and AWS starts moving users from the old server(s) to the new one (we have rules in place to scale up and down automatically). When there are no users left on the old server, it’s taken down. This usually takes only a couple of minutes.
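
For what it’s worth, the “old and new running in parallel” behaviour corresponds to the rolling-update settings on the ECS service; a hypothetical sketch of what the pipeline effectively does (cluster, service and task-definition names are placeholders):

# maximumPercent=200 lets the new task run alongside the old one;
# minimumHealthyPercent=100 keeps the old task serving until the new one is healthy.
aws ecs update-service \
  --cluster my-cluster \
  --service my-meteor-service \
  --task-definition my-meteor-app:42 \
  --deployment-configuration "maximumPercent=200,minimumHealthyPercent=100"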

Same as above… ELB and health checks.

We also handled this with a notification at the bottom telling users that a new version is available (using Meteor’s feature to detect new builds). They can continue using the app (e.g., finish what they are doing) or choose to click the reload link.

Most of the time, we just opted for a few minutes of a maintenance page and deployed during the wee hours of the morning. Good thing it was rare.


For zero downtime you can simply update the files, and then restart the app - it’s just the node main.js line.

I do this inside a screen session so it’s detached from the shell and running as a process that I can monitor.

So in the simplest case, you just copy the build tarball to the server, decompress it, and then start the process again. Because the app runs in memory, killing and restarting the process is how you do it, and it retains all sessions and never logs anyone out. It means hitting Ctrl+C to kill the process, then just pressing up and enter to start node main.js again. If you use React for the frontend, no one will notice.
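
A rough sketch of that flow (host name, paths and environment variables are just examples):

# On your machine:
scp app.tar.gz deploy@myserver:/opt/app/

# On the server, inside the screen session:
cd /opt/app && tar -xzf app.tar.gz            # unpacks into ./bundle
(cd bundle/programs/server && npm install)    # rebuild server deps for this Node version

# Ctrl+C the old process, then start the new one:
cd bundle
export MONGO_URL='mongodb://localhost:27017/app' ROOT_URL='https://example.com' PORT=3000
node main.js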

It’s an interesting approach. So I guess you don’t use the “hot code push” feature.

We are using it.

By default, when a new build is ready, the app will be notified and that notice will trigger the reload.

We just separated the notice and the trigger and placed a UI in between: the notice displays the UI, and the UI triggers the reload.

I can pinpoint the exact functionality once I’m at my laptop.


Thanks for your replies everyone!
I just did some tests of Meteor’s startup time.
On a fairly powerful test server without any traffic, it took 11 seconds from running docker run of an already downloaded image until the user could actually use the app in the browser.
Is there any way to improve this further?
If I understand your argument correctly, you basically just accept that the client loses its DDP connection during deployment?
@rjdavid, how exactly do you check for a “new version” and update your UI as you explained?

We are using Meteor’s reload package: meteor/packages/reload at devel · meteor/meteor · GitHub

Packages are using this to allow the “migration” of the session to the new version.

We can hook into this functionality to stop the app from reloading into the new version. The simplest way to block the reload is by calling this function:

import { Reload } from 'meteor/reload';

// Returning [false] blocks the automatic reload.
Reload._onMigrate(() => {
  return [false];
});

You can read about the rest of the functionality in the package; a fuller sketch of the notice/UI/trigger split is below.

Relevant discussions here: Allowing users to trigger hot reload when they're ready
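
As a sketch of that notice/UI/trigger split (the variable and function names here are my own, not part of the package):

import { Reload } from 'meteor/reload';
import { ReactiveVar } from 'meteor/reactive-var';

export const newVersionReady = new ReactiveVar(false);
let retryMigrate = null;
let userAccepted = false;

Reload._onMigrate((retry) => {
  if (userAccepted) return [true]; // user agreed: let the reload proceed
  newVersionReady.set(true);       // notice: show the "new version available" UI
  retryMigrate = retry;            // keep the retry function for later
  return [false];                  // block the automatic reload for now
});

// Trigger: call this from the UI's "Reload now" button.
export function reloadToNewVersion() {
  userAccepted = true;
  if (retryMigrate) retryMigrate();
}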


We have multiple servers behind a load balancer. When deploying, server1 is put into maintenance mode at the load-balancer level so that it doesn’t receive any requests. The currently connected clients pick this up and restart the websocket connection in the background, ending up connected to a different server running the same old version of the app. It’s unnoticeable for the user.

Then we deploy the new version to server1 while it is in maintenance mode. After a successful and healthy deployment, the load balancer sets server1 to drain mode, meaning it accepts new connections only if server1 is specifically requested via the sticky-session cookie. We run any required tests against the new production deployment, and if everything looks fine, server1 is set back to ready in the load balancer, meaning it is serving traffic as usual again.

Repeat for server2, server3, server4 etc…
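
As one hypothetical example of the same maint → drain → ready cycle, with HAProxy it can be driven over the runtime API (the backend name bk_app and the socket path are placeholders):

# Take server1 out of rotation entirely (maintenance)
echo "set server bk_app/server1 state maint" | socat stdio /var/run/haproxy.sock

# ...deploy and verify the new version on server1...

# Drain: only requests pinned to server1 by the sticky cookie reach it
echo "set server bk_app/server1 state drain" | socat stdio /var/run/haproxy.sock

# Back into normal rotation once everything checks out
echo "set server bk_app/server1 state ready" | socat stdio /var/run/haproxy.sock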

We’ve also disabled automatic reloading when the client detects new code; instead, a UI prompt is shown asking the user to reload the page.

Seems to work :man_shrugging:


We use a load balancer too; it’s a Google Cloud service. It also has the ability to auto-scale, adding or removing servers as needed.