MUP deploy with downtime - is it possible to get rid of downtime?


#1

When we deploy via MUP we get a short downtime, which is quite annoying. It’s still OK for now because we push at night, but it’s a problem for hot fixes. How can we get rid of the downtime?


#2

Even with rolling updates there is no way to update the site without downtime (AFAIK). You always need to restart the server to load the new code.


#3

As far as I know, Galaxy has better downtime prevention. Does anyone have experience with this?


#4

As you noted, the simplest option is likely to go with Galaxy, as they advertise “Zero downtime coordinated version updates”.
Another way to eliminate downtime would be to add some extra steps and more servers to the build process.
For example, you could set up your Internet-facing server to simply be Nginx acting as a reverse proxy that points to your application server. All traffic coming to your site goes to the Nginx server, which forwards it to your first Meteor server (Meteor 1). You also have a second Meteor server (Meteor 2) that normally sits idle. When you deploy, you deploy to Meteor 2; once the deploy is complete and the updated site is up and running, your deployment script updates the Nginx configuration to point all traffic to Meteor 2, thereby switching your users over with no downtime.

Initial:

         {Internet}
             |
       [Nginx Server]
        |          
   [Meteor 1]  [Meteor 2] <- [Deploy update to Meteor 2]

After deploy:

         {Internet}
             |
       [Nginx Server]
                   | 
   [Meteor 1]  [Meteor 2]

Note: I say “servers” here, but this could also be done with one server using Docker containers.
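
To make the switch step concrete, here is a minimal sketch of the part of a deploy script that picks the idle server and renders an Nginx config pointing at it. The backend names, addresses, port, and the `meteor_app` upstream name are all assumptions for illustration, not something MUP provides.

```python
# Hypothetical backends; names, addresses, and port are assumptions.
BACKENDS = {
    "meteor1": "10.0.0.11:3000",
    "meteor2": "10.0.0.12:3000",
}

# Doubled braces are literal braces for str.format().
# The Upgrade/Connection headers let WebSocket traffic pass through the proxy.
NGINX_TEMPLATE = """\
upstream meteor_app {{
    server {address};
}}

server {{
    listen 80;
    location / {{
        proxy_pass http://meteor_app;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }}
}}
"""


def idle_backend(live: str) -> str:
    """Return the backend that is not currently serving traffic."""
    if live not in BACKENDS:
        raise ValueError(f"unknown live backend: {live!r}")
    others = [name for name in BACKENDS if name != live]
    if len(others) != 1:
        raise ValueError("this sketch expects exactly two backends")
    return others[0]


def render_config(target: str) -> str:
    """Render an Nginx config that sends all traffic to `target`."""
    return NGINX_TEMPLATE.format(address=BACKENDS[target])
```

A deploy script would deploy to `idle_backend(live)`, write the rendered text to the Nginx config file, check it with `nginx -t`, and apply it with `nginx -s reload`. A reload starts new worker processes with the new config and lets the old workers finish their existing connections, so long-lived WebSocket connections may stay on the old server until they close.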


#5

Yeah, the way Galaxy does zero downtime is by completely starting the new container before switching over and shutting down the old one. It also manages directing new connections and reconnections to the right places during this process, to avoid a stampeding herd problem where every client needs to reconnect at once.
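
I don't know how Galaxy implements this internally, but the general idea of avoiding a reconnect stampede can be sketched as draining the old container in small batches spread over a time window, rather than dropping every client at once. The function name, batch size, and window below are hypothetical.

```python
def drain_schedule(client_ids, window_seconds, batch_size):
    """Split clients into batches and spread the batches evenly over a window.

    Returns a list of (delay_seconds, batch) pairs; the first batch is
    disconnected immediately and later batches follow at equal intervals.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = [
        client_ids[i:i + batch_size]
        for i in range(0, len(client_ids), batch_size)
    ]
    if not batches:
        return []
    step = window_seconds / len(batches)
    return [(round(i * step, 3), batch) for i, batch in enumerate(batches)]
```

Each disconnected batch then reconnects through the proxy to the new container, so the new container sees a steady trickle of reconnects instead of one spike.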


#6

Similar to Galaxy, NodeChef does container replacement as well. A wait period is allowed for the new container to boot and, of course, bind to a socket. The load balancer knows when the old container is dead because an event is triggered automatically if there are already connected clients. If there are none, an asynchronous attempt to connect should fail automatically (C++ style socket select and write). It also knows the NET address of the new container before it is even started, as the two containers always bind to a predefined NET address. This works flawlessly for apps deployed on NodeChef.
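
For anyone rolling their own version of the "wait for the new container to bind to a socket" step, a minimal sketch is to poll the known address with a TCP connect until it succeeds or a deadline passes. This is not NodeChef's code; the function name and default timings are assumptions.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0,
                  interval: float = 0.5) -> bool:
    """Return True once a TCP connect to (host, port) succeeds.

    Retries every `interval` seconds and returns False if nothing is
    listening by the time `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the process has bound and is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A successful connect only shows that the port is open, not that the app has finished starting, so a real deploy script would probably follow it with an HTTP request to the app before switching traffic.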


#7

What about PM2-Meteor / PM2?