High Availability - Galaxy vs DIY?

wildhart · April 9, 2019, 11:30pm

I’m considering a new app which would be quite low demand and not much of a money spinner but would require high availability when it’s needed in the event of an emergency.

It seems that the easy but very expensive option is Galaxy + mLab/Atlas.

However, what success have people had with DIY high-availability deployments, where the servers and db are each in multiple data centres?

I’ve seen this by @satya, but that’s one data centre so strictly speaking not high-availability.

Any tips?

znewsham · April 10, 2019, 3:31am

We use AWS for both mongo and meteor. We’ve got DNS set up with failover routing, based on a health check, from our primary datacenter to our secondary. Each datacenter (region) has an elastic load balancer (replicated for redundancy) which routes requests to servers in different availability zones (AZs) within the same region.

We run a 7 node mongo replica set: 3 nodes in each of the two regions we support, and a 7th in a third region. This way we can tolerate an entire region failing - this has never happened, though we have had individual AZs go down.
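To illustrate why the 3 + 3 + 1 layout tolerates the loss of a region: a replica set can only elect a primary while a strict majority of its voting members can reach each other. The sketch below is just that arithmetic, assuming all 7 members vote; the region names are placeholders, not the actual regions used.

```python
from typing import Dict


def majority(total_voters: int) -> int:
    """Smallest number of votes that is strictly more than half."""
    return total_voters // 2 + 1


def survives_region_loss(voters_per_region: Dict[str, int]) -> Dict[str, bool]:
    """For each region, report whether the members left after losing
    that whole region still form a voting majority."""
    total = sum(voters_per_region.values())
    needed = majority(total)
    return {
        region: (total - count) >= needed
        for region, count in voters_per_region.items()
    }


# Layout described in the post (region names are assumed placeholders).
layout = {"region-a": 3, "region-b": 3, "region-c": 1}
print(majority(sum(layout.values())))
print(survives_region_loss(layout))

# For comparison: 6 members split 3 + 3 across only two regions.
two_regions = {"region-a": 3, "region-b": 3}
print(survives_region_loss(two_regions))
```

With 7 voters the majority is 4, so losing either 3-node region still leaves 4 members. With only two regions at 3 + 3, losing either one leaves 3 of 6, which is not a majority; that is what the 7th node in a third region is for.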

It’s dramatically cheaper than Galaxy + Atlas, and more secure, with no external traffic between DB servers or between app and DB servers. We use VPC tunnelling between the regions.

The only concern would be in the case that all the app servers failed in one datacenter, and the database servers all failed in the other - in this (extremely unlikely) scenario, we’d have insane latency on all our DB calls.

DNS failover isn’t particularly fast either (due to DNS caching). Setting the TTL low helps, but it’s not perfect. We mitigated this by also having direct access URLs which bypass the DNS failover and go straight to the application servers in each region. We’re primarily a b2b service, so if we did have an entire datacenter go down, we could communicate these backup URLs to our clients.
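As a rough way to reason about how slow DNS failover can be: the delay is roughly the time for the health check to notice the outage plus the time for cached records to expire. The sketch below is only that back-of-envelope estimate; the interval, threshold and TTL values are made-up assumptions, not this setup's actual settings, and resolvers that ignore TTLs can make the real figure worse.

```python
def worst_case_failover_seconds(
    check_interval_s: int, failure_threshold: int, ttl_s: int
) -> int:
    """Estimate how long a client may keep hitting the failed region.

    detection: consecutive failed health checks before the record flips
    ttl_s:     a client may have cached the old record just before the flip
    """
    detection = check_interval_s * failure_threshold
    return detection + ttl_s


# Assumed example values: 30 s checks, 3 failures to trip, two TTL choices.
for ttl in (300, 60):
    print(ttl, worst_case_failover_seconds(30, 3, ttl))
```

Under these assumed numbers, dropping the TTL from 300 s to 60 s shortens the estimate from 390 s to 150 s, but the detection time stays the same, which is one reason a low TTL alone doesn't make failover instant.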

wildhart · April 10, 2019, 4:08am

Thanks for this @znewsham. Can I ask approximately how much this all costs and which instance types you use?

Obviously getting that all set up was time consuming, but what about maintenance and redeployment, how easy is that?

znewsham · April 10, 2019, 4:52am

We have four different but related apps that run on different servers. We don’t need anywhere close to the CPU capacity we have, but we like the redundancy. We run 3X t2.small and 9X t2.micro: basically 2 servers for each app in the primary region and 1X each in the secondary. Should a failover occur, we can spin up extra servers quite quickly.

In theory those 4 instances in the secondary region sit idle, so we could probably drop them to the smaller size, but the peace of mind of having them available is worth the cost to us.

Each server has a 16GB EBS root.

We also run 7X t2.mediums with 50GB data drives for mongo.

We use 2X ELB (one per region; we share them between the apps).

The total for these pieces is around $290 per month with reserved instance pricing.

The equivalent price of the app servers in Galaxy would be around $900/month.

The total for an equivalent Atlas deployment using M40s (not sure how they compare to t2.mediums) would be around $1700/month.

In addition to this we pay for data costs for the cross region replication of the DB - this is extremely hard to estimate.
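As a rough check of the figures above, the snippet below adds up the instance counts and compares the quoted monthly totals. It uses only the numbers stated in this post and deliberately leaves out the cross-region data transfer, since no figure is given for it.

```python
# Instance counts from the post: 4 apps, 2 servers each in the primary
# region and 1 each in the secondary.
apps = 4
app_servers = apps * 2 + apps * 1
assert app_servers == 3 + 9  # 3X t2.small + 9X t2.micro

# Quoted monthly prices in USD (approximate, as stated in the post).
diy_total = 290       # app servers, mongo servers, ELBs, reserved pricing
galaxy_apps = 900     # equivalent app servers on Galaxy
atlas_m40 = 1700      # equivalent Atlas deployment

managed_total = galaxy_apps + atlas_m40
print(managed_total)                         # managed total per month
print(managed_total - diy_total)             # monthly difference
print(round(managed_total / diy_total, 1))   # cost ratio
```

On these numbers the managed route comes to about $2600/month, roughly nine times the DIY total, before data transfer and the time spent on setup and upgrades are counted.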

Redeployment is pretty trivial. The mongo servers basically sit untouched, so there’s no effort there until we need to upgrade (which we have done twice); that’s laborious, but not difficult.

The deployment for the app servers is OK. We have a deployment server in each region, and we use MUP with a custom plugin that ensures the load balancers and target groups are configured correctly. It does a rolling deployment in the primary region and a regular deployment in the secondary region. It’s very much fire and forget: I kick off the deployments, go get coffee, then come back and check they’re finished.

We don’t have autoscaling, but with the cached MUP build and our plugin, we can quickly spin up new instances to register with the load balancer.
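For anyone unfamiliar with the idea, a rolling deployment takes servers out of the load balancer one at a time, updates each, and puts it back, so some servers are always serving traffic. The sketch below shows the general pattern only; it is not the MUP plugin described above, and `FakeLoadBalancer`, `rolling_deploy` and the server names are all hypothetical, with an in-memory stand-in instead of real AWS calls.

```python
from typing import Callable, List, Set


class FakeLoadBalancer:
    """In-memory stand-in for a load balancer target group (hypothetical)."""

    def __init__(self, servers: List[str]) -> None:
        self.in_service: Set[str] = set(servers)

    def deregister(self, server: str) -> None:
        self.in_service.discard(server)

    def register(self, server: str) -> None:
        self.in_service.add(server)


def rolling_deploy(
    lb: FakeLoadBalancer,
    servers: List[str],
    deploy: Callable[[str], None],
    min_in_service: int = 1,
) -> List[str]:
    """Deploy to one server at a time, keeping min_in_service behind the LB."""
    deployed: List[str] = []
    for server in servers:
        remaining = lb.in_service - {server}
        if len(remaining) < min_in_service:
            raise RuntimeError(f"removing {server} would leave too few servers")
        lb.deregister(server)   # stop sending traffic to this server
        deploy(server)          # if this raises, the server stays out and we stop
        lb.register(server)     # put it back before touching the next one
        deployed.append(server)
    return deployed


# Example run with two placeholder servers for one app.
lb = FakeLoadBalancer(["app1-a", "app1-b"])
print(rolling_deploy(lb, ["app1-a", "app1-b"], deploy=lambda s: None))
```

A real version would also wait for the load balancer's health check to pass after re-registering each server before moving on to the next one.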

evolross · March 29, 2020, 3:03am

@znewsham

This is awesome! Could you write a tutorial on this!?