Galaxy - Really concerned about launching a serious app (maintenance reboots)


#1

I’ve been using Galaxy for a while as I have a SaaS in closed beta. Mostly, it’s been a huge headache-saver as far as “set it and forget it,” and uptime has been quite good.

The only pain point for me has been how maintenance periods are handled. There are a few issues, some more serious than others:

  1. An update to a container causes the app to refresh. So if the user is busy doing something, they will experience a page reload unexpectedly. I believe there might be a workaround to this, which is to catch the reload (or “migrate” process as Meteor calls it) with a hook and simply disallow automatic reloads.
  2. The above doesn’t address the server side of the restart, unfortunately. Container restarts are causing me some minor grief with only about 80 beta testers (imagine 1000+ users!). My SaaS relies on REST API calls submitted by AWS Lambda, to report certain processes being completed. When Galaxy restarts my container, there’s an interruption and in some cases users will see stuck processes in the UI because Lambda failed to talk to my Meteor app.
  3. So taking the above two things into consideration, the only thing I can think of doing is to put my SaaS in maintenance mode during the container upgrade/restart period. The problem is, these maintenance periods (for US servers, I might add) are not necessarily during off-peak times. Not so long ago there was a maintenance period that started at around 5pm PDT, and now there’s another coming up on the 21st that will run from 7pm to 11pm PDT. IMO, this is unacceptable. I can’t just shut my app down for three hours, but I also can’t leave it open and accessible because of points #1 and #2 above. Container restarts occurring after midnight would be much better, I think.

I’d like to hear what MDG has to say about these points. I’m seriously considering not using Galaxy when I launch and just doing a self-service setup on AWS EC2. It might be more painful to set up and ensure uptime, but at least I have full control over when restarts happen.


#2

It used to be the case that when you used larger containers (1GB) and at least 3 of them, a high availability option would turn on, mitigating these pain points.

Is it not the case any more?

PS: I know that would be a little too expensive compared to a single small instance.


#3

Managing some VPSes on DigitalOcean or AWS will end up costing you more. I think Galaxy with high availability and some product management policy, such as email and web notifications, will solve your problem. Acting like Facebook or Twitter and staying at 100% uptime is very hard.


#4

I was under the impression that a new container was started with your app and all your connections transferred before the old one was shut down. So ideally that would mean your Lambda call would always have someone to talk to? I’m kinda worried if that’s not the case, as I also have some webhooks that get called by external programs.


#5

Are these hiccups a big enough problem that they could derail the success of your app? Or is this a “we may periodically lose 1 in 100 users because of this” situation?

To me it sounds like something to worry about when you have 20,000+ users and investor backing, not something to worry about during beta, but I’m not sure of the nature of your business or your app’s value proposition. If your value prop is “no hiccups ever” then I can see how migrating makes sense. If it is just a hiccup, I’m not sure I’d spend a week trying to get set up on AWS, plus the ongoing time investment of maintaining AWS.


#6

As I understand it, high-availability simply spins up a container in three different regions, in case AWS has an outage.

This is what I thought too, initially, but apparently if you have 3 containers A, B, and C, you have users spread out among those three containers. So at some point during a maintenance upgrade, container B has to restart. Those users on container B can’t simply be moved over to another container without disrupting their experience, AFAIK.

It’s not a critical fail, but it’s fairly annoying. Basically, users can upload WAV files which triggers a Lambda function (S3 triggered). This function transcodes the file to MP4 AAC, then sends a success message to my Meteor app. So users are reporting issues where files get stuck transcoding forever, because the Lambda func can’t talk to my app, and therefore the transcoding field can’t be set to false.

I’m not sure which is worse: taking the whole site down for 2-3 hours during Galaxy maintenance to prevent users from encountering this, or just letting a portion of users experience this hiccup, then dealing with the aftermath of removing stuck items.

Maybe @zoltan can chime in on this thread.


#7

I agree with you that zero-downtime deployments are important and I’d like to see some improvements in Meteor in terms of not interrupting the user’s experience if a container has to restart.

That being said, I think it’s also important to code your app in such a way that even if a container is restarted (which WILL happen sometime), it is resilient and can keep working.

I’m not sure about your app but maybe the Lambda callback can go to the server so that when the user reconnects they will still get the update…?


#8

The problem is that Lambda is not meant to be a long-running process, and I wouldn’t want to keep it running (charging me $) waiting for a connection. I suppose the worst case is, I could flag something in MongoDB to indicate that webhooks might fail, which would temporarily disable just the features in my app that relied on Lambda. That way, during the maintenance period, they could still use 80-90% of the app at least.


#9

What I meant is have the Lambda function update something via the Server and store that in Mongo. So maybe you have a Requests collection that tracks the status of the requests sent to Lambda. When the Lambda returns a result it can be stored in the Requests collection with state=completed. When the user re-connects you can scan for completed requests and notify them, and then delete the Request entry…
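
To make the idea concrete, here is a minimal sketch of that tracking logic. It is written in Python with an in-memory dict standing in for the Mongo collection, purely to show the state transitions; a Meteor server would do the same thing in JavaScript against a real collection. The names `RequestStore`, `record_completed`, `drain_completed` and the `file_key` field are hypothetical, not anything from the app in question.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RequestStore:
    """In-memory stand-in for a hypothetical `Requests` collection."""
    _docs: Dict[str, dict] = field(default_factory=dict)

    def record_completed(self, file_key: str, user_id: str, result: dict) -> None:
        # Upsert keyed by the uploaded file: this works even if the server
        # never recorded the request starting.
        self._docs[file_key] = {
            "user_id": user_id,
            "state": "completed",
            "result": result,
        }

    def drain_completed(self, user_id: str) -> List[dict]:
        """Return and delete this user's completed requests (run on reconnect)."""
        keys = [
            key for key, doc in self._docs.items()
            if doc["user_id"] == user_id and doc["state"] == "completed"
        ]
        return [{"file_key": key, **self._docs.pop(key)} for key in keys]
```

Because completed entries are deleted as they are handed out, a second reconnect will not notify the user twice about the same file.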


#10

That’s a good idea! The problem is that my Lambda func is triggered by an S3 upload, so Meteor has no idea about it until Lambda makes the REST API call.

Another idea I had that I submitted to MDG was to have Galaxy optionally contact your app via a webhook when maintenance periods are announced. That way, it removes the necessity of someone having to check emails from Galaxy and manually set a maintenance period. With a webhook, your app could just know about maintenance periods automatically.


#11

Even if Meteor has no idea until Lambda makes the REST API call it could work.

Definitely Galaxy should have some webhooks for lifecycle events, as well as an API like other PaaS providers (e.g. Heroku) have.


#12

Actually, after re-reading your post, I’m not sure how that would even work. Here’s the situation (the failure case):

  1. User uploads a WAV audio file.
  2. Lambda gets triggered as the file is uploaded into an S3 bucket.
  3. Lambda executes ffmpeg to transcode the audio file which could take as long as 15 seconds, maybe 20.
  4. During this time, Galaxy restarts the container.
  5. Lambda finishes transcoding and attempts to send a notification via REST API to my Meteor app.
  6. The REST call fails because the container is being restarted.
  7. Lambda function quits.
  8. DB entry in MongoDB remains transcoding: true forever because no message was received from Lambda.

Maybe I’m missing something. How do you propose addressing this?


#13

Sounds like Lambda should use SQS to send a notification that it is complete. When Meteor comes back up it can read from that queue to reveal what completed when it was down.


#14

Ah! I have no experience with SQS, so I’ll look into that. Thanks for the suggestion. :slight_smile:


#15

@hexsprite I gave SQS a shot. Very cool! The only thing is, I don’t want to have my Meteor process repeatedly poll looking for SQS messages in the queue. So what I’m thinking is this:

Success Path

  1. User uploads audio file to S3.
  2. Lambda is triggered, and begins transcode process.
  3. Lambda function sends REST API call to Meteor server.
  4. Meteor server acknowledges and sets database entry accordingly.

Fail Path (Galaxy is rebooting the container)

  1. User uploads audio file to S3.
  2. Lambda is triggered, and begins transcode process.
  3. Lambda function completes, attempts to send REST API call.
  4. REST API fails since container is rebooting.
  5. Lambda sends a message into the SQS queue instead.
  6. Meteor server boots back up.
  7. Meteor.startup on server side checks SQS queue for messages, and handles them accordingly.
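
The two paths above could be sketched roughly as follows. This is Python with the REST call and the queue passed in as plain callables, so the control flow can be read (and tested) without AWS. In a real Lambda, `post_to_app` would presumably wrap an HTTP request and `enqueue` would wrap an SQS send; both names, along with `notify_with_fallback` and `drain_on_startup`, are made up for illustration.

```python
import json
from typing import Callable, List


def notify_with_fallback(
    payload: dict,
    post_to_app: Callable[[dict], None],
    enqueue: Callable[[str], None],
) -> str:
    """Lambda side: try the REST call first, fall back to the queue."""
    try:
        post_to_app(payload)          # success path, steps 3-4
        return "delivered"
    except Exception:
        # Assumption: any failure (timeout, refused connection, 5xx raised
        # by the caller) means the container is unavailable right now.
        enqueue(json.dumps(payload))  # fail path, step 5
        return "queued"


def drain_on_startup(
    receive_all: Callable[[], List[str]],
    handle: Callable[[dict], None],
) -> int:
    """Server side: process every message left behind while the app was down."""
    count = 0
    for body in receive_all():        # fail path, step 7
        handle(json.loads(body))
        count += 1
    return count
```

One caveat with draining only at startup: if a REST call fails for some other reason while the server stays up, the message sits in the queue until the next restart, so an occasional drain while running may still be worth having.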

#16

Some points from the perspective of Meteor (nothing here is Galaxy specific):

You can definitely use the reload mechanism to stop updates from occurring. There’s a reload-on-resume package that does that (only reloads when a Cordova app resumes).

However, I think you have a larger problem here – a container restart (to a new container with the same version of your app, not an app upgrade) should not cause your app to reload. Your user may go through a brief period of “disconnected-ness” but typically in my experience it’s not necessarily noticeable unless you have UI that calls it out to the user.

I have heard of apps that do reload in this scenario, so there’s clearly some way to trigger this behaviour, but it’s not something a vanilla Meteor app does, and it could be something you could resolve in your app?

Reading through the thread below, it sounds like you are on the right track with this one. You can’t really have an architecture that isn’t failsafe if the calling server goes down, because what if you deploy a new version of your app in that time? Even if Galaxy was “perfect” that would still lead to a container restart.


#17

It’s hard to tell exactly what’s going on. All I know is that during a maintenance period where MDG was going to have containers restart after some sort of upgrade or fix, I had a couple users complain about some glitches, which I know were a result of Lambda not being able to communicate with Meteor.

I think using AWS SQS as a fallback might be the way to go.


#18

Hi @ffxsam,

We’ve been working towards minimizing the container restarts for apps on Galaxy. I understand the impact that critical restarts can have on your app, and for that reason we have moved to an after US business hours maintenance window.

We strive to keep maintenance windows infrequent, while ensuring that we can provide the security updates and critical fixes that your app depends on in a timely manner.

We schedule maintenance windows that cover multiple hours, so that we can slowly roll machines and thus minimize the chance that an app’s containers will roll multiple times in a maintenance window.

I’ve added to our roadmap the request to have a webhook-based notification when a maintenance window is scheduled.

We are constantly looking for ways to make our system updates more seamless, so we appreciate your feedback here. I hope this answer gives you an idea of our goals and rationale behind our system updates to Galaxy.


#19

This would be absolutely huge. (maintenance start and maintenance end webhooks)

Definitely. Thanks for weighing in!


#20

Yeah, using something like SQS to get state out of the container and into something persistent seems like the right approach here. Many Meteor apps use Mongo for this purpose; even though it is not exactly suited for it, it works well up to a certain scale.

You might consider using something like SQS not just as the fallback but as the primary means of processing. This is a pretty stereotypical use case for SQS and something it is well suited for. Polling SQS is an OK pattern, and it also supports long polling to reduce the number of calls (http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-long-polling.html).
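
For what it’s worth, a long-polling consumer can be quite small. The sketch below is Python and assumes `sqs` is a boto3 SQS client (e.g. `boto3.client("sqs")`); it is passed in as a parameter so that nothing here needs AWS credentials just to read it. `poll_once` and `handle` are hypothetical names, and a Meteor server would use the equivalent calls from the JavaScript SDK instead.

```python
from typing import Any, Callable


def poll_once(
    sqs: Any,
    queue_url: str,
    handle: Callable[[str], None],
    wait_seconds: int = 20,
) -> int:
    """Make one long-poll request and process whatever it returns."""
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,        # SQS returns at most 10 per call
        WaitTimeSeconds=wait_seconds,  # long poll; 20 seconds is the maximum
    )
    processed = 0
    # The "Messages" key is absent when the wait expires with nothing queued.
    for msg in resp.get("Messages", []):
        handle(msg["Body"])
        # Delete only after the handler returns, so a crash mid-handler lets
        # the message reappear once its visibility timeout expires.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
        processed += 1
    return processed
```

Calling this in a loop means an idle queue costs roughly one request every 20 seconds rather than a tight polling loop.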

In general if you can persist all relevant state outside of the container [1] and re-process things on failure [2] your app will be far more robust, scalable, and responsive.

[1] Mongo, SQS, SQL, ES. Whatever works for your data access patterns and scale.
[2] Lots of ways to do this. In the SQS case you can use timeouts to automatically have things reprocess on failure. Other strategies include having the client send pings while it waits that check for task completion and retry on failure, making everything idempotent and having a repeating job that finds what’s left to process and tries to make as much forward progress as possible, or using an atomic data store (e.g. Mongo findAndModify) to coordinate multiple nodes serializing task execution with heartbeats to detect failure.
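
To illustrate the last strategy in [2], here is a rough sketch of claim-with-heartbeat coordination. It is Python with an in-memory table, where a lock stands in for the atomicity that a single Mongo findAndModify call would provide. `TaskTable` and its methods are hypothetical names, and timestamps are passed in explicitly to keep the example deterministic.

```python
import threading
from typing import Dict, Optional


class TaskTable:
    def __init__(self, lease_seconds: float = 30.0) -> None:
        self._lock = threading.Lock()
        self._tasks: Dict[str, dict] = {}
        self._lease = lease_seconds

    def add(self, task_id: str) -> None:
        with self._lock:
            self._tasks.setdefault(
                task_id, {"owner": None, "heartbeat": 0.0, "done": False}
            )

    def claim(self, worker: str, now: float) -> Optional[str]:
        """Atomically claim one unfinished task that is unowned or whose lease expired."""
        with self._lock:
            for task_id, task in self._tasks.items():
                expired = now - task["heartbeat"] > self._lease
                if not task["done"] and (task["owner"] is None or expired):
                    task["owner"], task["heartbeat"] = worker, now
                    return task_id
        return None

    def heartbeat(self, task_id: str, worker: str, now: float) -> bool:
        """Extend the lease; returns False if another worker has taken over."""
        with self._lock:
            task = self._tasks[task_id]
            if task["owner"] != worker:
                return False
            task["heartbeat"] = now
            return True

    def complete(self, task_id: str, worker: str) -> bool:
        """Mark the task done; only the current owner may do so, and only once."""
        with self._lock:
            task = self._tasks[task_id]
            if task["owner"] != worker or task["done"]:
                return False
            task["done"] = True
            return True
```

If a worker’s container is restarted mid-task, its heartbeats stop, the lease runs out, and another worker’s `claim` picks the task up, which is why the task itself needs to be safe to run more than once.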