Galaxy - Really concerned about launching a serious app (maintenance reboots)

nim · September 22, 2016, 6:19am

Statuspage lets you do this already! Go to http://status.meteor.com/, click “Subscribe to updates” and select the webhook option. (TIL!)

ffxsam · September 22, 2016, 7:00am

Hmm… lots of different approaches for sure. Right now I’m having the Lambda ffmpeg transcode process fire off based on an S3 upload. I could probably change that so as soon as the Slingshot upload is done, the callback calls a Meteor method which invokes the Lambda function directly.

I’m still not sure what the polling process would look like exactly. I don’t want the user to wait any more than they have to while their audio track says “transcoding…”. I don’t know if it makes sense to use SQS’s receiveMessage and poll every 2-3 seconds? I still have quite a bit of reading up to do on SQS.

You, sir, are the hero of the week! Thank you!

lucfranken · September 22, 2016, 7:32am

Here you can find the event sources: http://docs.aws.amazon.com/lambda/latest/dg/use-cases.html

As you can see, SQS is not included, but you can achieve much the same with DynamoDB or Kinesis: both let you create a stream. It doesn’t really matter which one you prefer.

So what happens in general (between brackets is optional):

IN: Upload to S3 -> (DynamoDB stream) -> Lambda -> FFMPEG
OUT: FFMPEG -> Lambda -> (DynamoDB stream) -> Lambda calls to Meteor -> (DynamoDB delete)

So the steps if you put a queue in between:

1. Upload to S3, just fine
2. Insert into DynamoDB stream
3. Lambda (What you already have, call FFMPEG)
4. Same Lambda add or update DynamoDB to set processed = true ( http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.Lambda.html )
5. That change in the DynamoDB stream creates a new call to Lambda
6. The Lambda calls Meteor
7. When that call has succeeded or timed out remove it from the stream.

See also: https://medium.com/@PaulDJohnston/how-to-do-queues-in-aws-lambda-f66028cc7f13#.au3edkrcd

To reduce the load on your Meteor app, you can have step 6 send a list of all processed items at once. That reduces the number of calls needed.
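
A minimal sketch of the batching idea, assuming the stream is configured to include new images and that each item stores its S3 key under a hypothetical `s3Key` attribute next to the `processed` flag from the steps above. The function only collects the keys from one stream batch; the actual call to Meteor (`notifyMeteor` in the comment) is a placeholder, not a real API.

```javascript
// Sketch only: `s3Key` and `notifyMeteor` are assumptions for illustration.
// `processed` is the flag set by the transcoding Lambda.
function collectProcessedKeys(event) {
  const keys = [];
  for (const record of event.Records || []) {
    // Stream records carry an eventName of INSERT, MODIFY or REMOVE.
    if (record.eventName === 'REMOVE') continue;
    // NewImage is only present if the stream view type includes new images.
    const image = record.dynamodb && record.dynamodb.NewImage;
    if (!image) continue;
    // Attributes arrive in DynamoDB's typed JSON, e.g. { S: '...' } or { BOOL: true }.
    const processed = Boolean(image.processed) && image.processed.BOOL === true;
    const key = image.s3Key && image.s3Key.S;
    if (processed && key) keys.push(key);
  }
  return keys;
}

// Hypothetical use inside the Lambda handler:
// exports.handler = async (event) => {
//   const keys = collectProcessedKeys(event);
//   if (keys.length > 0) await notifyMeteor(keys); // one call for the whole batch
// };
```

Because Lambda receives stream records in batches, one invocation can report several finished tracks in a single request.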

ffxsam · September 22, 2016, 7:54am

Hmm, this feels over-engineered to me. It would be so much easier for me to just rely on status.meteor.com webhooks to receive notice of a maintenance start date/time, and at that time, disable the audio file upload that uses Lambda (the rest of the app will be accessible as usual) to remove any chance of failure.
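
A rough sketch of that idea. The payload shape here (an `incident` object with `scheduled_for` and `scheduled_until` timestamps) is an assumption and should be checked against the Statuspage webhook documentation; the function names and the five-minute safety margin are made up for illustration.

```javascript
// Assumed payload shape: { incident: { scheduled_for, scheduled_until } }.
// Verify the real field names against the Statuspage docs before relying on this.
function maintenanceWindow(payload) {
  const incident = payload && payload.incident;
  if (!incident || !incident.scheduled_for) return null;
  const start = new Date(incident.scheduled_for);
  if (isNaN(start.getTime())) return null;
  const end = incident.scheduled_until ? new Date(incident.scheduled_until) : null;
  return { start, end };
}

// Returns false from shortly before the window opens until shortly after it closes.
// With no known end time, uploads stay disabled once the window has started.
function uploadsAllowed(win, now = new Date(), marginMs = 5 * 60 * 1000) {
  if (!win) return true;
  const closesAt = win.start.getTime() - marginMs;
  const reopensAt = win.end ? win.end.getTime() + marginMs : Infinity;
  const t = now.getTime();
  return t < closesAt || t > reopensAt;
}
```

The webhook endpoint would store the result of `maintenanceWindow(req.body)`, and the upload UI would check `uploadsAllowed(...)` before starting a Slingshot upload.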

nim · September 22, 2016, 8:17am

One really simple approach here would be to add retry logic to your REST API call from Lambda. That should take care of the case when a container restarts, as the next request will get routed to a fresh container.
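
Something along these lines would do it. This is a generic sketch, not tied to any particular HTTP library: `operation` is whatever function makes the REST call, and the attempt count and delays are arbitrary defaults.

```javascript
// Retries an async operation with exponential backoff between attempts.
async function withRetry(operation, { attempts = 4, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < attempts - 1) {
        // With the defaults this waits 500 ms, then 1 s, then 2 s.
        const delayMs = baseDelayMs * 2 ** attempt;
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}

// Hypothetical use in the Lambda function (postToMeteor is a placeholder):
// await withRetry(() => postToMeteor({ newKey, seconds, size }));
```

The operation has to reject on a failed request (connection error or non-2xx status) for the retry to kick in.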

robfallows · September 22, 2016, 9:36am

Is there any reason you couldn’t have your lambda function update your MongoDB? Then it would be up to your Meteor app to observe the change and act on it whenever it was ready.

marklynch · September 22, 2016, 11:44am

So just to get this straight in my head: when I deploy a new version of my app, Galaxy

1. Starts a new container and waits for it to be available.
2. Switches the connected clients to it.
3. Stops the old container.

Meaning that during a deploy my webhooks should be reachable the entire time?

Is it feasible to use the same process for maintenance updates?

ffxsam · September 22, 2016, 3:52pm

That’s possible, but with the caveat of having the Lambda container running longer, which of course would cost more. And even under normal circumstances, sometimes these processes run 10-15 seconds on a 1.5GB container! (ffmpeg can take a while)

Now there’s something I hadn’t considered. Especially since all the API endpoint is doing is this:

postRoutes.route(/* api url */, function (params, req, res) {
  const { newKey, seconds, size } = req.body;
  const track = Track.findOne({ s3Key: newKey });

  if (track) {
    const options = {
      duration: seconds,
      size,
      transcoding: false,
    };

    if (size === -1 && seconds === -1) {
      options.transcodingFailed = true;
    }

    track.set(options);
    track.save();
  }

  res.end('OK');
});

The Lambda func could just as well use an npm mongo library to update this directly in the DB. I like it!
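
A sketch of what that might look like, mirroring the endpoint above. The helper only builds the filter and update document; the collection name `tracks` in the usage comment is an assumption, and the update bypasses whatever the `Track` model does on `save()` (validation, hooks), so treat it as a starting point.

```javascript
// Builds the MongoDB filter and update equivalent to the REST endpoint's logic.
function buildTrackUpdate({ newKey, seconds, size }) {
  const fields = {
    duration: seconds,
    size,
    transcoding: false,
  };

  // Same failure convention as the endpoint: -1 for both values means failure.
  if (size === -1 && seconds === -1) {
    fields.transcodingFailed = true;
  }

  return { filter: { s3Key: newKey }, update: { $set: fields } };
}

// Hypothetical use in the Lambda function with the official `mongodb` driver:
// const { filter, update } = buildTrackUpdate({ newKey, seconds, size });
// await db.collection('tracks').updateOne(filter, update);
```

The Meteor app then picks up the change through its normal reactive queries, with no webhook involved.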

ffxsam · September 22, 2016, 4:50pm

BTW, this is the thing you miss out on being a solo freelance dev. No other developers in the office to bounce ideas off of!

alextondello · September 26, 2016, 5:13pm

Wow, this is an interesting topic! Let me see if I got this straight (I also don’t have other developers in the office to ask for help).

My application uses https://atmospherejs.com/tmeasday/presence to check if both users are present in a chatroom. When both users leave the room, the server ends the conversation for good. There’s no way to go back to the same chatroom.
I was using AWS EC2 and doing deployments when there were no users connected. I had to do it this way because a deploy would disconnect users, and since both had then left the room, the server would end their conversation.

I decided to switch to Galaxy thinking that this problem would not happen. But I’m guessing I was wrong. Even Galaxy is going to disconnect both users at some point, right?
I’m still going to have problems with tmeasday:presence package unless I rewrite my application another way that is more resilient to server restarts / clients re-connections.

Is that right? Or is there another way to prevent Galaxy from disconnecting users when deploying my app?

lucfranken · September 27, 2016, 7:11am

You can test how it works for your app in a very simple way: have 2 containers online and then recycle one of them. Or just deploy a new version. You will see it upgrades the containers one after another. In your case you should not have many issues, unless that package doesn’t work well across multiple servers. A normal publication won’t give issues.

The case of @ffxsam is a different one.

tmeasday · September 29, 2016, 1:07am

It does! The issue isn’t that there’s no webhook available, it’s that a long running webhook (lambda process) can get killed before it’s finished because the container is killed to make way for a new container.

tmeasday · September 29, 2016, 1:08am

Because of the way rolling updates work they’ll only ever be disconnected momentarily (as they switch to a new container), so my advice would be to just introduce a timeout into your logic, and you should be good. But @lucfranken’s suggestion to test it out is definitely advised!
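
A minimal sketch of such a timeout, independent of any presence package: when a room becomes empty, wait for a grace period before ending the conversation, and cancel if anyone reconnects in the meantime. All names here are hypothetical, and the right grace period depends on how long reconnects take in practice.

```javascript
// Delays "end the conversation" until a room has stayed empty for graceMs.
// setTimer/clearTimer are injectable so the logic can be tested without waiting.
function createRoomCloser({ graceMs, endConversation, setTimer = setTimeout, clearTimer = clearTimeout }) {
  const pending = new Map(); // roomId -> timer handle

  return {
    // Call when presence reports that the last user has left the room.
    roomEmpty(roomId) {
      if (pending.has(roomId)) return; // a countdown is already running
      const handle = setTimer(() => {
        pending.delete(roomId);
        endConversation(roomId);
      }, graceMs);
      pending.set(roomId, handle);
    },

    // Call when any user (re)appears in the room: cancels the countdown.
    userPresent(roomId) {
      if (!pending.has(roomId)) return;
      clearTimer(pending.get(roomId));
      pending.delete(roomId);
    },
  };
}
```

The timers live in one server process, so with several containers the pending state would need to be shared (for example as a timestamp in the database) rather than held in memory.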

hwillson · September 29, 2016, 1:11pm

Wow, this should definitely be in the Galaxy Guide!

ffxsam · September 29, 2016, 6:59pm

Here’s the documentation on what the webhook POSTs look like: https://help.statuspage.io/knowledge_base/topics/webhook-notifications

ffxsam · September 29, 2016, 7:03pm

One more question: is there any way to prevent the page from being reloaded upon a code push or maintenance reboot? If I have at least two containers, will that address the issue? I started working on a hack:

import { Reload } from 'meteor/reload';
import store from '/imports/redux/store';
import { reloadPending } from '/imports/redux/actions';

Reload._onMigrate(function (retry) {
  store.dispatch(reloadPending());
  return false;
});

But as mentioned above, this doesn’t stop the server from reloading, which may cause some out-of-sync issues with the client (very bad).

looshi · September 29, 2016, 7:32pm

That seems like the simplest thing to do: have Lambda update the DB directly.

tmeasday · September 30, 2016, 2:30am

These are very different situations.

When you push a new version of your code, the user will at some stage get booted off their old container and reconnect to a fresh new container. The techniques and issue you mention above are definitely the route forward with trying to control this. As I mentioned, take a look at the reload-on-resume package.

We put a bunch of effort into this as part of the Verso app, which just pops a dialog saying “an update is available, click here to get it” when it connects to a new server. We actually didn’t have a lot of problems with outdated pubs/methods causing problems (I guess they don’t tend to change that much), but of course the potential is there.

When the container infrastructure restarts, the user will lose the old container and reconnect to a new container with the same app version #. This is (sort-of but much quicker) equivalent to losing your network connection and reconnecting to a new server. The app should not reload, and it’s typically fairly seamless. The publications all need to re-establish but usually that’s pretty fast in my experience.

ffxsam · September 30, 2016, 3:01am

The reload-on-resume package says: (emphasis mine)

Add it to your Meteor app with meteor add mdg:reload-on-resume. This package changes the behavior of Meteor’s hot code push feature on mobile devices only.

Normally, your app will update on the user’s device as soon as you push a new version. This process is always smooth in a desktop web browser, but might momentarily interrupt the user’s experience if they are on a mobile device.

Yet we’ve just talked about how pushing an app update will cause a reload, regardless of whether it’s desktop or mobile. Or is the README incorrect, and reload-on-resume will prevent auto reload on desktop and mobile both? I suppose I could just try it.

tmeasday · September 30, 2016, 3:13am

I didn’t mean to imply reload-on-resume would solve all your problems; just that it was a good spot to look to learn more about how to do this stuff.