How to handle long server processes

mattkrick · March 24, 2015, 7:42pm

I’ve never had a project where a server process took more than a few milliseconds, and now, after researching patterns to accomplish this, my head is spinning. Even though this topic is covered in other, non-Meteor-specific communities, I’d like to hear from the one I trust.

I’ve got to solve an NP-hard problem & my thought is to let it run for 15 minutes & grab the best solution after the time limit is reached (I’ll run this about 20 times / client / year).

Questions:

Is V8 fast enough? I’m probably going to write the algorithm in JavaScript as well as Julia just to compare for my own sanity, but I’d like to hear from community experience.
If I were to host this on my own, what’s the general pattern? Passing this onto my event loop would hang the server for 15 minutes, right? So what’s the alternative?
Is there a rule of thumb for determining when to use JavaScript vs. switch to a faster language? E.g. cost of switching languages = 10 seconds, Julia is 20% faster than JavaScript so the breakeven point would be 50 seconds.
If I wrote this in Julia, I’d probably use some hosted service like http://forio.com/products/epicenter/. Any experience or thoughts on this approach/vendor?
Any packages that make this easier? I imagine if I call an external server I could just change a Session var in the callback & that would update my template, but I don’t know what I don’t know…
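
The breakeven arithmetic in the third question can be checked with a few lines. This is only a sketch: it assumes "20% faster" means the faster language saves 20% of the JavaScript runtime (the reading that gives the 50 second figure), and `breakevenSeconds` is a made-up helper name.

```javascript
// Hypothetical helper: JavaScript runtime (in seconds) above which switching pays off.
// Assumption: fractionSaved is the share of the JavaScript runtime that the
// faster language saves (0.2 for "20% faster" in the sense used above).
function breakevenSeconds(switchCostSeconds, fractionSaved) {
  if (fractionSaved <= 0 || fractionSaved >= 1) {
    throw new RangeError("fractionSaved must be between 0 and 1 (exclusive)");
  }
  // Switching wins when: switchCost + t * (1 - fractionSaved) < t
  // which rearranges to: t > switchCost / fractionSaved
  return switchCostSeconds / fractionSaved;
}

console.log(breakevenSeconds(10, 0.2)); // 50
```

Under the same assumption, a 15-minute (900 second) run would save about 180 seconds, well past a 10 second switching cost.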

Don’t be bashful, any thoughts or opinions welcome!

jchristman · March 24, 2015, 8:50pm

One strategy (that I have used) is to spawn another process that does the work and updates MongoDB directly. Since Meteor is tailing the oplog, you will still get the changes live if you are querying a certain collection, and you don’t hang the server in the process. This also has the added benefit that it’s not taking up processor time from the single-threaded Node.js process and won’t slow down your Meteor server. This post outlines a strategy for this. I’ve done this with Python scripts that take a long time to run (multiple minutes) and it works beautifully.
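
A minimal plain-Node sketch of the spawn-a-process idea. It differs from the approach described above in two ways, both assumptions made to keep it self-contained: the worker is a tiny inline Node script rather than a Python program, and it reports results on stdout instead of writing to MongoDB. `WORKER_SOURCE`, `parseLastResult` and `runWorker` are made-up names.

```javascript
const { spawn } = require("child_process");

// Stand-in for the long-running script (illustration only).
const WORKER_SOURCE = `
  let best = Infinity;
  for (let i = 0; i < 5; i++) {
    best = Math.min(best, 100 - i * 10);
    console.log(JSON.stringify({ best }));
  }
`;

// Take the last JSON line the worker printed as its final answer.
function parseLastResult(output) {
  const lines = output.trim().split("\n");
  return JSON.parse(lines[lines.length - 1]);
}

// Start the worker in its own OS process and resolve with its final result.
function runWorker() {
  return new Promise((resolve, reject) => {
    const child = spawn(process.execPath, ["-e", WORKER_SOURCE]);
    let output = "";
    child.stdout.on("data", (chunk) => { output += chunk; });
    child.on("error", reject);
    child.on("close", (code) => {
      if (code !== 0) return reject(new Error("worker exited with code " + code));
      resolve(parseLastResult(output));
    });
  });
}

runWorker()
  .then((result) => console.log("final result:", result))
  .catch((err) => console.error("worker failed:", err.message));

// This line runs before the worker finishes: the parent's event loop is free.
console.log("parent is still responsive");
```

In a real setup the worker would write to the database itself, as described above, and the parent would not need to read stdout at all.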

If you really want to do it all in meteor, you could start the function with Meteor.defer(function () {}), which will run the function asynchronously after the current function ends. This shouldn’t hang your meteor process either, though it would likely reduce the responsiveness by a certain amount.

I’m not familiar enough to comment on the Julia strategy.

nathan_muir · March 25, 2015, 2:52am

Currently I’m using celery (python) (backed by rabbitmq) to offload any long-running processing. Results can either be returned from the task, or written directly to mongo for reactive updates. (Using package 3stack:celery)

If I had to do it again though, I’d probably pick: beanstalkd + an npm package + client language & library of choice.

With beanstalkd you have more options for creating & processing jobs: make workers out of Node.js, Go, C, Perl, PHP, etc.

alanning · March 25, 2015, 2:56am

Building on what @nathan_muir outlined, the general solution for executing work in a scalable way is to use a job queue. RabbitMQ and Amazon SQS are two popular, highly-available, durable queues.

Write your queue consumers (workers) in Julia or whatever language is suitable for your use case, put the output data somewhere your Meteor servers can access it, then update the MongoDB to indicate completion.
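
The producer/consumer flow described here can be sketched in a few lines. Everything below is an in-memory stand-in chosen for illustration: a plain array plays the role of RabbitMQ or SQS, a `Map` plays the role of the MongoDB collection, and `submitJob` and `processNextJob` are made-up names.

```javascript
// In-memory stand-ins (assumptions, not real queue or database clients).
const queue = [];
const jobs = new Map();

// Producer side (the Meteor server): record the job, then enqueue it.
function submitJob(id, input) {
  jobs.set(id, { status: "queued", input, output: null });
  queue.push(id);
}

// Consumer side (a worker; with a real queue it could be in any language).
function processNextJob(solve) {
  const id = queue.shift();
  if (id === undefined) return false; // queue is empty, nothing to do
  const job = jobs.get(id);
  job.status = "running";
  job.output = solve(job.input); // put the output where the app can read it
  job.status = "done";           // then flag completion
  return true;
}

submitJob("job-1", [5, 3, 8]);
processNextJob((numbers) => Math.min(...numbers));
console.log(jobs.get("job-1"));
```

The app only ever looks at the job record, so the worker can be swapped for another language or moved to another machine without touching the Meteor side.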

I’m hesitant to even mention these because people may use them because they’re “cool” rather than appropriate and end up creating additional work for themselves, but… You may also want to look into using Hadoop or Storm. Just remember, pragmatism is king.

landm · March 25, 2015, 5:20am

Following on from @alanning in terms of scaling and a queue: I’m in a situation with multiple app instances across multiple servers, with jobs that mostly execute on a schedule but only need to run once. I just needed a quick solution, so I use synced-cron to fire off a job, which runs on whichever instance picks it up. The results of those jobs are written back to Mongo. It was quick, it worked, and it didn’t require much extra setup.

If I were to do it again, or possibly even now, I would look at cluster and set up a microservice to receive requests, provide the analytics services, and return the result. This would allow dedicated analytics instances that could be scaled separately, and/or simpler changes if I were to change the analytics pipeline.

Edit: It looks like job-collection provides pretty much the whole suite of job management and allows you to run your jobs anywhere (meteor client, server or as a separate node.js/system worker).
Also, there is a node-julia package which you could integrate directly into Meteor and just wrapAsync the call to avoid blocking the event loop. If you’re just looking for low-volume, on-demand running of Julia scripts, that’s a pretty straightforward solution.

mattkrick · March 25, 2015, 4:31pm

Thanks for the great list of things to research & I apologize in advance if these questions seem simple (I still have a lot to learn about nodejs & sysadmin, any suggestions on training material are welcome!)

It seems the simplest way would be an async call to node-julia, but this means I’d still have a 15-minute event & when that beast reaches the front of the event loop, it’ll cause a hang, right? There’s no way to have it yield to smaller processes that come in after it & then pick up where it left off like this.unblock()?
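
For work written in JavaScript itself, one way to get the yield-and-resume behaviour asked about here is to split the computation into short slices and hand control back to the event loop between them. The sketch below is plain Node rather than Meteor-specific, and `search` and `runInSlices` are made-up names; it does not apply to a single blocking call into another runtime.

```javascript
// Hypothetical anytime search written as a generator: each yield is a point
// where the work can pause, and the yielded value is the best result so far.
function* search(candidates) {
  let best = Infinity;
  for (const value of candidates) {
    best = Math.min(best, value);
    yield best;
  }
}

// Run the iterator in slices of at most sliceMs, returning to the event loop
// between slices so that other callbacks get a turn.
function runInSlices(iterator, sliceMs, onDone) {
  let best;
  function slice() {
    const deadline = Date.now() + sliceMs;
    do {
      const step = iterator.next();
      if (step.done) return onDone(best);
      best = step.value;
    } while (Date.now() < deadline);
    setImmediate(slice); // yield, then pick up where we left off
  }
  slice();
}

runInSlices(search([7, 4, 9, 2, 6]), 5, (best) => console.log("best:", best));
```

The server stays responsive this way, but the work still competes with requests for the same CPU core, which is the reduced responsiveness mentioned earlier in the thread.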

What’s the trade-off between a hosted solution (like the aforementioned Amazon SQS) and creating a child-process? I’d assume for infrequent processes it’d be more cost-efficient to use SQS?

Is there any way to return progress? E.g. current best result with a user option of stopping the process & accepting the result so far as “good enough.”

landm · March 25, 2015, 11:32pm

I suggest checking out this post about async and event loops by arunoda: Fibers and Event Loop. Meteor uses fibers, and you can use Fibers and/or Futures to mix any sort of synchronous and asynchronous code (including subsequent calls) into your app. This video does a great walkthrough: Chris Mather: Understanding Event Loop.

For returning progress, it depends what you use to execute the job. I have a job that runs as a child-process and utilizes the Mongo C++ Driver to write out progress, status and errors to the database. There is an observeChanges cursor in the app that updates reactively to the user. In a general sense, if you can write out to the database, you can provide feedback to the user. You could also use a message queue to return status messages back to the user if you have your job emit them.
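
The write-progress, observe-changes loop described above can be sketched without a database. `JobStore` and its `onChange` method are made-up stand-ins for a Mongo collection and an observeChanges cursor. They are not Meteor APIs.

```javascript
// Hypothetical stand-in for a collection plus a change observer.
class JobStore {
  constructor() {
    this.docs = new Map();
    this.listeners = [];
  }
  onChange(listener) {
    this.listeners.push(listener);
  }
  update(id, fields) {
    const doc = Object.assign(this.docs.get(id) || {}, fields);
    this.docs.set(id, doc);
    this.listeners.forEach((listener) => listener(id, fields));
  }
}

const store = new JobStore();
const seen = [];

// App side: react to every change the job writes.
store.onChange((id, fields) => seen.push(fields));

// Job side: write out progress, status and results as they happen.
store.update("job-1", { status: "running", progress: 0 });
store.update("job-1", { progress: 50, best: 42 });
store.update("job-1", { status: "done", progress: 100 });

console.log(seen.length, store.docs.get("job-1"));
```

The job never talks to the user directly. It only writes fields, and whatever is watching the store decides how to display them.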

As for the trade-off, it’s definitely a question of how resource-intensive the job is. So for NP-hard and 15 minutes, I’d agree your best bet would be to go with a hosted service or some separate (non-application) instance to execute the jobs.

alanning · March 27, 2015, 6:27pm

Regarding the trade-offs between an external message queue and using something like Meteor.setTimeout / child-process, a lot depends on what your use case is.

Meteor/NodeJS-based option:

Simple
Depends on meteor/nodejs process to stay alive
Workers written in javascript

External message queue:

More complex
Durable, high-availability configurations
Consumers can be any language

For simple cases where the work isn’t critical, something like meteor-synced-cron (which I believe uses Meteor.setTimeout) is great.

For something where you need more of a guarantee that the work will be completed or the computation needs are heavier, external queues offer a better bang-for-your-buck.
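
One mechanism behind that guarantee is acknowledgement: a job stays in the queue until a worker confirms it finished, so a crashed worker's job is handed out again. SQS does this with a visibility timeout and RabbitMQ with consumer acknowledgements. The sketch below is a simplified in-memory illustration of the idea with a made-up `AckQueue` class and an explicit clock. It is not either product's API.

```javascript
// Hypothetical in-memory queue with acknowledgements (illustration only).
class AckQueue {
  constructor(visibilityMs) {
    this.visibilityMs = visibilityMs;
    this.messages = []; // entries look like { id, body, hiddenUntil }
    this.nextId = 1;
  }
  send(body) {
    this.messages.push({ id: this.nextId++, body, hiddenUntil: 0 });
  }
  // Hand out the first visible message and hide it for visibilityMs.
  receive(now) {
    const message = this.messages.find((m) => m.hiddenUntil <= now);
    if (!message) return null;
    message.hiddenUntil = now + this.visibilityMs;
    return { id: message.id, body: message.body };
  }
  // Only an explicit ack removes the message for good.
  ack(id) {
    this.messages = this.messages.filter((m) => m.id !== id);
  }
}

const q = new AckQueue(1000);
q.send("solve instance 7");
const first = q.receive(0);    // worker A takes the job, then crashes (no ack)
const hidden = q.receive(500); // null: the job is still hidden from other workers
const retry = q.receive(1500); // the timeout has passed, worker B gets the same job
q.ack(retry.id);               // worker B finished, so the job is removed
console.log(first.body === retry.body, hidden, q.receive(5000)); // true null null
```

With a child process started from the Meteor server, nothing plays this role by default: if the server or the child dies mid-run, the job is simply gone unless the app tracks and restarts it.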

You can of course implement your own queue in Node.js, but then it’s a question of whether it’s cheaper to reimplement the parts of the other queue systems that you need or to spend the time learning the other systems.

mattkrick · March 27, 2015, 7:51pm

Thanks Adrian! Based on the feedback, it looks like the best pattern (for production) would be sending my inputs to an external service, having that service send a message every time a new “best result” is found, updating my MongoDB with that new result, and then flipping a finished flag when the time limit is hit or the user stops the process. Is that the general idea?
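
The worker side of that pattern might look roughly like the sketch below. All names (`runJob`, `shouldStop`, `onBest`) are made up, the candidate list stands in for whatever the real solver explores, and the clock is injectable so the time limit can be tried without waiting 15 minutes.

```javascript
// Hypothetical anytime loop: keep improving until the time limit or a stop request.
function runJob({ candidates, timeLimitMs, shouldStop, onBest, now = Date.now }) {
  const start = now();
  let best = null;
  for (const candidate of candidates) {
    if (shouldStop() || now() - start >= timeLimitMs) break;
    if (best === null || candidate.cost < best.cost) {
      best = candidate;
      onBest(best); // e.g. send a message or update the job document
    }
  }
  return { best, finished: true }; // flip the finished flag on the way out
}

const updates = [];
const result = runJob({
  candidates: [{ cost: 9 }, { cost: 4 }, { cost: 6 }, { cost: 1 }],
  timeLimitMs: 15 * 60 * 1000,
  shouldStop: () => updates.length >= 2, // pretend the user accepts "good enough"
  onBest: (best) => updates.push(best.cost),
});
console.log(updates, result);
```

Here the run stops after two improvements, so the cheaper candidate at the end is never examined. That is the price of letting the user stop early.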

Something you brought up that I never thought about was the guarantee of completion. Is that referring to a case where, if my site didn’t scale, a user would be denied opening a new child-process?

Thanks again to all for taking the time to guide me through these baby steps.