Synced cron, job collection or both together?

Let’s say I have an app like Google Analytics. One of my app’s features is to check whether you have installed the tracking code correctly. For that I need to scrape all of the URLs stored in the database (say, with Cheerio) every 10 minutes from the moment a URL is added, check whether the snippet is present, and save the status back to the database.
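For the snippet check itself, once the HTML is fetched (with request, Cheerio, or anything else), a plain string/regex test is often enough. A minimal sketch, assuming a hypothetical `hasTrackingSnippet` helper and a GA-style tracking ID (both names are made up for illustration):

```javascript
// Hypothetical helper: returns true if the fetched HTML contains the
// tracking snippet for the given ID. The name and the regex are
// assumptions, not from the original post.
function hasTrackingSnippet(html, trackingId) {
  if (typeof html !== 'string') return false;
  // Look for the ID inside a <script> tag, to avoid false positives
  // from the ID merely appearing in visible page text.
  var scripts = html.match(/<script[\s\S]*?<\/script>/gi) || [];
  return scripts.some(function (s) {
    return s.indexOf(trackingId) !== -1;
  });
}

var page = '<html><head><script>ga("create","UA-12345-1");</script></head></html>';
console.log(hasTrackingSnippet(page, 'UA-12345-1'));                // true
console.log(hasTrackingSnippet('<p>UA-12345-1</p>', 'UA-12345-1')); // false
```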

What I’ve found out till now:

  • I could use meteor-synced-cron to create a scheduled cron job, run it every x minutes, and scrape all the URLs that haven’t been checked in the last x minutes. But then how do I divide the work among multiple workers, since the whole batch is scraped in a single cron run, on a single instance, and runs on the app itself?

  • I could use meteor-job-collection to schedule the check as a new recurring job each time a URL is added (check that particular URL every ten minutes). In this scenario there would be as many jobs as there are URLs in the db. Is this a problem?

  • Right now I am considering a hybrid: a synced-cron job that runs every x minutes and adds a job to scrape the URLs that were last scraped more than x minutes ago and are not marked as pending in an already running job. The cron would run on the server like:

      SyncedCron.add({
        name: 'URL test',
        schedule: function(parser) {
            return parser.text('every 10 minutes');
        },
        job: function() {
            // Sites not checked in the last 10 minutes and not already queued
            var sites = Sites.find({
                lastCheck: {$lt: moment().subtract({minutes: 10}).toDate()},
                status: {$ne: 'pending'} // $not doesn't take a plain value; use $ne
            }).fetch();
            // A job-collection job must be saved to actually be queued
            new Job(myJobs, 'checkURLS', {sites: sites}).save();
        }
      });
    

And then do the scrape & save in a meteor-job-collection job inside separate Node.js worker apps, so I can divide the jobs among as many workers as I need.
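On the “how to divide the batch” question: if the cron enqueues one job per URL (or per small chunk) instead of one job for the whole batch, each worker just claims the next unclaimed job and the load divides itself. That is what job-collection’s per-job claiming gives you (the claim is an atomic findAndModify on the jobs collection). A self-contained, in-memory sketch of the idea; all names here are illustrative:

```javascript
// In-memory sketch of per-job claiming: each URL becomes a claimable job.
var queue = ['url1', 'url2', 'url3', 'url4', 'url5'].map(function (u) {
  return {url: u, status: 'waiting'};
});

// In job-collection the claim is an atomic findAndModify; here a plain
// shift() is enough because Node is single-threaded.
function claimNext() {
  var job = queue.shift();
  if (job) job.status = 'pending';
  return job || null;
}

// Simulate two workers taking turns claiming whatever is next.
var workers = ['A', 'B'];
var counts = {A: 0, B: 0};
var i = 0;
var job;
while ((job = claimNext()) !== null) {
  counts[workers[i++ % workers.length]] += 1; // this worker "scrapes" job.url
}
console.log(counts); // { A: 3, B: 2 }
```

The point is that no worker ever sees the whole batch; adding a worker just means more claimants on the same queue.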

Any suggestions or similar experience on how to deal with this kind of task?


Too bad this did not receive any replies; I am having the same dilemma now. What are your findings about this? I hope someone can explain the pros and cons, or at least share their best practices regarding this issue.

We are doing a lot of background jobs (e.g. taking screenshots, sending scheduled emails, checking external APIs) for a Meteor application. Here is how we do things:

  • We don’t do any background jobs (things that don’t need a direct response to a user interaction or request) in the Meteor application itself, for three reasons: if a job misbehaves (takes too long, hangs, or raises a fatal error), the main application is not interrupted and keeps working without interference; the main application is not busy with unnecessary computing load; and we can update/scale the workers independently from the main application.
  • For scheduled jobs (e.g. go through all users and update their profile pictures), the worker connects to the database directly and the Meteor application is not involved in this process.
  • For jobs that need to be triggered (e.g. when a user updates their profile, create a new profile picture) we use monq. The main Meteor application adds a new job document and one of the free workers picks it up and processes it.
  • The workers themselves are plain node applications, lightweight, and contain only the code to do the specific job, each wrapped in a Docker container. That way it’s easy to maintain, edit and scale them.
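For reference, monq’s flow is enqueue on the producer side and register/start on the worker side. Below is a self-contained, in-memory stand-in with roughly the same surface; real monq persists job documents in MongoDB (`monq(mongoUri).queue('jobs')` on both sides), but the mock here lets the example run anywhere:

```javascript
// Mock with a monq-like surface: enqueue(name, params) on the producer,
// register({name: handler}) + start() on the worker. This is NOT monq
// itself, just an in-memory illustration of the flow.
function MockQueue() {
  this.jobs = [];
  this.handlers = {};
}
MockQueue.prototype.enqueue = function (name, params) {
  this.jobs.push({name: name, params: params});
};
MockQueue.prototype.register = function (handlers) {
  for (var k in handlers) this.handlers[k] = handlers[k];
};
MockQueue.prototype.start = function () {
  var results = [];
  while (this.jobs.length) {
    var job = this.jobs.shift();
    this.handlers[job.name](job.params, function (err, out) {
      results.push(out);
    });
  }
  return results;
};

var q = new MockQueue();
// Producer side (the Meteor app): add a job document
q.enqueue('profilePicture', {userId: 42});
// Worker side: register a handler and process
q.register({
  profilePicture: function (params, done) {
    done(null, 'picture for user ' + params.userId);
  }
});
console.log(q.start()); // [ 'picture for user 42' ]
```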

Hope this helps. If you have any specific questions – ready to answer them.


May I know if you are using any npm packages to handle these jobs and workers (aside from monq), or are they all written from scratch? How do you monitor the workers’ status? Thanks in advance.

I use vsivsi/meteor-job-collection with a couple of custom node worker scripts. I run the workers via pm2 on the same server; they can later be separated out to scale.

Each worker has a few npm imports. Here’s the list from one of them, partly requirements of vsivsi/meteor-job-collection and partly from my use case:

const FS = require('fs')
const DDP = require('ddp')
const DDPlogin = require('ddp-login')
const Job = require('meteor-job')
const Mongo = require('mongodb')
const FBGraph = require('fbgraph')
const stringify = require('node-stringify')
const _ = require('underscore')
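Wired together, a worker built from those imports roughly follows the shape below. The connection details are placeholders, and `Job.processJobs` / `job.done()` / `job.fail()` are from the vsivsi/meteor-job README; the requires live inside the function so the sketch stays self-contained (it is not run here, since it needs a live Meteor server and the ddp/meteor-job packages installed):

```javascript
// Sketch of a job-collection worker. Host/port and the 'myJobs' /
// 'checkURLS' names are assumptions matching the earlier posts.
function startWorker() {
  var DDP = require('ddp');
  var Job = require('meteor-job');

  var ddp = new DDP({host: '127.0.0.1', port: 3000, use_ejson: true});
  ddp.connect(function (err) {
    if (err) throw err;
    Job.setDDP(ddp);
    Job.processJobs('myJobs', 'checkURLS', function (job, cb) {
      var sites = job.data.sites || [];
      // ...scrape each site, save the status back...
      job.done();  // or job.fail('reason') on error
      cb();        // ready for the next job
    });
  });
}
```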

Thanks for the reply! I will look at pm2, seems like what I need to monitor and handle the workers. :slight_smile:

PM2 has been great for me so far. I also use nvm so I can run my workers on node 6.x and Meteor on node 0.10.x, all under pm2, which appears to be compatible with both node versions.

I haven’t gotten deep into Docker which could serve the same purpose. I think Docker has a steeper learning curve but is the way of the future and more robust.

If you want a very simple solution and you want to keep using Meteor without getting into raw node, you can build a separate horizontally scalable backend app which connects to the same database and runs a simple cron job:

// in an interval:
let random = Math.random()
// findAndLock is a wrapper around MongoDB's internal findAndModify, so
// "find a runnable job" and "mark it locked" happen in one atomic step
let currentJob = Jobs.findAndLock({randomTag: {$lte: random}, /* not pending or locked, and last run more than 10 minutes ago */})
if (!currentJob) {
  // nothing to claim: clear the interval and recursively reset it
}
// do the job, then unlock it; if it fails, unlock it and don't reset the timer
Jobs.update(currentJob._id, {/* the job has finished: unlock it and reset the timer */})

// set a separate interval to clean up perma-locked documents
// set a separate interval to periodically update randomTags on the jobs and define new jobs

Obviously, this isn’t nearly as robust as some of the other solutions, but it does let you scale a simple cron job if you only need a few simple periodic tasks.
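The crucial property of that hypothetical `findAndLock` is that selecting a runnable job and marking it locked happen as one operation (on MongoDB, findAndModify / findOneAndUpdate), so two backend instances can never claim the same job. An in-memory stand-in showing the contract; the field names are made up, and single-threaded JS makes the in-process version trivially atomic:

```javascript
// In-memory stand-in for findAndLock: select one runnable job and lock
// it in the same step, so no second caller can claim the same job.
function findAndLock(jobs, random, now, maxAgeMs) {
  for (var i = 0; i < jobs.length; i++) {
    var j = jobs[i];
    if (j.randomTag <= random &&
        j.status !== 'pending' &&
        now - j.lastRun >= maxAgeMs) {
      j.status = 'pending'; // lock before returning
      return j;
    }
  }
  return null;
}

var now = Date.now();
var jobs = [
  {_id: 1, randomTag: 0.9, status: 'idle', lastRun: now - 11 * 60 * 1000},
  {_id: 2, randomTag: 0.1, status: 'idle', lastRun: now - 11 * 60 * 1000}
];
var a = findAndLock(jobs, 0.5, now, 10 * 60 * 1000);
var b = findAndLock(jobs, 0.5, now, 10 * 60 * 1000);
console.log(a._id, b); // 2 null  (second call finds nothing: job 2 is now locked)
```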

Instead of using intervals, I prefer infinite recursion, because then jobs are executed in sequence (each job triggers the next one, instead of firing on a predefined interval). You can also control how soon to start a job after the previous one finishes. For example, if one job fails, why wait another 100 seconds? Start the next job right away by calling the function again from within itself.
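That recursive pattern reduces to a pure “how long until the next run” decision plus a setTimeout chain. A sketch, with arbitrary interval values and made-up names:

```javascript
// Pure part: decide the delay before the next run. Rerun immediately
// after a failure instead of waiting out the full interval.
function nextDelay(succeeded, intervalMs) {
  return succeeded ? intervalMs : 0;
}

// Chained scheduling: each run schedules the next one from within the
// completion callback, so runs never overlap (unlike setInterval when a
// job occasionally takes longer than the interval).
function runForever(job, intervalMs) {
  job(function (succeeded) {
    setTimeout(function () {
      runForever(job, intervalMs);
    }, nextDelay(succeeded, intervalMs));
  });
}

console.log(nextDelay(true, 100000));  // 100000
console.log(nextDelay(false, 100000)); // 0
```

`runForever` is not invoked here, since it would loop indefinitely; in a worker you would call it once at startup with your scrape function.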