In the past I had good results with something inspired by `socialize:server-presence`, which goes like this:
```js
// THE PING PART
import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';
import { Random } from 'meteor/random';

// unique ID per server per restart
export const serverId = Random.id();

// collection to hold all instances
const Instances = new Mongo.Collection('GatewayInstances');

(async () => {
  await Instances.createIndexAsync({ ping: 1 });
  await Instances.createIndexAsync({ serverId: 1 });
})();

// keep track of which servers are online
const serverPing = () => {
  // there may be other data which you'd like to monitor, so you may choose to add
  // it here (alternatively, query it elsewhere live based on which servers are alive)
  Instances.upsertAsync({ serverId }, { $set: { ping: new Date() } })
    .catch(console.error);
};

// each server must ping regularly
Meteor.setInterval(serverPing, 1000 * 60);
Meteor.startup(serverPing);
// THE ACTION PART
// remove old servers and their sessions, alert about servers being down,
// do something on the servers that are still live, etc.
const checkInstancesAtInterval = async () => {
  const cutoff = new Date();
  cutoff.setMinutes(cutoff.getMinutes() - 2);
  const instancesToRemove = await Instances.find({ ping: { $lt: cutoff } }).fetchAsync();
  const removePromises = instancesToRemove.map((srv) => {
    const removeInstance = Instances.removeAsync({ _id: srv._id });
    // placeholder for your own per-instance cleanup logic
    const doSomethingElse = someOtherAsyncJobBasedOnThisInstance(srv);
    return Promise.all([removeInstance, doSomethingElse]);
  });
  await Promise.all(removePromises);
  // do something else with your live instances here
};

Meteor.setInterval(() => {
  checkInstancesAtInterval().catch(console.error);
}, 1000 * 90); // every 90 seconds

Meteor.startup(() => {
  checkInstancesAtInterval().catch(console.error);
});
```
Of course, adjust the polling window to suit your needs, but if the number of instances is low and the collection is indexed, polling MongoDB even every 10 seconds should be fine if this is a mission-critical service.
Note that here we use a new ID for the server each time it comes online, which may or may not work for you, especially since you envisage needing to identify the actual servers that went offline. You should be able to replace the random ID with a string passed at runtime through an environment variable or similar.
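For example, a minimal sketch of that swap, assuming a hypothetical `SERVER_ID` environment variable set by your deployment:

```js
import { Random } from 'meteor/random';

// SERVER_ID is a hypothetical variable name; use whatever your deployment sets.
// Fall back to a random ID so the heartbeat still works without configuration.
export const serverId = process.env.SERVER_ID || Random.id();
```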
This approach is quite resilient and makes you depend only on MongoDB always being up, which, for obvious reasons, should be the case anyway.
[Addendum 1] A more robust approach would be to use `setTimeout` instead of `setInterval`. This would also remove the need to execute the code once at startup.
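A minimal sketch of that pattern: a single self-rescheduling function, so one call covers both the startup run and the recurring one:

```js
const checkAndReschedule = async () => {
  try {
    await checkInstancesAtInterval();
  } catch (error) {
    console.error(error);
  }
  // reschedule only after the current run has fully completed,
  // so slow runs can never overlap
  Meteor.setTimeout(checkAndReschedule, 1000 * 90);
};

Meteor.startup(checkAndReschedule);
```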
[Addendum 2] It depends on your tolerance for duplicate execution, but if you run more than one Meteor worker on the main server, I suggest either spacing the polling timeouts, perhaps by using prime numbers, or using MongoDB for deduplication through what I call a unique constraint with TTL (Time-To-Live) indexing. The latter is off-topic, but in short it means attempting to insert a uniquely hashed document of the function name and arguments into a collection with a TTL index for auto-expiration, before executing the function.
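A minimal sketch of that deduplication idea, with assumed names (`FunctionLocks`, `runOnce`) and an assumed 60-second TTL:

```js
import crypto from 'crypto';
import { Mongo } from 'meteor/mongo';

// documents expire automatically ~60 seconds after createdAt
const Locks = new Mongo.Collection('FunctionLocks');
(async () => {
  await Locks.createIndexAsync({ createdAt: 1 }, { expireAfterSeconds: 60 });
})();

// execute fn only if no other worker has claimed the same (name, args) recently
const runOnce = async (name, args, fn) => {
  const hash = crypto.createHash('sha256')
    .update(JSON.stringify([name, args]))
    .digest('hex');
  try {
    // _id is unique, so a second worker's insert throws and we skip execution
    await Locks.insertAsync({ _id: hash, createdAt: new Date() });
  } catch (error) {
    return;
  }
  await fn(...args);
};
```

With something like that in place, the interval callback becomes e.g. `runOnce('checkInstances', [], checkInstancesAtInterval)`.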
[Addendum 3] I realised I haven't answered this question: simply subscribe to a publication that returns a cursor on the `Instances` collection.
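A minimal sketch, with an assumed publication name:

```js
// server
Meteor.publish('instances', function () {
  return Instances.find({}, { fields: { serverId: 1, ping: 1 } });
});

// client
Meteor.subscribe('instances');
// Instances.find() on the client now reactively reflects which servers are live
```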