Mongo write NOT atomic?

I have a problem. I have a collection of jobs that run in the background.
To prevent two Meteor processes from grabbing the same job at the same time, a process takes a lock on the document. However, I sometimes see two processes take the same job at the same time, even though the update operation should be atomic.

Can anyone see why this operation is NOT atomic? From what I have read, all updates in Mongo should be atomic.

const lockTaken = Jobs.update(
  { _id: someId, locked: { $ne: true } },
  {
    $set: { locked: true, handledBy: getProcessName() },
    $push: { handledByDebug: { process: getProcessName(), timestamp: new Date() } },
  }
);

if (lockTaken) {
  // this process holds the lock and can run the job
}

NOTE: I have 10 servers running jobs, which poll for available jobs; this makes it likely that two processes take a job at the same time. But during a day, there are only a few times that I see two processes take the same job.

Sorry, I have no update for you…

This looks pretty clever, and if this is indeed atomic in the sense we think, then you’re right: no two processes should be able to take on another’s task.

The only thing that might disturb this: are there multiple MongoDB instances that could answer this query, or is it just one?

Another idea would be to look into other jobs packages - for example msavin:sjobs or vsivsi/meteor-job-collection (a persistent and reactive job queue for Meteor, supporting distributed workers that can run anywhere) - and see how they solve this?

I’ve had this problem with my msavin:sjobs package - the problem is, if you query very quickly after a write operation, the write might not have propagated by the time your next query returns. Here’s an idea of how you can fix it:

  1. keep a list of the recent document ID(s) that you have written to

  2. add _id: { $nin: recentIds } to your db commands
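A minimal sketch of those two steps, assuming a simple in-process buffer (the buffer size, function names, and selector shape here are my assumptions, not code from msavin:sjobs):

```javascript
// Keep a short-lived buffer of IDs this process has just written to,
// and exclude them from the next poll so a stale read can't re-match them.
const recentIds = [];
const MAX_RECENT = 50; // assumed buffer size

function rememberId(id) {
  recentIds.push(id);
  if (recentIds.length > MAX_RECENT) recentIds.shift();
}

// Build the poll selector with the exclusion applied,
// e.g. for Jobs.findOne(buildJobSelector()).
function buildJobSelector() {
  return { locked: { $ne: true }, _id: { $nin: [...recentIds] } };
}

rememberId("job-1");
rememberId("job-2");
// buildJobSelector() now skips job-1 and job-2 on the next poll
```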

I’ve used this exact pattern for video transcoding, with great success. The only difference I see between mine and yours is that I used findOneAndUpdate (my code was outside of Meteor). I suspect some Meteor funkiness is causing you trouble.
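For illustration, here is a sketch of the poster's lock rewritten around findOneAndUpdate, split into pure builder functions so the shape is visible; someId and getProcessName() come from the original post, everything else is my assumption:

```javascript
// Filter matches only an unlocked document with the given _id.
function lockFilter(someId) {
  return { _id: someId, locked: { $ne: true } };
}

// Modifier marks the document locked and records who took it.
function lockUpdate(processName) {
  return {
    $set: { locked: true, handledBy: processName },
    $push: { handledByDebug: { process: processName, timestamp: new Date() } },
  };
}

// Usage against the raw Node driver collection: filter and update happen as
// one server-side operation, and res.value is null if another process
// already holds the lock.
// const res = await Jobs.rawCollection().findOneAndUpdate(
//   lockFilter(someId), lockUpdate(getProcessName()));
// if (res.value) { /* we won the lock */ }
```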

Collection.update will return the number of documents matched, not necessarily modified - I suspect you might have better luck if you provide a callback and wait for it, since the callback will tell you the number of documents modified.
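A sketch of that suggestion, wrapped so it can be shown without a live database (the acquireLock wrapper and its names are my assumptions; in Meteor, Collection.update's callback receives an error and the number of affected documents):

```javascript
// acquireLock wraps any Meteor-style update(selector, modifier, callback)
// and treats the lock as taken only when the callback reports exactly
// one affected document.
function acquireLock(update, jobId, processName, onLocked) {
  update(
    { _id: jobId, locked: { $ne: true } },
    { $set: { locked: true, handledBy: processName } },
    (error, numberAffected) => {
      if (!error && numberAffected === 1) onLocked();
    }
  );
}

// Demonstration with a fake update that reports one affected document:
let locked = false;
acquireLock((sel, mod, cb) => cb(null, 1), "job-1", "proc-A", () => { locked = true; });
// In real code this would be something like:
// acquireLock(Jobs.update.bind(Jobs), someId, getProcessName(), runJob);
```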


Have you tried the findAndModify function? E.g.: Jobs.rawCollection().findAndModify()

Not sure if it will help, but it might be worth trying MongoDB transactions.

Transactions are meant for multiple docs. For a single doc, findOneAndUpdate() or findAndModify() should suffice.


Ahh yeah, I was thinking the transaction might delay reads until all the shards were updated, but it looks like a MongoDB read concern might help?
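For reference, a sketch of what that would look like: the Node driver's read methods accept a readConcern option, and level "majority" returns only data acknowledged by a majority of the replica set (the selector and the Jobs.rawCollection() usage are assumed from earlier in the thread):

```javascript
// Poll with majority read concern so the read reflects
// majority-acknowledged writes.
const selector = { locked: { $ne: true } };
const options = { readConcern: { level: "majority" } };

// await Jobs.rawCollection().findOne(selector, options);
```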

This approach will fail if the database implements only document-level locks for update and delete operations.
This is what is happening:
Both servers hit the database server at the same time, and both requests are processed concurrently. The update operation first queries the database. At this point, two threads have both read the document that matched locked: {$ne: true}. One thread locks and updates; the other thread then proceeds to lock and update as well. Note that in this scenario, the second thread, which had to wait to update the document, has no knowledge that the criteria by which the document was initially selected have been invalidated.
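The interleaving described above can be simulated with plain in-memory objects (this is only an illustration of the claimed race, not actual MongoDB behavior):

```javascript
// Both "threads" evaluate the selector before either one writes,
// so both believe they won the lock.
const doc = { _id: "job-1", locked: false };

const matches = (d) => d.locked !== true;           // evaluate locked: {$ne: true}
const lock = (d, who) => { d.locked = true; d.handledBy = who; };

const aSawUnlocked = matches(doc);  // thread A reads
const bSawUnlocked = matches(doc);  // thread B reads before A writes
if (aSawUnlocked) lock(doc, "A");
if (bSawUnlocked) lock(doc, "B");   // B overwrites A: both think they hold the lock
```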

For global locking systems implemented across multiple processes on top of any database, you have to be certain you understand how the database handles concurrency, or you are in for some surprises.
You can also report this issue to the MongoDB team for their thoughts on how they handle document-level concurrency on updates.

Here is the MongoDB documentation about atomicity, transactions, and concurrency controls

You can throttle each of your 10 servers so they poll for jobs at a small delay from one another.
Let’s say Server 1 kicks in at 0 ms, Server 2 at 10 ms, Server 3 at 20 ms, and so on.
If you can allocate numbers to jobs, you can also allocate numbers from 1 to 10 and have Server 1 search for Jobs 1 first, and only if none are found search for other jobs. Something like “my jobs first”.
A simple algorithm can decrease the likelihood of 2 servers hitting the same job. Once in a while you can query for the count of remaining jobs, and if the queue is growing you can tune the throttling.
I am sure there are many ways of efficiently hunting for jobs without overlapping.
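The staggered schedule is simple to sketch; the server index would come from each server's own configuration, and all names and intervals here are my assumptions:

```javascript
// Each server starts its poll loop with a fixed per-server offset so no two
// servers query at the same instant.
const STAGGER_MS = 10;          // gap between consecutive servers
const POLL_INTERVAL_MS = 1000;  // assumed polling period

function startOffsetMs(serverIndex) {
  return (serverIndex - 1) * STAGGER_MS;  // Server 1 -> 0 ms, Server 2 -> 10 ms, ...
}

// Each server would then schedule its loop like:
// setTimeout(() => setInterval(pollForJobs, POLL_INTERVAL_MS),
//            startOffsetMs(SERVER_INDEX));
```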
If nothing works, you can move those jobs to a Redis DB and improve the R/W latency, which is basically your present issue in Mongo.