Every morning our app refreshes ~30k+ documents (~30MB of JSON), and this number can continue to grow. The sync:
1. Gets a lot of data for each doc from API requests
2. Does a huge bulkUpsert using rawCollection
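Roughly this shape, simplified (field and collection names here are just illustrative):

```js
// build one big bulk op against the raw collection and fire it once
const bulk = MyDocs.rawCollection().initializeUnorderedBulkOp();
for (const doc of fetchedDocs) {
  bulk.find({ externalId: doc.externalId }).upsert().updateOne({ $set: doc });
}
await bulk.execute(); // all ~30k ops in a single execute
```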
This has recently started causing a Mongo stack overflow error when attempting the bulkUpsert. My suspicion is that some users have pages open to the app with inefficient subscriptions to some of these 30k+ docs, and that these things happening in parallel are crashing the server.
My exact error is:
```
/var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/utils.js:691
throw error;
RangeError: Maximum call stack size exceeded
at ServerSessionPool.acquire (/var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/sessions.js:623:12)
at ClientSession.get serverSession [as serverSession] (/var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/sessions.js:113:47)
at /var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/sessions.js:148:19
at maybePromise (/var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/utils.js:685:3)
at ClientSession.endSession (/var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/sessions.js:130:12)
at Cursor._endSession (/var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/cursor.js:392:15)
at done (/var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/cursor.js:448:16)
at /var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/cursor.js:536:11
at /var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/utils.js:688:9
at /var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/operations/execute_operation.js:82:7
at maybePromise (/var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/utils.js:685:3)
at executeOperation (/var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/operations/execute_operation.js:34:10)
at Cursor._initializeCursor (/var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/cursor.js:534:7)
at Cursor._initializeCursor (/var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/cursor.js:186:11)
at Object.callback (/var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/cursor.js:439:14)
at processWaitQueue (/var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/sdam/topology.js:1049:21)
at NativeTopology.selectServer (/var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/sdam/topology.js:449:5)
at executeWithServerSelection (/var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/operations/execute_operation.js:131:12)
at /var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/operations/execute_operation.js:70:9
at maybePromise (/var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/utils.js:685:3)
at executeOperation (/var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/operations/execute_operation.js:34:10)
at Cursor._initializeCursor (/var/app/current/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/cursor.js:534:7)
```
Just wondering if there is a potential quick fix here, e.g. beefing up the Meteor server and/or the MongoDB server so they can handle the above situation (if that is in fact the cause)?
Otherwise, I'm thinking of delegating this morning sync to a separate server made specifically for running SyncedCron jobs. Any insights or recommendations on the easiest way to do this? I'd rather not rewrite any code; all I really need is my SyncedCron module and a server that points to the same remote Mongo instance (there's just one). I don't need a full copy of the app serving client code, of course, but I still want to use a Meteor server since I'm relying on proprietary Meteor code, collections, etc.
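What I have in mind is a stripped-down Meteor server on its own box, started with the same MONGO_URL as the main app, that does nothing but register the cron job. Roughly (the job name and runMorningSync are placeholders for my real sync code):

```js
// server/main.js on the cron-only instance
// started with the same MONGO_URL=mongodb://... so it writes to the same database
import { Meteor } from 'meteor/meteor';
import { runMorningSync } from './morning-sync'; // placeholder for the existing sync code

// SyncedCron comes from the percolate:synced-cron package (exposed as a global)
SyncedCron.add({
  name: 'Morning document refresh',
  schedule(parser) {
    return parser.text('at 5:00 am'); // later.js text schedule
  },
  job() {
    runMorningSync();
  },
});

Meteor.startup(() => SyncedCron.start());
```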
Otherwise, I'm thinking of delegating this morning sync to a separate server made specifically for running SyncedCron jobs
My thoughts exactly; I think this will definitely help alleviate the problem. As to how you should do it, I've not done this before, but maybe look into serverless, as it can help reduce costs? Maybe something like this, but fetching and inserting into the DB instead. But I think there's still some investigation to do before rushing to solutions.
You've mentioned that there are subscriptions draining your app and causing it to overload.
My suspicion is that some users have pages open to the app with inefficient subscriptions to some of these 30k+ docs, and that these things happening in parallel are crashing the server.
Did you try cutting down the number of documents that get fetched/updated to see if it runs smoothly? Which subscriptions specifically? Honestly, if real-time isn't crucial to your business model, replacing pub/sub with methods is the way to go. Many Meteor applications have to do this in order to scale, as pub/sub is an expensive feature that shouldn't be used when it's not needed.
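For example, a page that currently subscribes to a big list could fetch it once through a method instead, something like this (collection, method, and field names are made up):

```js
import { Meteor } from 'meteor/meteor';
import { check } from 'meteor/check';

// server: return the data on demand instead of keeping a live observer open
Meteor.methods({
  'reports.fetch'(limit) {
    check(limit, Number);
    return Reports.find(
      {},
      { fields: { title: 1, status: 1 }, limit } // only what the page needs
    ).fetch();
  },
});

// client: one round trip, nothing left running on the server afterwards
Meteor.call('reports.fetch', 100, (err, reports) => {
  if (!err) {
    // render `reports` however the page normally would
  }
});
```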
I agree with you wholeheartedly that investigation should be done before rushing. There's a lot to refactor around the inefficiency of these subscriptions, so I was kinda hoping for a quick temporary fix.
Moving away from pub/sub to methods makes sense; pub/sub isn't really needed in lots of places in the app, but I inherited this codebase and it's tough to refactor everything as the only dev. But… one thing at a time.
I love your idea about serverless architecture to reduce costs. I’ll have to look into it. We’ve adopted an AWS stack and I’d like to keep whatever solution I go with in there.
ddp-health-check seems cool, although I’d have to play around with it to see if it’ll work well with my problem.
I don't mind setting up another Meteor instance just to handle these syncs. The challenge is that I only want a fraction of my whole codebase (no client code)… but I'd need some Collection classes, helpers, etc. to avoid any sort of refactoring.
Make a Node script (or any language you want) outside of the app that handles your data functionality, and don't use upsert; the performance is not good at all.
For Mongo (and really any database engine) you should use insert and then delete the old rows using an indexed timestamp in order to work with large datasets of over 1 million records (even above 100k you won't want to be doing updates).
Updates (and upserts are updates) require two queries for every command: first a lookup to find the row, then the update itself. To avoid this, insert the new data and delete the old. On the display layer, add a condition so it only selects the newest, and voilà, you have a working solution good for as many rows as you need (into the billions).
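A rough sketch of that pattern with the Node driver, assuming db is a connected database handle and fetchedDocs is the freshly fetched array (collection and field names are made up):

```js
// one-time setup: an index on the timestamp so the cleanup stays cheap
// db.collection('reports').createIndex({ syncedAt: 1 })

const syncedAt = new Date();

// 1. insert the new generation instead of updating rows in place
await db.collection('reports').insertMany(
  fetchedDocs.map((doc) => ({ ...doc, syncedAt }))
);

// 2. delete the previous generation via the indexed timestamp
await db.collection('reports').deleteMany({ syncedAt: { $lt: syncedAt } });

// display layer: only read the newest generation
const latest = await db.collection('reports').find({ syncedAt }).toArray();
```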
Are you doing these 30k operations in a single bulk insert? A trivial fix might be to batch these into, say, 10k-item batches (see the sketch at the end of this post). I'd be a little surprised if the observers caused a stack overflow; in general the observer is either iterative or asynchronously recursive, which I don't think can cause a stack overflow since it isn't true recursion (this might be different when combined with fibers, though). If it were the observer, I'd expect long-running Meteor servers to hit these stack overflow errors more often.
It's worth noting that even just arr.push(...[lots of items]) can trigger a stack overflow, so anything that behaves like that (which bulk operations might) could cause it.
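For instance, this alone is enough to throw the same RangeError in Node once the array is big enough (the exact threshold depends on the engine):

```js
const target = [];
const huge = new Array(500000).fill({ x: 1 });

// target.push(...huge); // RangeError: Maximum call stack size exceeded
for (const item of huge) target.push(item); // safe: one element per call
```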
Similarly, you may find you need this even without the stack overflow. I'm not 100% sure how the mongo driver handles bulk operations internally, but if everything gets serialized into a single mongo command you'll be limited by the 16MB BSON document limit.
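The batching could look something like this, keeping the upsert semantics but issuing one bulkWrite per chunk (batch size, collection, and field names are placeholders):

```js
const BATCH_SIZE = 10000; // tune this; drop to 1k if it's still struggling
const raw = MyDocs.rawCollection();

for (let i = 0; i < fetchedDocs.length; i += BATCH_SIZE) {
  const ops = fetchedDocs.slice(i, i + BATCH_SIZE).map((doc) => ({
    updateOne: {
      filter: { externalId: doc.externalId },
      update: { $set: doc },
      upsert: true,
    },
  }));
  // one driver call per batch instead of a single 30k-op bulk
  await raw.bulkWrite(ops, { ordered: false });
}
```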
I put all 30k into one bulk operation and execute it with a single bulk.execute. Mongo says it automatically chunks into groups of 1k docs, though.
That being said, I do have some code set aside that manually chunks and runs separate bulk.execute calls; I haven't tested it though.
My theory is that it's something to do with the observers and the execute running at the same time.
Please look through the stack trace again. I put some logging in some of those files and verified that it was happening continuously (as opposed to just once for the single bulk.execute call). That's why I'm leaning toward it being due to the observer.
From personal experience, it can't handle over 1k in a bulk; even with a 16-core overclocked gaming rig with 32GB RAM on FreeBSD it was still going to loads of over 10.
But if you just do insert and then remove, it doesn't even get warm; it barely notices.
Nope, nothing crashes, the load is just insane. My update is across 10 fields and I match on one indexed key, yet it still takes my load up to over 10. I need to update a growing collection of 1.4m records every 33 seconds from a multithreaded process, so batching into 1000s was the sweet spot after a lot of testing.
I use the bulk.update syntax directly though and don't do the find; it's slightly different, and I'm not sure if that affects it. With bulk.update you can add each update to a bulk object and then call execute at the end of the iteration. Mongo's internals default to splitting operations into groups of 1k, so that's probably why it seems to handle it fastest when I set it to 1k batches. Then it runs smoothly alongside the rest of the process that's doing the acquisition for the next batch. The end result is the load is now 0.1, a huge difference, and it means we can run everything on one server, including the actual app and API.
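For reference, the usual shape of that "queue updates on a bulk object, execute every 1k" flow is roughly this (names are illustrative; my exact bulk.update syntax differs slightly):

```js
let bulk = collection.initializeUnorderedBulkOp();
let pending = 0;

for (const row of rows) {
  // queue one update per row, matching on the single indexed key
  bulk.find({ key: row.key }).update({ $set: row.fields });
  if (++pending === 1000) {
    await bulk.execute(); // flush every 1k queued ops
    bulk = collection.initializeUnorderedBulkOp();
    pending = 0;
  }
}
if (pending > 0) await bulk.execute(); // flush the remainder
```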
In Linux your load average is displayed in top; just run top at a shell and it's on the first line of the output. Here's an article with more info. I monitor mainly in top, or at the DNS level, or with RRD in Cacti. My app is Meteor and React and has an API which powers a plug-in that gets around 120k daily uniques.
Do you actually shell into your Prod server and run top? What do you mean by monitoring at the DNS level?
FWIW, I monitored top while hitting the most expensive subscription and saw that it was actually the CPU% for the node process that spiked, from around 0.7% to 100+% (across 2 cores, though). Memory didn't really exceed 15%. So… it seems like CPU could actually be the issue.