Oplog tailing too far behind not helping

jkatzen · March 31, 2015, 12:53am

Hello - we are having an issue with our application where CPU usage keeps spiking out of control, causing our app to return 499s or 502s until it recovers.

At first we found that we were pushing many operations into the oplog and thought that was bogging down our system, so we lowered METEOR_OPLOG_TOO_FAR_BEHIND from 2000 to 250, and finally all the way down to 100. As this went on, the spikes eased a little, yet they were still happening pretty frequently over the course of a day.
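For readers unfamiliar with the setting, here is a rough, self-contained model of what a "too far behind" threshold does: once the backlog of unprocessed oplog entries grows past the threshold, the backlog is dropped and observers are told to re-poll instead of replaying every entry. This is a simplified sketch, not Meteor's actual code; `OplogBacklog`, `onEntry`, and `onSkipped` are hypothetical names used only for illustration.

```javascript
// Simplified model of an oplog backlog with a "too far behind" threshold.
// NOT Meteor's implementation: OplogBacklog, onEntry and onSkipped are
// hypothetical names. The default of 2000 is the starting value from the post.
class OplogBacklog {
  constructor({ tooFarBehind = 2000, onEntry, onSkipped }) {
    this.tooFarBehind = tooFarBehind;
    this.onEntry = onEntry;     // called for each entry that is replayed
    this.onSkipped = onSkipped; // called when the backlog is dropped
    this.queue = [];
  }

  // Called for every oplog entry that arrives from the database.
  push(entry) {
    this.queue.push(entry);
  }

  // Drain the backlog. If it has grown past the threshold, drop it and
  // signal that observers should re-poll instead of replaying each entry.
  drain() {
    while (this.queue.length > 0) {
      if (this.queue.length > this.tooFarBehind) {
        const skipped = this.queue.length;
        this.queue = [];
        this.onSkipped(skipped);
        continue;
      }
      this.onEntry(this.queue.shift());
    }
  }
}
```

In this model, lowering the threshold only changes how early the backlog is dropped; the re-poll that follows still has to re-run the observed queries.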

Next, we found an update operation that was huge, literally eating up all of our oplog, so we batched it into smaller chunked updates that now also diff properly, which drastically reduced the size of the update. Again - fixing this seemed to help somewhat with the server spiking, yet it was still happening much too frequently.
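The chunk-and-diff approach described above might look roughly like the sketch below. It is not the poster's code: the helper names (`diffToSet`, `chunk`, `planUpdates`) and the default chunk size of 500 are assumptions, and the diff only compares top-level fields and does not handle removed fields.

```javascript
// Hypothetical helpers; names and the default chunk size are assumptions.

// Build a MongoDB-style $set modifier containing only the top-level fields
// that changed. JSON.stringify is a crude equality check (it is sensitive to
// key order), and removed fields are ignored; both are simplifications.
function diffToSet(oldDoc, newDoc) {
  const changed = {};
  for (const key of Object.keys(newDoc)) {
    if (key === "_id") continue;
    if (JSON.stringify(oldDoc[key]) !== JSON.stringify(newDoc[key])) {
      changed[key] = newDoc[key];
    }
  }
  return Object.keys(changed).length > 0 ? { $set: changed } : null;
}

// Split a list into chunks of at most `size` items.
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Turn old/new document pairs into batches of minimal update operations.
// Documents with no changes produce no operation at all.
function planUpdates(pairs, chunkSize = 500) {
  const ops = [];
  for (const { oldDoc, newDoc } of pairs) {
    const modifier = diffToSet(oldDoc, newDoc);
    if (modifier) ops.push({ selector: { _id: newDoc._id }, modifier });
  }
  return chunk(ops, chunkSize);
}
```

Each batch returned by `planUpdates` can then be applied with a pause between batches, so a single job never floods the oplog with one giant write.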

At this point, we have turned the oplog off entirely to see if we still experience the same spiking. This is less than ideal, as we would love to keep the reactivity of oplog tailing, but if it keeps crashing our app servers, it will not be worth it.

The fact that we had to turn down oplog tailing from 2000 to 250 and again to 100 without much improvement seems indicative that something else is going on within our application. After searching for over a week with enhancements being pushed and made almost daily, we are running out of ideas on what kind of things could be causing this. We reduced the operation sizes, optimized queries, and offloaded application code to no avail. If anyone else has any insight into this, that would be much appreciated.

Thanks!

arunoda · March 31, 2015, 1:26am

The only possible fix is to move this high-velocity data out of the MongoDB database you use for Meteor.
When you do that, you can’t have reactive queries on that data. But it’s a win-win.

khamoud · March 31, 2015, 2:13am

@arunoda Thanks for the reply! This seems like a huge fault with Meteor and a reason to move away from the framework. This is honestly intolerable because it literally makes horizontal scaling impossible. The quick answer to “does meteor scale?” should always be no, because all that adding extra servers does is increase the number of crashing servers. I’m not trying to be negative; I am only being honest, because this is an issue that @jkatzen, my team, and I have been facing for the past few weeks, thinking that it was something we could fix or optimize without moving away from Meteor.

arunoda · March 31, 2015, 2:36am

I got it.
But that’s due to how Meteor’s observe works. (Anyway, Meteor is trying to make MongoDB realtime, and it is not a realtime DB.)

I can think of a solution, but I’m not sure it’s easy to describe here. I hope there’ll be an easy solution soon.

Maybe @glasser can help us with any future plans.

jkatzen · March 31, 2015, 2:46am

@arunoda Thanks for the quick reply! We had these suspicions from the time we had started diagnosing what was going wrong with our application, but this pretty much solidifies that hunch.

While we knew about Meteor’s limitations from the get-go, we ended up maxing out our servers with pretty minimal load. As you said before, it is how the observers work and for our use-case unfortunately, observer re-use is very low. It saddens me that we will have to move much of our functionality off of meteor and our largest collections into a separate database, but I suppose that is the solution for now.

arunoda · March 31, 2015, 3:03am

Yes. That’s gonna work. And then you can connect to that DB with meteor and still invoke queries.
We have a few lessons on BulletProofMeteor on this. I will send you the links.

arunoda · March 31, 2015, 3:21am

Note on using a separate DB: https://bulletproofmeteor.com/database-modeling/processing-high-velocity-metrics/4

And how to connect to a different DB and use it.
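As a rough illustration of connecting to a second database from Meteor server code, the sketch below uses `MongoInternals.RemoteCollectionDriver` and the `_driver` collection option. Both are Meteor internals rather than documented public API, so treat this as an assumption-laden sketch that may need adjusting for your Meteor version. The helper name `openSeparateCollection`, the URL, and the collection name are hypothetical, and the Meteor globals are passed in as an argument only so the function can be read and exercised on its own.

```javascript
// Sketch for Meteor server code. MongoInternals.RemoteCollectionDriver and
// the `_driver` option are Meteor internals and may change between releases.
// openSeparateCollection is a hypothetical helper, not part of Meteor.
function openSeparateCollection(meteor, mongoUrl, name) {
  // Only the data URL is passed here; no oplog URL is configured for this
  // second database in this sketch.
  const driver = new meteor.MongoInternals.RemoteCollectionDriver(mongoUrl);
  return new meteor.Mongo.Collection(name, { _driver: driver });
}

// In a Meteor app (server side) this would be called roughly as:
//   const Metrics = openSeparateCollection(
//     { MongoInternals, Mongo },
//     "mongodb://metrics-host:27017/metrics", // placeholder URL
//     "metrics"                               // placeholder collection name
//   );
//   Metrics.find({}).fetch();
```

Queries against a collection opened this way still work from Meteor methods, while the high-velocity writes stay out of the main database's oplog.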

powderkeg · March 31, 2015, 3:58am

When you say high velocity here, do you mean that the app’s client(s) are designed to send a ton of data per second?

glasser · March 31, 2015, 4:02am

Just to be sure: you did set METEOR_OPLOG_TOO_FAR_BEHIND, with METEOR_, right?

jkatzen · March 31, 2015, 4:20am

@powderkeg We have a separate task server that processes some data and then does bulk updates/inserts into our database at anywhere from hundreds to thousands of ops/sec.

@glasser Yes - we did, just forgot to include it in the post, sorry.

powderkeg · March 31, 2015, 4:37am

Out of curiosity, do those batch-processed updates and inserts need to go through the oplog? What happens if this bulk data processing is done during a maintenance period when the system is cut off from other client access? Just wondering.

jkatzen · March 31, 2015, 6:19am

@powderkeg Our application needs to be available 24/7, so unfortunately there is no time during “maintenance” where it would be safe to do these operations. Also - users signing up or taking certain actions will kick off certain long running tasks as well, so there is literally no time of the day when these jobs aren’t happening.

workman · May 27, 2015, 9:06pm

If you’re using collection hooks, then that’s your problem. Since migrating off of collection-hooks and the addition of the METEOR_OPLOG_TOO_FAR_BEHIND env variable, I haven’t seen many issues, and I’m the one who originally brought up the whole idea of that feature.

jkatzen · May 28, 2015, 9:11pm

We have never used collection hooks in our project. In any case - this issue was persistent until we moved the high velocity data into a separate DB instance and things have been smooth since.

dsyko · September 22, 2015, 6:01pm

With METEOR_OPLOG_TOO_FAR_BEHIND enabled, is there any way to determine when the app switches from oplog tailing over to poll-and-diff? It would be great if there were a callback that fires when the transition is made in either direction, so we could log the occurrence.