Oplog tailing: lowering METEOR_OPLOG_TOO_FAR_BEHIND not helping

Hello - we're having an issue with our application where CPU usage keeps spiking out of control, causing our app to return 499s or 502s until it recovers.

At first we found that we were pushing many operations into the oplog and thought that was bogging down our system, so we lowered METEOR_OPLOG_TOO_FAR_BEHIND from its default of 2000 to 250, and finally all the way down to 100. Each step alleviated the spikes a little, yet they were still happening pretty frequently over the course of a day.
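
For reference, the threshold is read from the server's environment at startup and defaults to 2000 buffered oplog entries. A purely illustrative sanity check (not our actual code) that the value reached the process:

```js
// Server-side startup check. METEOR_OPLOG_TOO_FAR_BEHIND controls how many
// unprocessed oplog entries may queue up before Meteor temporarily abandons
// tailing and re-polls the affected queries.
Meteor.startup(function () {
  console.log('METEOR_OPLOG_TOO_FAR_BEHIND =',
    process.env.METEOR_OPLOG_TOO_FAR_BEHIND || '(unset, default 2000)');
});
```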

Next, we found an update operation that was huge - practically eating up our entire oplog - so we broke it into smaller batched updates that also diff properly before writing, drastically reducing the size of each update. Again, fixing this seemed to help somewhat with the server spiking, yet it was still happening far too frequently.
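
Roughly, the chunking looks like this - the collection, batch size, and helper name are illustrative, not our actual code:

```js
// Instead of one massive multi-document update, fetch the matching _ids
// and write in small batches so no single write floods the oplog.
var BATCH_SIZE = 500;

function updateInBatches(collection, selector, modifier) {
  var ids = collection.find(selector, { fields: { _id: 1 } })
    .map(function (doc) { return doc._id; });

  for (var i = 0; i < ids.length; i += BATCH_SIZE) {
    var batch = ids.slice(i, i + BATCH_SIZE);
    collection.update({ _id: { $in: batch } }, modifier, { multi: true });
  }
}
```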

At this point, we have turned oplog tailing off entirely to see whether the same spiking still occurs. This is less than ideal, as we would love to keep the reactivity oplog tailing provides, but if it keeps crashing our app servers it won't be worth it.
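
For anyone else trying this: unsetting MONGO_OPLOG_URL disables oplog tailing app-wide, and Meteor also lets a single publication opt out via the cursor's disableOplog option (spelled _disableOplog on some older releases). The publication and collection below are hypothetical examples:

```js
// With disableOplog set, this cursor falls back to the 10-second
// poll-and-diff driver instead of tailing the oplog.
Meteor.publish('busyItems', function () {
  return BusyItems.find({}, { disableOplog: true });
});
```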

The fact that we had to turn the threshold down from 2000 to 250 and again to 100 without much improvement suggests that something else is going on within our application. After more than a week of searching, with enhancements pushed almost daily, we are running out of ideas about what could be causing this. We have reduced operation sizes, optimized queries, and offloaded application code, all to no avail. If anyone has any insight, it would be much appreciated.

Thanks!


The only real fix is to move this high-velocity data out of the MongoDB database Meteor is using. Once you do that, you can't run reactive queries on that data, but it's a trade-off worth making.


@arunoda Thanks for the reply! This seems like a huge fault with Meteor and a reason to move away from the framework. It's honestly intolerable, because it makes horizontal scaling effectively impossible. The quick answer to "does Meteor scale?" should always be no, because all adding extra servers does is increase the number of crashing servers. I'm not trying to be negative, only honest: this is an issue that @jkatzen, my team, and I have been facing for the past few weeks, believing it was something we could fix or optimize without moving away from Meteor.

I get it.
But that's due to how Meteor's observe mechanism works. (Meteor is essentially trying to make MongoDB behave like a realtime database, which it is not.)

I can think of a solution, but it's not easy to describe here. I hope there'll be an easier answer soon.

Maybe @glasser can weigh in on any future plans.


@arunoda Thanks for the quick reply! We had these suspicions from the time we started diagnosing what was going wrong with our application, and this pretty much confirms that hunch.

While we knew about Meteor's limitations from the get-go, we still ended up maxing out our servers under pretty minimal load. As you said, it comes down to how the observers work, and unfortunately observer re-use is very low for our use case. It saddens me that we will have to move much of our functionality off of Meteor and our largest collections into a separate database, but I suppose that is the solution for now.

Yes, that will work. You can then connect to that DB from Meteor and still run queries against it.
We have a few lessons on BulletProofMeteor about this; I will send you the links.

Note on using a separate DB: https://bulletproofmeteor.com/database-modeling/processing-high-velocity-metrics/4

And how to connect to a different DB and use it.
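
In short, the sketch looks like this, using Meteor's internal MongoInternals.RemoteCollectionDriver - the URL and collection name are placeholders, and since the API is internal, treat it as semi-supported:

```js
// Point a Meteor collection at a second MongoDB deployment. Whether these
// collections stay reactive depends on whether that deployment exposes an
// oplog Meteor can tail.
var driver = new MongoInternals.RemoteCollectionDriver(
  'mongodb://user:pass@other-db.example.com:27017/metrics'
);

Metrics = new Mongo.Collection('metrics', { _driver: driver });

// Queries then work as usual, e.g.:
// Metrics.find({ type: 'pageview' }).count();
```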


When you say high velocity here, do you mean that the app's clients are designed to send a ton of data per second?

Just to be sure: you did set METEOR_OPLOG_TOO_FAR_BEHIND with the METEOR_ prefix, right?

@powderkeg We have a separate task server that processes data and then does bulk updates/inserts into our database at anywhere from hundreds to thousands of ops/sec.
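
The writes themselves look roughly like the sketch below - collection and document names are placeholders, not our actual code. Note that even a single bulk call still produces one oplog entry per document, which is exactly what the tailing has to keep up with:

```js
// Illustrative bulk insert from the task server using the raw node driver
// collection exposed by rawCollection() on newer Meteor releases.
var Events = new Mongo.Collection('events');
var docs = [ /* batch of processed results from the task queue */ ];

if (docs.length > 0) {
  var bulk = Events.rawCollection().initializeUnorderedBulkOp();
  docs.forEach(function (doc) {
    bulk.insert(doc);
  });
  bulk.execute(function (err, result) {
    if (err) console.error('bulk insert failed:', err);
  });
}
```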

@glasser Yes, we did - I just forgot to include it in the post, sorry.

Out of curiosity, do those batch-processed updates and inserts need to go through the oplog at all? What happens if the bulk processing is done during a maintenance period when the system is cut off from other client access? Just wondering 🙂

@powderkeg Our application needs to be available 24/7, so unfortunately there is no "maintenance" window where it would be safe to run these operations. Also, users signing up or taking certain actions kick off long-running tasks of their own, so there is literally no time of day when these jobs aren't running.

If you're using collection hooks, then that's your problem. Since migrating off of collection-hooks, and with the addition of the OPLOG_TOO_FAR_BEHIND env variable, I haven't seen many issues - and I'm the one who originally raised the idea for the OPLOG_TOO_FAR_BEHIND feature.

We have never used collection hooks in our project. In any case, this issue persisted until we moved the high-velocity data into a separate DB instance, and things have been smooth since.


With METEOR_OPLOG_TOO_FAR_BEHIND set, is there any way to determine when the app makes the switch from oplog tailing over to poll-and-diff? It would be great if there were a callback that fired when the transition is made in either direction, so we could log each occurrence.
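
The closest workaround I can think of is watching the counters kept by Meteor's facts package (meteor add facts), which the Mongo driver increments as observe drivers are created. A sketch, with the caveat that Facts._factsByPackage is internal and undocumented, so verify it against your Meteor release:

```js
// Log how many observers got the oplog driver versus the polling driver.
// A growing polling count indicates queries falling back to poll-and-diff.
Meteor.setInterval(function () {
  var mongoFacts = (Facts._factsByPackage || {})['mongo-livedata'] || {};
  console.log(
    'oplog observe drivers:', mongoFacts['observe-drivers-oplog'] || 0,
    'polling observe drivers:', mongoFacts['observe-drivers-polling'] || 0
  );
}, 10000);
```

As far as I know, this only shows which driver each observer got at creation; the temporary too-far-behind fallback inside the oplog driver isn't directly exposed, so a real callback would still be very welcome.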