Mongo Scaling Issues

elie · March 16, 2017, 10:14pm

It’s not always clear from Kadira what is taking a tonne of time. Are you
sure no COLLSCANs are happening in a db lookup? How are you running mongo?

Also, how many instances do you have running? If it’s only one, I’d add
at least one more so that you can easily scale horizontally if you need to.
Scaling to 2 is as hard as scaling to 20.

Elie

evolross · March 17, 2017, 4:00pm

I’m running mongo on compose.io. I have one app instance on Galaxy (but I have two apps that use the same database). Only one of the apps is having issues though because that’s the one that has all the users. The other is the “admin” app that has few users.

My CPU usage is really low. Is adding more instances going to help? Per your article you say not to do that as it puts more strain on mongo. All my slowness is from pub-sub so it’s all having to do with mongo and my publications. I emailed compose.io and they said they didn’t see any COLLSCANS on my database.

I’m seeing a few other posts around with this same issue of meteor.loginServiceConfiguration and meteor_autoupdate_clientVersions but no solutions. I am on Meteor 1.1.0.3… perhaps updating my app would be a good box to check. I may have to do this anyway to use redis-oplog if I can’t solve this myself.

elie · March 18, 2017, 5:09pm

I should change that line in the article then if that’s what people are
taking from it. You should definitely be using multiple instances. You’re a
long way away from needing Redis oplog. When you attach 80 instances to
mongo that MIGHT become an issue. At this point it most certainly isn’t.

You should read the first article. The first thing to do to scale meteor is
move to multiple instances. I still don’t know what your specific issue is,
but 100ish connections on a single meteor instance will very likely strain
it. Even 50 and you’re getting into dangerous territory.

evolross · May 26, 2017, 3:56am

Wow… still frustrated. So it’s been a few months. I updated my app to Meteor 1.2.1 (will be going to 1.4.x in the next few months). I stripped out ALL unnecessary reactivity except for one main object with a few reactive fields needed. I was like Ripley with the flame thrower - everything that didn’t need reactivity was retooled using Meteor methods and works well.

And still I’m getting super-long thirty and forty second pub-sub wait times when my app ramps up to 100-200 users. I even had three containers going when this happened. See the below images taken from Meteor’s new APM tool based on Kadira:

And again, my pub-sub is waiting on meteor.loginServiceConfiguration. Notice my Meteor method times are all really fast. Here’s the stack trace again pointing to meteor.loginServiceConfiguration:

I’m still convinced this is some kind of side-effect of something else being wrong. I noticed my observeChanges have four and five second wait times as well. I’m going to keep hammering on things and see if I can figure out what the issue is. I have a few more things I want to try. Any ideas from anyone? Anyone having similar issues?

XTA · May 26, 2017, 4:49am

Did you check slow query logs on MongoDB to see if you miss some indexes?

vigorwebsolutions · May 26, 2017, 4:56am

Or check out redis-oplog?

elie · May 26, 2017, 7:22am

This looks like a classic missing index. Find all the COLLSCANs happening in
your db. How big are the collections?

evolross · May 26, 2017, 5:01pm

I’m confident it’s not my indexes. I’ve gone over them several times. All my Meteor methods are running super fast. So my objects are being queried fast via Meteor methods.

I removed service-configuration from my app for now, going to see if that helps and/or throws the delay onto something else.

I’ll be testing out redis-oplog in another month or two, however my delays are both being caused by Meteor packages… not really my own objects/subscriptions.

nadeemjq · May 27, 2017, 10:08am

Are you using compose?

If so, ask them to turn on logging for slow queries.

That should reveal a lot, including whether or not those queries were using an index.

I had one AHA-moment when I realised that even though my indexes were set, merely querying in the wrong sort order caused index misses. There are a lot of gotchas in the world of Mongo!
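A minimal sketch of that sort-order gotcha, assuming the usual MongoDB rule that a compound index can serve a sort only when the sort follows a prefix of the index keys in the same directions or with every direction reversed. The field names are hypothetical, and the model ignores the case where equality conditions in the query let a sort use later index keys.

```javascript
// Simplified model of when a compound index can serve a sort.
// Specs are arrays of [field, direction] pairs, e.g. [["createdAt", 1], ["score", -1]].
function indexSupportsSort(indexKeys, sortKeys) {
  if (sortKeys.length === 0 || sortKeys.length > indexKeys.length) return false;
  let sameDirection = true; // sort walks the index forwards
  let reversed = true;      // sort walks the index backwards
  for (let i = 0; i < sortKeys.length; i++) {
    const [indexField, indexDir] = indexKeys[i];
    const [sortField, sortDir] = sortKeys[i];
    if (indexField !== sortField) return false; // not a prefix of the index
    if (indexDir !== sortDir) sameDirection = false;
    if (indexDir !== -sortDir) reversed = false;
  }
  return sameDirection || reversed;
}

// Hypothetical index: { createdAt: 1, score: -1 }
const exampleIndex = [["createdAt", 1], ["score", -1]];
const mixedSortUsesIndex = indexSupportsSort(exampleIndex, [["createdAt", 1], ["score", 1]]);
```

Here `mixedSortUsesIndex` is false: sorting both fields ascending matches neither the index order nor its exact reverse, so the index exists but the sort cannot use it.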

SkyRooms · May 28, 2017, 12:48am

Hi. I just did a LOT of research with Kadira and Mongo optimization.

Have a look at my post here, maybe it’s of use to you.

SkyRooms · May 28, 2017, 12:49am

You’re running Mongo with OpLog enabled in a cluster of at least three instances, yes?

gothicmage · May 28, 2017, 10:51am

Still using redis-oplog and grapher.
Optimizing queries really saves nerves, not to mention reactive collections are overkill quite often; grapher allows for easily switching them into methods.

evolross · June 15, 2017, 6:44am

So I had another 150 or so users hit my server again.

I had removed service-configuration from my app. And as I predicted, the long wait times have shifted over to meteor_autoupdate_clientVersions. See the below stack traces where I’m seeing 92,601 ms wait times. The good news is on the stack trace specifically for meteor_autoupdate_clientVersions all the wait time is located in one or two of the async calls. I tried Googling about this but I’m not finding much. Does this have something to do with not calling this.unblock() correctly? Or something along those lines of Meteor/Node blocking for some reason? I could very well be doing something wrong here. I’ve experimented with these calls before but didn’t notice much difference.

@XTA @elie @nadeemjq I had help from compose.io about turning on slow query logging back in March and I had it enabled during this incident. The good news is I barely have any slow queries logged - 19 total actually. Most are from April where I did actually have a missing index issue. The last few were in early June a few days before this incident but they’re unrelated. ALL of my slow queries were less than about 150ms, so they’re not the issue. This slowness is not a database issue.

EDIT: I guess this is no longer a “Mongo Scaling Issue” per the thread. Though I still haven’t solved what the actual issue is.

@SkyRooms Yes I’m running Mongo with OpLog enabled. Sometimes I run a single node, sometimes I run two. Was doing some experimenting. For these latest benchmarks I’m running two.

elie · June 15, 2017, 8:35am

Well firstly I’d open up a new thread about it. Might get more responses to
the specific issue at hand.

SkyRooms · June 15, 2017, 2:55pm

Re-read the post. I see what you’re saying, login services takes forever to respond.

If it were me, I’d try meteor create version2 and import all my stuff. See if that helps with fresh packages.

I know after a while you end up with more crap thrown in a project than you realize. Start with a fresh cup of coffee and fresh project. See if that helps.

#Debugging…

evolross · June 16, 2017, 10:23pm

I’m still wondering if this issue in this thread https://github.com/meteor/meteor/issues/4559 is my problem. This is strange because my app doesn’t even require the user to login. I don’t have any of the accounts packages added.

EDIT: Looks like I am using accounts-base@1.2.2. Going to look into removing this and see if my app still works as I’m not explicitly using any of the other accounts packages.

elie · June 17, 2017, 6:14pm

Interesting. I’ve also seen methods waiting on login a lot, but it was
usually part of a bigger issue somewhere else where the entire database was
being held up by a missing index.

BTW, why don’t you just update the entire app to 1.5? What’s holding you
back?

evolross · June 23, 2017, 1:07am

TL;DR: OPLOG and/or Meteor poll and diff is the issue. Will integrate redis-oplog in three weeks.

After a lot of thread-reading and Kadira metrics analysis, I think I figured out what the issue is… it’s the classic circa-2015 “OPLOG clogging everything up after about thirty users issue”. And if by any chance it’s not the OPLOG, then it’s the Meteor poll and diff that’s getting me. Sadly, it’s taken me forever to realize I’m having a problem with something that was really well known for the last two years.

I finally found some really helpful threads and realized my app was doing “high velocity” writes. All my users send an update at almost the exact same time, what’s more, they’re all updating the same document. I found a post by MDG about how Meteor 1.3 has some new tools for disabling the OPLOG and how a large number of writes on the same object can create a hotspot without maxing out the CPU which was a phenomenon I was seeing. The following pages were really helpful:

OPLOG: A TAIL OF WONDER AND WOE (Very similar to my use-case)
Meteor OPLOG flooding (Great forum thread)
Tuning Meteor Mongo Livedata for Scalability
My experience hitting limits on Meteor performance (Also similar to my use-case)
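
On the “high velocity” writes to a single document: one common mitigation, which is not something this thread confirms was used, is to coalesce writes in memory and flush them as one update per document. A rough sketch, assuming the writes are additive counters; the class name, document id and field name are all hypothetical, and `applyUpdate` stands in for whatever actually talks to the database.

```javascript
// Collects increments in memory so that many near-simultaneous writes to the
// same document become a single $inc update when flushed.
class IncrementCoalescer {
  constructor() {
    this.pending = new Map(); // docId -> { fieldName: summedAmount }
  }

  add(docId, field, amount) {
    const fields = this.pending.get(docId) || {};
    fields[field] = (fields[field] || 0) + amount;
    this.pending.set(docId, fields);
  }

  // Calls applyUpdate(docId, modifier) once per document, then resets.
  // Returns the number of updates issued.
  flush(applyUpdate) {
    const batch = this.pending;
    this.pending = new Map();
    for (const [docId, fields] of batch) {
      applyUpdate(docId, { $inc: fields });
    }
    return batch.size;
  }
}

// 150 users voting on the same (hypothetical) document at nearly the same time.
const coalescer = new IncrementCoalescer();
for (let i = 0; i < 150; i++) coalescer.add("poll-1", "votes", 1);

const issuedUpdates = [];
coalescer.flush((docId, modifier) => issuedUpdates.push({ docId, modifier }));
```

In this run `issuedUpdates` holds a single `$inc` of 150 rather than 150 separate writes, so the oplog would carry one entry for the burst. The trade-off is that the flush interval adds latency and pending increments are lost if the process dies before flushing.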

I did a “deep dive” into my Meteor APM (Kadira) data for this latest server slowness and I started noticing a lot of funny things. I found stack traces where meteor_autoupdate_clientVersions was doing a bunch of observeChanges that were really slow. A lot of observeChanges in my app were fine with only a few users but once more than twenty or thirty logged in, the same observeChanges slow to a crawl.

The issue with meteor_autoupdate_clientVersions and meteor.loginServiceConfiguration having long wait times is indeed as I suspected merely a side-effect. If I understand correctly, both of those calls are synchronous on each client… they don’t use this.unblock(). So if my app is loaded up with 150 users and the OPLOG is clogged to all hell and back, those new users who log in get stuck waiting on meteor_autoupdate_clientVersions and meteor.loginServiceConfiguration which is in alignment with what my users were seeing in the app. And there’s always new users logging into my app because my app is very bursty - i.e. it goes from 5 users to 150 or more in about a minute.
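
A toy model of that per-client queueing, assuming (as a simplification) that the server handles one client’s messages strictly in order unless a handler unblocks. The message names and durations are made up for illustration, and this is not Meteor’s actual implementation.

```javascript
// Each message: { name, durationMs, unblock }.
// Returns, per message, how long it waited before it could start.
function simulateClientQueue(messages) {
  let queueFreeAt = 0; // time at which the next message is allowed to start
  return messages.map((msg) => {
    const startedAt = queueFreeAt;
    const finishedAt = startedAt + msg.durationMs;
    // A handler that unblocks lets the next message start right away;
    // otherwise the next message waits for this one to finish.
    queueFreeAt = msg.unblock ? startedAt : finishedAt;
    return { name: msg.name, waitMs: startedAt, finishedAt };
  });
}

// A slow observer set up first, then the cheap built-in subscription.
const blocked = simulateClientQueue([
  { name: "slowObserve", durationMs: 3000, unblock: false },
  { name: "clientVersions", durationMs: 10, unblock: false },
]);
const unblocked = simulateClientQueue([
  { name: "slowObserve", durationMs: 3000, unblock: true },
  { name: "clientVersions", durationMs: 10, unblock: false },
]);
```

In the blocked case the cheap subscription shows a 3000 ms wait even though its own work takes 10 ms, which is the shape of what APM reports: the wait time is charged to whatever was stuck behind the slow work, not to the slow work itself.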

Another huge gotcha I discovered is my admin app was actually creating/getting a ton of OPLOG traffic from my non-reactive mobile-client app. My app is actually two apps. An admin app and an end-user mobile-client app, both connecting to the same database. In the mobile-client app, I flame-thrower’d all the reactivity out of it (per my input above) and didn’t understand why the OPLOG was still killing me… well, I didn’t remove any reactivity out of the admin app because there’s only ever maybe five users at any time using the admin app. However, if my assumptions are correct, those five users on the admin app are all processing the reactivity of my 150 mobile-client users. So all those interactions all happening at the exact same time on the exact same object are being multiplied by however many admin users I have. And also still a little bit on my mobile-client app because I had to leave a little bit of reactivity in it. So this was something I didn’t consider… that Meteor was still having to process the OPLOG for my admin app users.
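
A back-of-the-envelope version of that multiplication, as a rough upper bound only. It assumes every app instance tailing the oplog reads every write regardless of which app made it, and that each subscribed client gets one change message per write. The instance count is an assumption; the 150 writers and 5 admin users come from the numbers above.

```javascript
// Rough upper bound on the work caused by one burst of writes to a watched document.
function estimateFanOut({ writes, instancesTailingOplog, subscribedClients }) {
  return {
    // Every tailing instance has to read every oplog entry.
    oplogEntriesProcessed: writes * instancesTailingOplog,
    // Every subscribed client is sent a change for every write.
    changeMessagesSent: writes * subscribedClients,
  };
}

// 150 mobile users each write once; 2 instances tail the oplog (assumed);
// 5 admin users are subscribed to the document being written.
const burst = estimateFanOut({
  writes: 150,
  instancesTailingOplog: 2,
  subscribedClients: 5,
});
```

With these inputs the burst produces 300 oplog entries to process and 750 change messages to send, even though the mobile app itself publishes almost nothing reactively.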

Another weird little gotcha is that MongoDB shows no slow queries. That can’t be true, because Meteor APM (Kadira) is showing me all kinds of find() and fetch() calls (i.e. a Meteor findOne()) that are taking 3000+ ms. I checked the slow queries on my PRIMARY mongo node and it showed only about 19 slow queries, each no more than 200 ms, all from several months ago… weird. I checked my SECONDARY mongo node, which is what enables OPLOG, and its slow query log was empty. BUT, the compose.io help team told me this table can occasionally reset itself when a resync is performed. So there’s a very good chance I’m not seeing my OPLOG slow queries being reported. Not sure how often a resync happens between the nodes… seems fairly often, as I checked the SECONDARY for slow queries a few days after this latest case.

So now, I’m going to build some diagnostic testing tools to simulate all those users so I can create this storm of activity at any time on my own. Then I’m going to disable the OPLOG and see if I still have these issues without it. If so, then I’ll have to figure out what to do about that. If not, I’ll very likely try to start integrating redis-oplog, which makes me very happy because I removed a lot of cool interactivity trying to solve these issues. It would be great to get it back. My only other option is to try some of the Meteor 1.3 disableOplog options on publications (I’m still on Meteor 1.2.1). I’m sure those would do the trick, but then again, I’m losing some responsiveness. In about three weeks I’m going to jump on this full-time and will report back my findings and results.
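
One possible starting point for such a tool is a connection schedule that reproduces the bursty ramp (5 users to 150 in about a minute). This is only a sketch of the timing side: the function and field names are hypothetical, the ramp is assumed to be evenly spaced, and the part that actually opens a connection for each simulated user is left out.

```javascript
// Builds a list of { user, connectAtMs } entries: baseline users connect at
// time 0, the rest arrive evenly spaced until the ramp window ends.
function burstSchedule({ baselineUsers, peakUsers, rampMs }) {
  const schedule = [];
  for (let i = 0; i < baselineUsers; i++) {
    schedule.push({ user: i, connectAtMs: 0 });
  }
  const newUsers = peakUsers - baselineUsers;
  for (let i = 0; i < newUsers; i++) {
    const connectAtMs = Math.round(((i + 1) / newUsers) * rampMs);
    schedule.push({ user: baselineUsers + i, connectAtMs });
  }
  return schedule;
}

const schedule = burstSchedule({ baselineUsers: 5, peakUsers: 150, rampMs: 60000 });
```

Each entry could then drive a timer that connects one simulated client and fires the same update the real users send, so the storm can be replayed on demand with the OPLOG enabled and disabled.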

So this still could very well be a “Mongo Scaling Issue”… more to do with the OPLOG, but still Mongo involved also.

@elie I haven’t upgraded yet because this is for a production app. In the past, whenever I upgrade I always have a package or two that breaks. That was when I was upgrading right away, so now I’ll probably be okay. But again, with a production app, you have to have the time to test and work out any issues, which at the moment I’m short on. I’ll upgrade to 1.5 hopefully in about three weeks.

SkyRooms · June 23, 2017, 2:43am

You should search the forum here, I did an EXTENSIVE test of Oplog for my MMO game.

Oplog is a REQUIRED Meteor methodology. At least it should be. There should be warnings in a billion places while putting it on production.

Very short-sighted by Meteor, but they’ll probably have a Galaxy based solution in the future.

evolross · June 23, 2017, 5:45am

Yeah I kinda did search the forum… a lot. That was the point in linking all those threads.

What did you use for testing OPLOG? How did you simulate users?