Horrible degradation of performance using Galaxy and MongoDB Atlas

renanccastro · August 12, 2021, 9:33pm

Hi, Spyridon.

shouldn’t the response time for pub/sub stay low based on the redis server

That’s not exactly true! If your DB is not ok, redis-oplog won’t do much. On every change event, redis-oplog fetches a full copy of the changed document before propagating the changes. If your db is slow, this will be slow too!

On the other hand, if you saw degradation without changes in your code or changes in the audience, there is a big chance it’s in your DB.

hemalr87 · August 13, 2021, 5:46am

Going to add my voice to those having the “scanned/returned ratio has gone above 10,000” issue (10,000 because I raised the threshold after getting sick and tired of this message arriving too frequently at a lower limit). I’ve been trying to debug it for months, and it doesn’t seem to cause any performance problems.

Hunch I’d like to get others’ thoughts on:
One idea I had was that this is being triggered by lots of count() operations. And anecdotally, reducing these has led to fewer triggers (I haven’t yet been able to completely remove/reduce the count operations to test this though and not sure I will be able to).

rjdavid · August 13, 2021, 9:45am

What did you replace it with? countDocuments()?

hemalr87 · August 13, 2021, 10:44am

No, I decided to store the counts in question (3 of them for each user) on the database instead (under the users collection).

It’s a dirty and ugly solution but provided a huge spike in performance. Particularly as the count() code I was using was via the performant counts package - so it would have been running the count every 30 seconds (the interval I had set). Multiply that by so many users and multiple sessions and it adds up.
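
A minimal sketch of the stored-count idea, assuming a hypothetical `tasks` collection and a `counts.openTasks` field on the user document (neither name comes from this thread). Plain Maps stand in for the Mongo collections so the snippet runs on its own; in a real app the two writes would be a collection write plus a `$inc` update on `users`.

```javascript
// In-memory stand-ins for the `users` and `tasks` collections (hypothetical names).
const users = new Map([['u1', { _id: 'u1', counts: { openTasks: 0 } }]]);
const tasks = new Map();

// Equivalent in spirit to an update with { $inc: { 'counts.openTasks': delta } }.
function incOpenTasks(userId, delta) {
  const user = users.get(userId);
  user.counts.openTasks += delta;
}

function addTask(task) {
  tasks.set(task._id, task);
  incOpenTasks(task.userId, 1); // keep the stored count in step with the write
}

function removeTask(taskId) {
  const task = tasks.get(taskId);
  if (!task) return; // nothing removed, so the counter must not change
  tasks.delete(taskId);
  incOpenTasks(task.userId, -1);
}

// Reading the count is now a single document lookup, not a count() query.
function openTaskCount(userId) {
  return users.get(userId).counts.openTasks;
}

addTask({ _id: 't1', userId: 'u1' });
addTask({ _id: 't2', userId: 'u1' });
removeTask('t1');
console.log(openTaskCount('u1')); // 1
```

The trade-off is the one mentioned above: every code path that inserts or removes a document has to touch the counter too, otherwise the stored value drifts.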

There are still a couple of places where count() is being used, but it isn’t feasible to store the more dynamic counts on the database. That would lead to dozens of counts per user, complicated code and a lot of surface area for the counts to fall out of sync. I haven’t figured out a solution to this yet.

I have a hacky idea to reduce the number of count operations by reducing the reliance on performant counts - that may provide more clues as to whether count() is the culprit.

But I thought I’d put this hunch out there to see if everybody else getting these mongo alerts is also using count() and whether it’s a pattern.

Spyridon · August 13, 2021, 6:33pm

Question, if I’m using Redis Oplog, should the APM still be listing observations as “Using Oplog”? Example:

renanccastro · August 16, 2021, 3:15pm

Hi, Spyridon.

Our APM agent uses custom logic for identifying oplog usage. This logic is also being used by redis-oplog to wrap with a custom observer.
In our APM code we have:

observerDriver = ret._multiplexer._observeDriver;
var observerDriverClass = observerDriver.constructor;
var usesOplog = typeof observerDriverClass.cursorSupported == 'function';
endData.oplog = usesOplog;

As you can see, we only check whether the multiplexer’s observeDriver has a function called cursorSupported on its class.

Redis-oplog implements the exact same interface (see RedisOplogObserveDriver.js in cult-of-coders/redis-oplog on GitHub, at commit b2be29ddc2feee460d248a7e5daffd2883c4e512), so APM will always show it as if it were the default oplog in the UI, when in reality it’s redis-oplog.

tl;dr “using oplog” should be shown!
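
To illustrate why both drivers look the same to that check, here is a small self-contained sketch. The three classes are empty stand-ins written for this example (they are not the real Meteor or redis-oplog implementations); only the `cursorSupported` check mirrors the APM snippet above.

```javascript
// Stand-ins for the observe drivers; the class bodies are illustrative only.
class OplogObserveDriver {
  static cursorSupported() { return true; }
}
class RedisOplogObserveDriver {
  static cursorSupported() { return true; }
}
class PollingObserveDriver {} // no cursorSupported on the class

// Same check as the APM snippet: look only at the driver's constructor.
function looksLikeOplog(observeDriver) {
  const observerDriverClass = observeDriver.constructor;
  return typeof observerDriverClass.cursorSupported === 'function';
}

console.log(looksLikeOplog(new OplogObserveDriver()));      // true
console.log(looksLikeOplog(new RedisOplogObserveDriver())); // true
console.log(looksLikeOplog(new PollingObserveDriver()));    // false
```

Because the check is pure duck typing, any driver that exposes `cursorSupported` on its class is reported the same way.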

truedon · August 19, 2021, 2:52pm

Storing counts like this ain’t dirty, it’s pro. That’s how you get stuff moving fast. A count seems simple but it is actually very resource intensive, especially over a million records. Even with a 16-core machine it can run like crap.

You can store thousands of counts and stats, even for data that is filtered, ordered, and split into different time frames (seconds, hours, months, etc.). Just use memcache.

make keys like aLabelYouSpecify_userid_orderedby_timeframe_etc_etc_etc and chuck them in memcache. It loves it.
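
A rough sketch of such a key builder. The part names (`label`, `userId`, `orderedBy`, `timeframe`) are just an illustration of the pattern, not a fixed scheme. It also guards against the two memcached key rules: no whitespace or control characters, and at most 250 bytes.

```javascript
// Build a memcached-style key from the parts that identify one cached count.
// The part names are illustrative; use whatever dimensions your counts have.
function buildCountKey(label, userId, orderedBy, timeframe) {
  const key = [label, userId, orderedBy, timeframe]
    // memcached keys may not contain whitespace or control characters
    .map((part) => String(part).replace(/[\s\x00-\x1f\x7f]+/g, '-'))
    .join('_');
  if (Buffer.byteLength(key, 'utf8') > 250) {
    throw new Error('memcached keys are limited to 250 bytes');
  }
  return key;
}

console.log(buildCountKey('unreadMessages', 'u42', 'createdAt', 'last 24h'));
// unreadMessages_u42_createdAt_last-24h
```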

memcache performs significantly better than redis, you just gotta know how to use it like this to get the maximum power out of it. When you know how to use memcache you will have no need for redis; it’s an over-engineered key store imho.

a4xrbj1 · August 20, 2021, 7:24am

So how do you guys ensure that the stored count is correct, and how do you know when you need to trigger a new count() due to an insert/delete/update?

truedon · August 20, 2021, 8:40am

In pseudo code:

If cache exists
    Use cache
If cache doesn’t exist
    Get count and store to cache,
    set ttl for X minutes

Cache expires automatically after X minutes

You run the above either on user interaction, or if you prefer you can do it in cron to always have a pre-warmed cache.
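
The pseudo code above could be sketched like this. The `TtlCache` class is a tiny in-memory stand-in for memcache, written only for this example, and `slowCount` is a placeholder for the real count query; the clock is injectable so expiry can be shown without waiting.

```javascript
// Tiny in-memory TTL cache standing in for memcache; `now` is injectable for testing.
class TtlCache {
  constructor(now = () => Date.now()) {
    this.now = now;
    this.entries = new Map();
  }
  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) { // expired: behave as a miss
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }
  set(key, value, ttlMs) {
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs });
  }
}

// If the cache has the count, use it; otherwise compute it, store it, and set the TTL.
function cachedCount(cache, key, computeCount, ttlMs) {
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const value = computeCount(); // the expensive count() would run here
  cache.set(key, value, ttlMs);
  return value;
}

let clock = 0;
let dbCalls = 0;
const cache = new TtlCache(() => clock);
const slowCount = () => { dbCalls += 1; return 1234; };

cachedCount(cache, 'posts_u42', slowCount, 60000); // miss: runs the count
cachedCount(cache, 'posts_u42', slowCount, 60000); // hit: served from cache
clock = 60001;                                     // TTL has passed
cachedCount(cache, 'posts_u42', slowCount, 60000); // miss again
console.log(dbCalls); // 2
```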

This design pattern can be applied to any request for data, it is a caching layer. You can also update the cache whenever you want by overwriting it. So in the case of an update or insert you can then also run a set action subsequently in the same block of code.

For example in pseudo code:

On user action
    Update database with new data,
    Overwrite relevant cache keys
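
As a sketch of that write path, with plain Maps standing in for both the database and memcache (the `postCount_<userId>` key and the function names are made up for the example):

```javascript
// Plain Maps stand in for the database table and for memcache.
const db = new Map();    // postId -> { userId }
const cache = new Map(); // cache key -> count

function countPostsInDb(userId) {
  let n = 0;
  for (const post of db.values()) if (post.userId === userId) n += 1;
  return n;
}

// On user action: update the database, then overwrite the relevant cache key.
function addPost(postId, userId) {
  db.set(postId, { userId });
  cache.set(`postCount_${userId}`, countPostsInDb(userId));
}

addPost('p1', 'u42');
addPost('p2', 'u42');
console.log(cache.get('postCount_u42')); // 2
```

Here the overwrite recounts for simplicity. If the count itself is the expensive part, the same block could instead adjust the cached number by one.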

belfigue · January 22, 2023, 10:41pm

@waldgeist, @dokithonon, @vooteles, @hemalr87, and anybody else experiencing the scanned/returned ratio has gone above 1,000 issue: have you solved this?? We receive tens of email alerts from Atlas every day. We can up the Alert threshold to 10,000 as @hemalr87 did to stop them, but then I worry we will miss “real” alerts in the future. I can see from the logs that the Oplog tailing is the culprit, but that is all – complete mystery. We are hosted on Atlas with an M10 and running on MongoDB 5.0.14 and Meteor v2.10.0. We’ve been experiencing this issue at least since we upgraded to Meteor v2.7.3.

hemalr87 · January 22, 2023, 11:04pm

Are you using Redis oplog?

If you have identified the culprit as oplog, then that is a quick and easy fix. In our case, we are consulting with the atlas team in the next couple of months.

It hasn’t caused any issues as such, but I’m pretty sure the culprit in our case is count queries.

waldgeist · January 23, 2023, 7:52am

No, unfortunately I was never able to resolve this. It’s still nagging me when I see these e-mails coming in.

vooteles · January 23, 2023, 4:42pm

Same for me, did not find a solution. It did not cause any issues for the app though, so did not contact Atlas’ support back then. Would be great if somebody identifies what causes this.

belfigue · January 23, 2023, 7:00pm

Thank you, all, for the updates.

@hemalr87 We aren’t using RedisOplog. I’m not familiar with it. Do you think we should switch to RedisOplog? As for the culprit, I thought it was the oplog tailing because I could see from Atlas’ logs that the alerts coincide with “collection scan” operations from the oplog that are examining ~10,000 docs. However, I’m now noticing that these queries are returning 0 docs. So I’m a bit confused. I’m pasting a shortened example of the oplog query log below.

```
{
    "c": "COMMAND",
    "msg": "Slow query",
    "attr":
    {
        "type": "command",
        "ns": "local.oplog.rs",
        "planSummary": "COLLSCAN",
        "keysExamined": 0,
        "docsExamined": 9238,
        "numYields": 10,
        "nreturned": 0,
        "reslen": 228
    }
}
```
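
One possible reading of why a query that returns 0 docs still matters: the sketch below ASSUMES the alert compares total documents scanned with total documents returned over a window, which is not confirmed anywhere in this thread. Under that assumption, a scan with `nreturned: 0` adds to the numerator only. The second entry is made up for contrast.

```javascript
// Sum docsExamined and nreturned over a set of log entries and return their ratio.
// ASSUMPTION: the alert is based on an aggregate scanned/returned ratio.
function scannedReturnedRatio(entries) {
  let scanned = 0;
  let returned = 0;
  for (const { attr } of entries) {
    scanned += attr.docsExamined;
    returned += attr.nreturned;
  }
  return returned === 0 ? Infinity : scanned / returned;
}

const entries = [
  { attr: { ns: 'local.oplog.rs', docsExamined: 9238, nreturned: 0 } }, // from the log above
  { attr: { ns: 'app.items', docsExamined: 10, nreturned: 10 } },       // made-up indexed query
];

console.log(scannedReturnedRatio(entries)); // 924.8
```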

hemalr87 · January 24, 2023, 12:09am

Do you think we should switch to RedisOplog?

If the oplog is the culprit then this is likely to solve it, yes. If not, then it won’t resolve the issue you are having but may still be something you wish to do (if not straight away then sometime down the line).

Looking at all the other responses here, this seems to be a common occurrence? Or just confirmation bias from the nature of this thread?

If it is common, I wonder whether it is:

Something underlying in how Meteor interacts with Mongo or
A common pattern misstep we are all making in how we are querying the database?

I find it weird that all of us have this issue, all of us are bothered by the alert(s), none of us have found the cause and none of us have any major issue with performance in spite of this alert.

For my part, we are trying to determine whether count() queries are causing the issue. Is it possible the same is happening with your applications @waldgeist @vooteles ?

vooteles · January 24, 2023, 12:45pm

I don’t think I had any count queries running at the time the notices appeared. However, this was already a while back and unfortunately I can’t recall much detail. I had other more pressing issues to deal with back then and this issue mostly got neglected as no usability problems appeared. I do recall that I first started seeing those alerts when I tried AWS Elastic Beanstalk (via the mup plugin). Having multiple containers running side by side probably brought the issue above Atlas’ limit for alerts.

Spyridon · January 25, 2023, 6:53pm

For me, eventually Atlas has just stopped sending those alerts. Nothing had changed about the queries in that time (we’re actually working on a completely new version of the software). I guess maybe they got tired of sending the notifications? lol.

belfigue · February 4, 2023, 2:26pm

We just upgraded a different database from a shared cluster to a dedicated M10 one and the alerts immediately started popping up.

lc3t35 · February 18, 2024, 9:04pm

Right now, each publish function blocks all future publishes and methods waiting on data from Mongo (or whatever else the function blocks on). This probably slows page load in common cases.

I found this comment in ddp-server/livedata_server.js (_runHandler, line 1144, in 2.14 or 3.0 beta). Does this explain the locks and waiting time?
Is there a way to fix this ?

Spyridon · February 21, 2024, 5:38am

It’s been a while, but for me, the problem was resolved on the database side.

Working with Atlas support, we discovered that Atlas had some issues when using their +srv connection strings. Long story short, the extra DNS step of the +srv was causing an incredible amount of lag.

Support gave me the full connection string, and that immediately resolved the problem.

So the first thing I would try in your case is get a full connection string, not one that uses +srv. See if that resolves your problem.
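
For reference, a `mongodb+srv://` string makes the driver do DNS lookups (an SRV lookup for the member list and a TXT lookup for options) before it connects, while the standard form lists every member explicitly. A small sketch of assembling the standard form; the host names, replica set name, credentials, and the `buildStandardUri` helper are all placeholders invented for this example, so take the real values from Atlas or from support.

```javascript
// Build a standard (non-SRV) connection string from an explicit member list.
// All values passed in below are placeholders, not real Atlas hosts.
function buildStandardUri({ user, password, hosts, database, options }) {
  const auth = `${encodeURIComponent(user)}:${encodeURIComponent(password)}`;
  const query = new URLSearchParams(options).toString();
  return `mongodb://${auth}@${hosts.join(',')}/${database}?${query}`;
}

const uri = buildStandardUri({
  user: 'appUser',
  password: 'p@ss',
  hosts: [
    'cluster0-shard-00-00.example.mongodb.net:27017',
    'cluster0-shard-00-01.example.mongodb.net:27017',
    'cluster0-shard-00-02.example.mongodb.net:27017',
  ],
  database: 'app',
  // The +srv form implies TLS; the other options normally come from the TXT record.
  options: { tls: 'true', replicaSet: 'atlas-abc123-shard-0', authSource: 'admin' },
});

console.log(uri);
```

In a Meteor app the resulting string would go into `MONGO_URL`.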

Good luck!