Good strategies for debugging MongoDB's Oplogs?

We’ve been having some performance issues, and we’ve managed to link them to our publications. We made some improvements, but it seems there’s still room for more. The problem is that I’m not really sure how to research the topic further, as I don’t know the best way to debug Mongo’s Oplog to see what’s taking a toll on our CPU.

I know there are technologies like Redis Oplog, but without knowing more details about our Oplog usage we can’t know for sure if the problem is our internal logic or Meteor’s Oplog system.

Are there any reliable ways to debug oplogs that would allow us to see metrics about the Oplog notifications and what updates are generating them? Any other suggestions that would improve our understanding of how the Oplogs work in our app would help a lot as well.

We currently use MontiAPM, which allows us to see which Publications receive the most oplog notifications, but it doesn’t allow us to dive deep into what’s generating those notifications.

I call the Radek of the polish!!

There’s no way to know what operation led to an oplog entry (or a change stream event). If you have really low traffic, you could try to match its clusterTime to the slow query log (with a lower threshold if needed). But let’s be realistic – that’d take ages to go through.

What you could do is go through the update change stream events and group them by clusterTime (with a reduced resolution), ns, and operationType. With that in hand, you could check the “busy” times (i.e., more than X entries in one slot) for patterns in the updateDescription. That’s still a rather manual process, but if the peaks are really high and infrequent, they should be easy to spot.
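
A minimal sketch of that grouping, assuming the events have already been collected into an array. The `ChangeEventLike` shape, the `clusterTimeSeconds` field (the seconds part of the event's clusterTime, i.e. the high 32 bits of the BSON timestamp), and the `findBusySlots` name are all made up for illustration; they are not part of the driver.

```typescript
// Hypothetical, simplified view of a change stream event.
// Real events carry more fields; clusterTime is a BSON Timestamp there,
// reduced to whole seconds here for the sake of the example.
interface ChangeEventLike {
  operationType: string;
  ns: { db: string; coll: string };
  clusterTimeSeconds: number;
}

interface BusySlot {
  slotStart: number; // first second of the time slot
  ns: string; // "db.collection"
  operationType: string;
  count: number;
}

// Groups events by (time slot, namespace, operation type) and returns
// only the slots with more than `threshold` events, busiest first.
function findBusySlots(
  events: ChangeEventLike[],
  slotSeconds: number,
  threshold: number,
): BusySlot[] {
  const slots = new Map<string, BusySlot>();
  for (const event of events) {
    // Reduce the time resolution by rounding down to the slot boundary.
    const slotStart =
      Math.floor(event.clusterTimeSeconds / slotSeconds) * slotSeconds;
    const ns = `${event.ns.db}.${event.ns.coll}`;
    const key = `${slotStart}|${ns}|${event.operationType}`;
    const slot = slots.get(key);
    if (slot) {
      slot.count += 1;
    } else {
      slots.set(key, {
        slotStart,
        ns,
        operationType: event.operationType,
        count: 1,
      });
    }
  }
  return Array.from(slots.values())
    .filter((slot) => slot.count > threshold)
    .sort((a, b) => b.count - a.count);
}
```

Something like `findBusySlots(events, 10, 100)` would then list every 10-second slot with more than 100 events of one type in one collection; the slot size and threshold are guesses that need tuning per app.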

(Now I’m thinking that it could be automated… Let me know if you’re interested in such a tool.)

And one more idea: wrap update in your app. It could check the number of modified documents and if it’s high enough, log the query and modifier. Note that it’ll work only if you’re not using $merge or bulk writes.
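
A rough sketch of such a wrapper, assuming the collection exposes an `updateAsync` method that resolves to the number of affected documents (Meteor collections should behave this way, but treat that as an assumption to verify). The `UpdatableCollection` interface, the `wrapUpdate` name, and the threshold are hypothetical.

```typescript
// Hypothetical minimal interface: anything with an async update that
// resolves to the number of affected documents.
interface UpdatableCollection {
  updateAsync(
    selector: object,
    modifier: object,
    options?: object,
  ): Promise<number>;
}

interface LargeUpdateInfo {
  selector: object;
  modifier: object;
  affected: number;
}

// Replaces `updateAsync` in place with a version that reports every call
// affecting at least `threshold` documents, then returns the same result.
function wrapUpdate(
  collection: UpdatableCollection,
  threshold: number,
  log: (info: LargeUpdateInfo) => void = (info) => {
    console.warn("Large update:", JSON.stringify(info));
  },
): void {
  const original = collection.updateAsync.bind(collection);
  collection.updateAsync = async (selector, modifier, options) => {
    const affected = await original(selector, modifier, options);
    if (affected >= threshold) {
      log({ selector, modifier, affected });
    }
    return affected;
  };
}
```

Calling `wrapUpdate(Tasks, 100)` once at startup (with `Tasks` standing in for one of your collections) would log the selector and modifier of every update touching 100 or more documents. Writes that bypass this method are not covered.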

@radekmie thanks for the suggestions, they look great. Even if they involve some manual process, maybe we could write some scripts to help us with that, I suppose.

However, I’m still lost on some details.

  • “you could try to match its clusterTime to the slow query log” - clusterTime of what? The oplog entry? If so, how am I supposed to access them? Directly from the database or using Meteor somehow?
  • “What you could do, is to go through the update change stream events” - Is this still related to the oplog entries? If so, how do I detect that an event is an update? Are there good docs for it somewhere? I searched through MongoDB’s but they got me confused

I know I’m asking for a lot of details, but searching through Mongo’s docs didn’t lead me to anything useful. Any tips on how to continue my research are greatly appreciated :slight_smile:

Thanks!

The clusterTime of the change stream events. Oplog entries also have it, but it’s not documented at all.

Directly from the database. You can start a change stream from a given timestamp and analyze it.

Change streams are kind of a replacement for the oplog, at least from the user’s perspective (the oplog remains the internal structure). You can read more about it here: https://www.mongodb.com/docs/manual/reference/change-events/update/.
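
To make the "is it an update?" part concrete: update events have `operationType` set to `"update"` and carry an `updateDescription` with `updatedFields` and `removedFields`, as described on the page linked above. Here is a small sketch that tallies which fields the update events touch; the `UpdateEventLike` shape and the `countTouchedFields` name are invented for the example, and the events are assumed to be already collected into an array.

```typescript
// Hypothetical, trimmed-down view of a change stream event; only the
// fields used below are modeled.
interface UpdateEventLike {
  operationType: string;
  updateDescription?: {
    updatedFields?: Record<string, unknown>;
    removedFields?: string[];
  };
}

// Counts how many update events touched each field, whether the field
// was set or removed. Non-update events are skipped.
function countTouchedFields(events: UpdateEventLike[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const event of events) {
    if (event.operationType !== "update") {
      continue;
    }
    const description = event.updateDescription;
    const fields = Object.keys(description?.updatedFields ?? {}).concat(
      description?.removedFields ?? [],
    );
    for (const field of fields) {
      counts.set(field, (counts.get(field) ?? 0) + 1);
    }
  }
  return counts;
}
```

With the Node.js driver, the events themselves could come from a change stream opened with the `startAtOperationTime` option, which is how a stream is started from a given timestamp.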

Thanks for the information! I’ll start digging into the sources you shared :wink: