[SOLVED] Problem using atlas oplog connection

tcastelli · August 21, 2019, 9:34am

Hello people, since yesterday we’ve been experiencing a problem on several Meteor apps (all on 1.8.1): the apps stop responding because of a massive number of oplog “catch up” query failures. The only thing we can see in the logs is Got exception while reading last entry of undefined, which in the Meteor code leads us to https://github.com/meteor/meteor/blob/e0caf13103e21ceca9c1a6616f67dac8f97c5c48/packages/mongo/oplog_tailing.js#L154

Searching the forums, I found some old posts mentioning that $natural might not be supported by the oplog, but nothing was concluded in the end, so maybe someone has already experienced this and can tell us what is going on. The setup has been working wonderfully for some time now, so I’m not sure why it would suddenly trigger this error (Atlas limitations on the oplog?).

For now, disabling the oplog did the trick and the apps work again, but we would like to get to the bottom of the problem and restore the oplog asap.

Thanks!

UPDATE: I have tried to run the query against our oplog database and indeed I get an error:

Mongo Server error (MongoQueryException): Query failed with error code 8000 and error message 'error going deeper into doc error going deeper into doc bad type going deeper into array i' 

It looks like this part of the query is the one producing the error:
{ op: { $in: ['i', 'u', 'd'] } },

What’s stranger, if I replace that part with three separate $or entries (op: “i”, op: “u”, op: “d”), then it works.
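
For anyone who wants to try the same rewrite, here is a minimal sketch. The helper name expandIn is hypothetical (it is not part of Meteor or the MongoDB driver), and the selector is only the fragment quoted above, not the full oplog query:

```javascript
// Hypothetical helper (not part of Meteor): expands { field: { $in: [...] } }
// into one equality clause per value, suitable for use inside an $or array.
function expandIn(clause) {
  const [field] = Object.keys(clause);
  const values = clause[field] && clause[field].$in;
  if (!Array.isArray(values)) return [clause]; // leave other clauses untouched
  return values.map((value) => ({ [field]: value }));
}

// The clause that produced the error on the affected deployments.
const failing = { op: { $in: ['i', 'u', 'd'] } };

// The equivalent form that worked: three separate equality entries.
const working = expandIn(failing);

console.log(JSON.stringify({ $or: working }));
// prints {"$or":[{"op":"i"},{"op":"u"},{"op":"d"}]}
```

Both forms should match the same documents, so this is only a way to test whether the $in operator itself is what the server rejects.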

UPDATE2: This issue is now solved. It turns out Atlas had changed something on M0 and M2 deployments that was causing the error.

andruli · August 21, 2019, 12:46pm

Hi @tcastelli, we are using Atlas and started experiencing the same issues last night. I replicated your oplog findings. We’ll be reaching out to Atlas support and will post back here if we get any news.

jcha · August 21, 2019, 2:30pm

hi all, same here.
My servers are on Galaxy and MongoDB Atlas.
What I observed in Atlas monitoring is that the app opens more and more connections until it reaches the connection limit.
After I restarted my server in Galaxy, the connections to Atlas seem to be normal, but it still produces 100+ queries/sec, and I am quite sure there is no infinite loop or anything like that in my queries.
I will reach out to Atlas support too.

jcorner · August 21, 2019, 2:50pm

Hi,
We’re experiencing the same issue on our staging M0 cluster. However, on our M10 the issue doesn’t seem to appear. I have also contacted Atlas support, but I’m looking forward to hearing your solutions!

drone1 · August 21, 2019, 2:53pm

+1

Site is down and sad.

jam · August 21, 2019, 3:55pm

Experiencing the same

jam · August 21, 2019, 5:37pm

Just received this from Mongo support:

Apologies for the trouble. This is a known issue that we are currently working to fix. We will continue to monitor the issue and let you know as soon as this is resolved.

goofiw · August 21, 2019, 6:09pm

I’m also running into the
MongoError: error going deeper into doc error going deeper into doc bad type going deeper into array i
error. The only big change we’ve made is adding a REST server to handle some GraphQL calls from the app. This is the only db call we added, and things are working fine locally.

We’re using simpl-schema, Mongo 4.0.12 Enterprise:

AModel.aggregate([
  {
    $match: {
      participantIds: userId,
    },
  },
  {
    $lookup: {
      from: 'users',
      localField: 'participantIds',
      foreignField: '_id',
      as: 'participants',
    },
  },
  {
    $lookup: {
      from: 'Activities',
      localField: 'activityIds',
      foreignField: '_id',
      as: 'activities',
    },
  },
]);

Does anyone think this is related, or should I open another ticket?

andruli · August 21, 2019, 10:11pm

The issue seems to be resolved now.

minhna · August 22, 2019, 5:04am

wow, that’s very good news. I’m thinking of moving my db to atlas.

jcorner · August 22, 2019, 7:49am

I also got a message from MongoDB Atlas support saying that the issue is solved, and indeed it seems to be gone on my deployment.