Oplog infinite loop on Galaxy with MongoDB "Overflow sort stage buffered data usage" error

I have recently observed a new oplog-observe failure in my production Meteor application (v1.8.1), running on Galaxy against a MongoDB Classic database (v3.0) on Compose.io (all of which has been running fine for 4+ years). I would be very grateful for any pointers on possible actions.

Incident

I noticed that realtime publication updates had stopped working in the application; looking at the application log on Galaxy immediately revealed MongoDB exceptions being reported every second … see one example below. I bounced the (single-instance Compact-Pro) server and the problem stopped; it has not re-occurred after almost 24 hours. I have so far been unable to determine when the logs started reporting this issue (it's very time-consuming to page back through the logs in the Galaxy UI), but I did notice that errors were being reported every second up to 23:59:59 the previous day, and then did not start again until 09:04 the next morning.

Investigation

I understand that the Overflow sort stage buffered data usage exception relates to an in-memory sort, i.e. a query whose sort cannot be satisfied by an index, for which MongoDB enforces a 32MB buffer limit (MongoDB documentation link). However, there is no clue in the error trace as to which publication query caused the problem (my application has several publications with a sort, plus possibly others in 3rd-party packages).
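For context, the shape of query that can hit this limit is a publication whose selector plus sort cannot be satisfied from an index, so mongod has to buffer and sort all the matching documents in memory. A minimal sketch of such a publication (the collection and field names here are purely illustrative, not from my app):

    import { Meteor } from 'meteor/meteor';
    import { check } from 'meteor/check';
    import { Messages } from '/imports/api/messages'; // hypothetical collection

    // If no index covers { roomId: 1, createdAt: -1 }, MongoDB 3.0 sorts the
    // matching documents in memory, and that sort buffer is capped at 32MB;
    // exceeding it produces the "Overflow sort stage buffered data usage" error.
    Meteor.publish('roomMessages', function (roomId) {
      check(roomId, String);
      return Messages.find(
        { roomId: roomId },
        { sort: { createdAt: -1 } }
      );
    });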

I looked at the segment of oplog_observe_driver.js that prints the Got exception while polling query error (excerpt below). The catch block only treats the error as fatal on the initial run when Mongo returns an error document; otherwise it logs, sleeps for 100ms, and the enclosing loop retries the query, apparently forever.

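      // Excerpt from OplogObserveDriver._runQuery in packages/mongo/oplog_observe_driver.js.
      // Judging by the `break` on success and the sleep in the catch block, this whole
      // section sits inside a retry loop, so a non-fatal exception just logs and retries.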
      var cursor = self._cursorForQuery({ limit: self._limit * 2 });
      try {
        cursor.forEach(function (doc, i) {  // yields
          if (!self._limit || i < self._limit) {
            newResults.set(doc._id, doc);
          } else {
            newBuffer.set(doc._id, doc);
          }
        });
        break;
      } catch (e) {
        if (options.initial && typeof(e.code) === 'number') {
          // This is an error document sent to us by mongod, not a connection
          // error generated by the client. And we've never seen this query work
          // successfully. Probably it's a bad selector or something, so we
          // should NOT retry. Instead, we should halt the observe (which ends
          // up calling `stop` on us).
          self._multiplexer.queryError(e);
          return;
        }

        // During failover (eg) if we get an exception we should log and retry
        // instead of crashing.
        Meteor._debug("Got exception while polling query", e);
        Meteor._sleepForMs(100);
      }

Could it be that a long-running observer's sort query initially succeeded, but at some point the underlying data grew past the 32MB limit and the query then started failing every second? If so, I wonder what happened at midnight to stop the repeated retries.
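One way to check for candidate queries (I think) is to run each publication's selector and sort through explain() in the mongo shell and look for an in-memory SORT stage in the winning plan; the collection and fields below are placeholders:

    // mongo shell, MongoDB 3.0+: a winningPlan containing a SORT stage means the
    // sort is performed in memory and is subject to the 32MB buffer limit; a plan
    // that walks a suitable index (IXSCAN in sort order) is not.
    db.messages.find({ roomId: "abc" })
      .sort({ createdAt: -1 })
      .explain("executionStats");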

It would also be useful if the Meteor._debug statement printed the offending query details, to help determine which query is causing the underlying > 32MB sort.
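Something along these lines is what I have in mind, assuming the driver still has the cursor description to hand at that point (I have not verified this against the exact Meteor release, so treat it as a sketch, not a patch):

    // Hypothetical tweak to the catch block in oplog_observe_driver.js: log which
    // collection/selector/sort was being polled when the exception was thrown.
    Meteor._debug("Got exception while polling query", e, {
      collection: self._cursorDescription && self._cursorDescription.collectionName,
      selector: self._cursorDescription && self._cursorDescription.selector,
      options: self._cursorDescription && self._cursorDescription.options,
    });
    Meteor._sleepForMs(100);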

Any helpful suggestions on ways to proceed would be gratefully received!

Log exception example

Got exception while polling query MongoError: getMore executor error: Overflow sort stage buffered data usage of 33557607 bytes exceeds internal limit of 33554432 bytes
2019-11-18 16:07:44+00:00 at /app/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/cursor.js:575:34
2019-11-18 16:07:44+00:00 at handleCallback (/app/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/cursor.js:204:5)
2019-11-18 16:07:44+00:00 at _setCursorNotifiedImpl (/app/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/cursor.js:433:38)
2019-11-18 16:07:44+00:00 at self._endSession (/app/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/cursor.js:441:46)
2019-11-18 16:07:44+00:00 at Cursor._endSession (/app/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/cursor.js:195:5)
2019-11-18 16:07:44+00:00 at Cursor._endSession (/app/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/cursor.js:226:59)
2019-11-18 16:07:44+00:00 at _setCursorNotifiedImpl (/app/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/cursor.js:441:17)
2019-11-18 16:07:44+00:00 at setCursorNotified (/app/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/cursor.js:433:3)
2019-11-18 16:07:44+00:00 at setCursorDeadAndNotified (/app/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/cursor.js:426:3)
2019-11-18 16:07:44+00:00 at nextFunction (/app/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/cursor.js:574:14)
2019-11-18 16:07:44+00:00 at /app/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/cursor.js:535:7
2019-11-18 16:07:44+00:00 at queryCallback (/app/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/wireprotocol/2_6_support.js:77:7)
2019-11-18 16:07:44+00:00 at /app/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/connection/pool.js:532:18
2019-11-18 16:07:44+00:00 at _combinedTickCallback (internal/process/next_tick.js:132:7)
2019-11-18 16:07:44+00:00 at process._tickDomainCallback (internal/process/next_tick.js:219:9)
2019-11-18 16:07:44+00:00 => awaited here:
2019-11-18 16:07:44+00:00 at Function.Promise.await (/app/bundle/programs/server/npm/node_modules/meteor/promise/node_modules/meteor-promise/promise_server.js:56:12)
2019-11-18 16:07:44+00:00 at Promise.asyncApply (packages/mongo/mongo_driver.js:1042:14)
2019-11-18 16:07:44+00:00 at /app/bundle/programs/server/npm/node_modules/meteor/promise/node_modules/meteor-promise/fiber_pool.js:43:40
2019-11-18 16:07:44+00:00 => awaited here:
2019-11-18 16:07:44+00:00 at Promise.await (/app/bundle/programs/server/npm/node_modules/meteor/promise/node_modules/meteor-promise/promise_server.js:60:12)
2019-11-18 16:07:44+00:00 at SynchronousCursor._nextObject (packages/mongo/mongo_driver.js:1091:12)
2019-11-18 16:07:44+00:00 at SynchronousCursor.forEach (packages/mongo/mongo_driver.js:1105:22)
2019-11-18 16:07:44+00:00 at Cursor.(anonymous function) (packages/mongo/mongo_driver.js:887:44)
2019-11-18 16:07:44+00:00 at Cursor.cursorProto.(anonymous function) (packages/mdg:meteor-apm-agent/lib/hijack/db.js:131:32)
2019-11-18 16:07:44+00:00 at Cursor.kadira_Cursor_forEach [as forEach] (packages/mdg:meteor-apm-agent/lib/hijack/set_labels.js:60:34)
2019-11-18 16:07:44+00:00 at OplogObserveDriver._runQuery (packages/mongo/oplog_observe_driver.js:729:16)
2019-11-18 16:07:44+00:00 at OplogObserveDriver._runInitialQuery (packages/mongo/oplog_observe_driver.js:658:10)
2019-11-18 16:07:44+00:00 at packages/mongo/oplog_observe_driver.js:191:10
2019-11-18 16:07:44+00:00 at packages/mongo/oplog_observe_driver.js:15:9
2019-11-18 16:07:44+00:00 at Meteor.EnvironmentVariable.EVp.withValue (packages/meteor.js:1234:12)
2019-11-18 16:07:44+00:00 at packages/meteor.js:550:25
2019-11-18 16:07:44+00:00 at runWithEnvironment (packages/meteor.js:1286:24)
2019-11-18 16:07:45+00:00 Got exception while polling query MongoError: getMore executor error: Overflow sort stage buffered data usage of 33557607 bytes exceeds internal limit of 33554432 bytes

Any chance you figured this out?

Hi @skaridamm. Nope. It has not re-occurred since. I reviewed all my publication code and added a couple of indexes to support some sorted queries, but these were shots in the dark given the lack of specificity in the error message. Maybe I accidentally fixed the root-cause query.
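For anyone else hitting this: adding an index that matches a publication's selector and sort order looks something like the following at server startup (collection and field names are illustrative only):

    import { Meteor } from 'meteor/meteor';
    import { Messages } from '/imports/api/messages'; // hypothetical collection

    Meteor.startup(function () {
      // A compound index matching the selector + sort lets MongoDB return the
      // documents in sorted order without buffering them for an in-memory sort.
      Messages.rawCollection()
        .createIndex({ roomId: 1, createdAt: -1 })
        .catch((e) => console.error('createIndex failed', e));
    });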

I would prefer that the oplog code retried a certain number of times, perhaps with an exponential backoff strategy, but then eventually stopped retrying the query.
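Purely as a sketch of the behaviour I mean, not a patch against the real oplog_observe_driver.js internals (runQueryOnce and reportFatalError are placeholders):

    // Retry a polling function with exponential backoff, then give up instead of
    // looping forever at one retry per second.
    function pollWithBackoff(runQueryOnce, reportFatalError) {
      var maxRetries = 10;
      var delayMs = 100;
      for (var attempt = 0; attempt <= maxRetries; attempt++) {
        try {
          runQueryOnce();
          return;                        // success: stop retrying
        } catch (e) {
          if (attempt === maxRetries) {  // retries exhausted: halt the observe
            reportFatalError(e);
            return;
          }
          Meteor._debug("Got exception while polling query, retrying", e);
          Meteor._sleepForMs(delayMs);
          delayMs = Math.min(delayMs * 2, 30000);  // back off, capped at 30s
        }
      }
    }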