A cure for "Topology is closed" and unresponsive app after Meteor 3 migration

We are still in the process of migrating our medium-sized (?) app to Meteor 3 but got derailed last month when deploying the Meteor 3 branch build to our staging environments (Kubernetes & ECS Fargate).

Most containers would go into a loop directly upon startup and spew stack traces like these multiple times per second:

MongoTopologyClosedError: Topology is closed
    at processWaitQueue (/opt/bundle/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/src/sdam/topology.ts:918:42)
    at Topology.selectServer (/opt/bundle/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/src/sdam/topology.ts:601:5)
    at tryOperation (/opt/bundle/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/src/operations/execute_operation.ts:190:31)
    at executeOperation (/opt/bundle/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/src/operations/execute_operation.ts:109:18)
    at runNextTicks (node:internal/process/task_queues:65:5)
    at listOnTimeout (node:internal/timers:555:9)
    at processTimers (node:internal/timers:529:7)
    at FindCursor._initialize (/opt/bundle/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/src/cursor/find_cursor.ts:72:22)
    at FindCursor.cursorInit (/opt/bundle/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/src/cursor/abstract_cursor.ts:727:21)
    at FindCursor.fetchBatch (/opt/bundle/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/src/cursor/abstract_cursor.ts:762:6)
    at FindCursor.next (/opt/bundle/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/src/cursor/abstract_cursor.ts:425:7)

The containers could still respond to HTTP requests, but DDP communication became unresponsive, so no method calls or subscriptions could be used.

After a lot of investigation we found the root cause and a solution today.

First some background:

When a Meteor app starts up, one of the first things it creates is the oplog tailer.

After the async method _startTailing has been kicked off, all of the remaining app startup code executes. In our case that took about 90 seconds, leaving very little idle time for the MongoDB driver to connect to the cluster as part of the _startTailing implementation and its calls to tail.
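To make the mechanism concrete, here is a deliberately extreme, minimal sketch (not our actual startup code): a single 90-second synchronous block stands in for the many small, rarely-yielding startup chunks, and the driver's default 30-second server selection timer expires before it ever gets an event-loop turn.

import { MongoClient } from "mongodb";

// MONGO_URL is a placeholder for your own connection string.
const client = new MongoClient(process.env.MONGO_URL, {
  serverSelectionTimeoutMS: 30_000, // the driver default
});

// connect() needs event-loop turns to open sockets and finish the
// handshake, but it only *starts* here.
const connecting = client.connect();

// Simulate ~90 seconds of CPU-bound startup work with no awaits:
// the driver's sockets and timers cannot be serviced meanwhile.
const deadline = Date.now() + 90_000;
while (Date.now() < deadline) {
  // busy loop
}

// By the time the loop finally yields, the 30-second selection timer
// has already expired, so this rejects with a server selection error.
await connecting;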

So what we found was that the oplog handle's connection would never connect properly because of the default 30-second server selection timeout, and Meteor's oplog handle implementation cannot recover from that.

The log messages that were emitted came from a never-ending retry loop inside a Meteor.defer call.
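Schematically, the failure mode looks something like this (an illustrative sketch, not Meteor's actual source; tailOplog is a hypothetical stand-in for the tailing query):

import { Meteor } from "meteor/meteor";

function retryTailing(tailOplog) {
  Meteor.defer(async () => {
    try {
      await tailOplog(); // hypothetical: re-runs the oplog cursor
    } catch (err) {
      // Once the topology is closed, every attempt fails immediately,
      // so the loop spins and logs several stack traces per second.
      console.error(err); // MongoTopologyClosedError: Topology is closed
      retryTailing(tailOplog);
    }
  });
}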

Anyhow, the simple cure was just to set a large enough serverSelectionTimeoutMS MongoDB connection option via Meteor.settings (METEOR_SETTINGS):

{
  "public": {
    "packages": {
      "dynamic-import": {
        "useLocationOrigin": true
      }
    }
  },
  "packages": {
    "mongo": {
      "options": {
        "serverSelectionTimeoutMS": 120000
      }
    }
  }
}
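serverSelectionTimeoutMS is also a standard MongoDB connection string option, so as an alternative it should be possible to append it to the connection URL instead (hostnames, credentials, and database name below are placeholders):

MONGO_URL=mongodb://user:pass@mongo-0,mongo-1/app?replicaSet=rs0&serverSelectionTimeoutMS=120000

Note that the oplog tailer connects via MONGO_OPLOG_URL, which would presumably need the same option if you go this route.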

Interesting, thanks for sharing @permb! I am curious as to why it is taking so long to connect to Mongo. Is it a separate staging database instance? Perhaps the updated mongo driver could be the culprit. Can you share more info on how your database instance is set up and connected to, so we can rule out some things?

One thing that comes to mind: if you replicate all your production data there on a significantly smaller instance, that could have an impact, but this is just a blind assumption. Was it connecting faster before Meteor 3.1?

It doesn’t really take long to connect in the normal case, but since Node.js is single-threaded it is sensitive to high CPU load without yields to the event loop. In this case, the startup code simply does not yield often enough for the driver to perform all of its network calls before the timeout.

In Meteor 2 I suspect there is a fiber-backed Promise.await (or something similar using a callback) much earlier in the process.
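From the application side, the underlying remedy would be to give the driver breathing room; a minimal sketch, assuming startup work can be chunked (runStartupTasks and its task list are hypothetical):

// Run synchronous, CPU-bound startup tasks while yielding one
// event-loop turn between them, so pending I/O (e.g. the driver's
// server selection and heartbeats) can make progress.
async function runStartupTasks(tasks) {
  for (const task of tasks) {
    task();
    await new Promise((resolve) => setImmediate(resolve));
  }
}

In practice, bumping serverSelectionTimeoutMS as above is the less invasive fix.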

Thanks for this, bro!

So you’re using Oplog Tailing in a production app? I think most serious production apps have that disabled and use Redis Oplog.

YMMV - we only use subscriptions in very few places and have not seen any real issues yet…

Depends on what you call a “serious production app” 🙂

We have production apps (enterprise-level apps rather than “public world” apps), and we are happily tailing the oplog without Redis Oplog. We don’t use client-side method stubs, though; all updates are done via method calls, but subscriptions are used extensively. Live data is very useful in situations where many people share the same data in real time and it can be updated by someone else, without having to hit the browser reload button or rely on polling to keep the data fresh on each user’s screen.

One that handles a couple of thousand concurrent users.

But yes, if concurrent scale isn’t an issue, I can understand just using out-of-the-box oplog tailing.

My app that I am serious about
