Meteor “login” method never invokes callback, blocks all future RPC calls

I ran into this login issue again this morning, and also a few days ago. Both times it was on a local server, and I had received MongoNetworkErrors before the issue started happening.

Here is the call stack:

I20181001-21:42:26.563(-4)? Exception in setInterval callback: MongoNetworkError: connection 67 to xxxxx.servers.xxxxxxxxx.com:27017 timed out
I20181001-21:42:26.587(-4)?     at TLSSocket.<anonymous> (C:\Users\######\AppData\Local\.meteor\packages\npm-mongo\3.0.7\npm\node_modules\mongodb-core\lib\connection\connection.js:258:7)
I20181001-21:42:26.587(-4)?     at Object.onceWrapper (events.js:313:30)
I20181001-21:42:26.587(-4)?     at emitNone (events.js:106:13)
I20181001-21:42:26.588(-4)?     at TLSSocket.emit (events.js:208:7)
I20181001-21:42:26.588(-4)?     at TLSSocket.Socket._onTimeout (net.js:410:8)
I20181001-21:42:26.588(-4)?     at ontimeout (timers.js:498:11)
I20181001-21:42:26.588(-4)?     at tryOnTimeout (timers.js:323:5)
I20181001-21:42:26.588(-4)?     at Timer.listOnTimeout (timers.js:290:5)
I20181001-21:42:26.617(-4)?  => awaited here:
I20181001-21:42:26.618(-4)?     at Function.Promise.await (C:\Users\######\AppData\Local\.meteor\packages\promise\0.11.1\npm\node_modules\meteor-promise\promise_server.js:56:12)
I20181001-21:42:26.619(-4)?     at Promise.asyncApply (packages/mongo/mongo_driver.js:1034:11)
I20181001-21:42:26.620(-4)?     at C:\Users\######\AppData\Local\.meteor\packages\promise\0.11.1\npm\node_modules\meteor-promise\fiber_pool.js:43:40
I20181001-21:42:26.621(-4)?  => awaited here:
I20181001-21:42:26.622(-4)?     at Promise.await (C:\Users\######\AppData\Local\.meteor\packages\promise\0.11.1\npm\node_modules\meteor-promise\promise_server.js:60:12)
I20181001-21:42:26.623(-4)?     at SynchronousCursor._nextObject (packages/mongo/mongo_driver.js:1083:12)
I20181001-21:42:26.623(-4)?     at SynchronousCursor.forEach (packages/mongo/mongo_driver.js:1097:22)
I20181001-21:42:26.637(-4)?     at SynchronousCursor.map (packages/mongo/mongo_driver.js:1107:10)
I20181001-21:42:26.637(-4)?     at SynchronousCursor.fetch (packages/mongo/mongo_driver.js:1131:17)
I20181001-21:42:26.640(-4)?     at Cursor.(anonymous function) (packages/mongo/mongo_driver.js:879:44)
I20181001-21:42:26.640(-4)?     at Cursor.cursorProto.(anonymous function) (packages\meteorhacks_kadira.js:3352:32)
I20181001-21:42:26.641(-4)?     at Cursor.kadira_Cursor_fetch [as fetch] (packages\meteorhacks_kadira.js:3744:32)
I20181001-21:42:26.641(-4)?     at Object.<anonymous> (packages\matb33_collection-hooks.js:445:81)
I20181001-21:42:26.642(-4)?     at Object.collection.(anonymous function) [as update] (packages\matb33_collection-hooks.js:146:21)
I20181001-21:42:26.643(-4)?     at Collection.update (packages/mongo/collection.js:580:31)
I20181001-21:42:26.643(-4)?     at Collection.Mongo.Collection.(anonymous function) [as update] (packages/aldeed:collection2/collection2.js:213:19)
I20181001-21:42:26.644(-4)?     at AccountsServer.Ap._expireTokens (packages/accounts-base/accounts_server.js:1129:14)
I20181001-21:42:26.645(-4)?     at packages/accounts-base/accounts_server.js:1218:14
I20181001-21:42:26.645(-4)?     at Meteor.EnvironmentVariable.EVp.withValue (packages\meteor.js:1189:12)
I20181001-21:42:26.646(-4)?     at packages\meteor.js:505:25
I20181001-21:42:26.646(-4)?     at runWithEnvironment (packages\meteor.js:1241:24)

I too am having issues - for me this is the problem:

@evolross @zcompton

do either of you have readPreference=primary on your oplog URL? I added that and this problem went away - I tested it pretty rigorously in a dev environment, and just tested with a graceful primary stepdown in production. I thought readPreference=primary was the implicit default, but it's possible Meteor does something different when the read preference is explicitly defined.
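
For reference, the change is just an extra query parameter on the oplog connection string; the host and credentials below are placeholders, not values from this thread:

    # hypothetical example - append readPreference=primary to your existing oplog URL
    MONGO_OPLOG_URL=mongodb://oplog-user:<password>@rs0.example.com:27017/local?authSource=admin&readPreference=primary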

I’m not using the oplog URL. I use redis-oplog. I haven’t seen this problem in a while. But I’m pretty sure I experienced the problem while using redis-oplog so I’m not sure if it’s oplog URL related.

We experience the problem even when using redis-oplog, so it's not URL-related.

We do not specify any read preference in our Mongo URLs, which means it should default to primary.

Is anyone still experiencing this? I just had it happen again today in production on one of our Galaxy containers. I get the following error over and over for each query on that container - it looks like it affects all our polling queries. We don't use the oplog; we use redis-oplog, MongoDB Atlas, and a few polling queries:

9vrwg - Exception while polling query ... MongoError: connection 97 to <server_address> timed out
9vrwg - 2019-02-27 11:15:43-08:00 at Object.Future.wait (/app/bundle/programs/server/node_modules/fibers/future.js:449:15)
9vrwg - 2019-02-27 11:15:43-08:00 at [object Object]._.extend._nextObject (packages/mongo/mongo_driver.js:1053:47)
9vrwg - 2019-02-27 11:15:43-08:00 at [object Object]._.extend.forEach (packages/mongo/mongo_driver.js:1087:22)
9vrwg - 2019-02-27 11:15:43-08:00 at [object Object]._.extend.getRawObjects (packages/mongo/mongo_driver.js:1136:12)
9vrwg - 2019-02-27 11:15:43-08:00 at [object Object]._.extend._pollMongo (packages/mongo/polling_observe_driver.js:153:44)
9vrwg - 2019-02-27 11:15:43-08:00 at Object.task (packages/mongo/polling_observe_driver.js:90:12)
9vrwg - 2019-02-27 11:15:43-08:00 at [object Object]._.extend._run (packages/meteor.js:811:18)
9vrwg - 2019-02-27 11:15:43-08:00 at packages/meteor.js:789:14
9vrwg - 2019-02-27 11:15:43-08:00 - - - - -
9vrwg - 2019-02-27 11:15:43-08:00 at Function.MongoError.create (/app/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/error.js:29:11)
9vrwg - 2019-02-27 11:15:43-08:00 at TLSSocket.<anonymous> (/app/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/connection/connection.js:186:20)
9vrwg - 2019-02-27 11:15:43-08:00 at TLSSocket.g (events.js:260:16)
9vrwg - 2019-02-27 11:15:43-08:00 at emitNone (events.js:67:13)
9vrwg - 2019-02-27 11:15:43-08:00 at TLSSocket.emit (events.js:166:7)
9vrwg - 2019-02-27 11:15:43-08:00 at TLSSocket.Socket._onTimeout (net.js:336:8)
9vrwg - 2019-02-27 11:15:43-08:00 at _runOnTimeout (timers.js:537:11)
9vrwg - 2019-02-27 11:15:43-08:00 at _makeTimerTimeout (timers.js:528:3)
9vrwg - 2019-02-27 11:15:43-08:00 at Timer.unrefTimeout (timers.js:597:5)

Does anyone have any leads on this?

We just had this happen to one user twice in production: once last week and once this week. Last week we weren't using the oplog, but this week we are, so it probably isn't related to the oplog.

I am using MontiAPM and there are no errors or any indications whatsoever.

Are there any updates on why this is happening? I'm getting this behavior for a couple of methods in our production environment. It seems you must have an active subscription to the collection to make it happen.

I've spent the entire past week trying to figure this out, without success. As everyone pointed out, it's just the "updated" message from DDP that never gets received. My subscription reloads properly, but the callback never fires.

We use Meteor.callPromise all over the place. I changed one of the problematic methods to a raw Meteor.call, but it made no difference.
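
To make the symptom concrete, here is a minimal sketch of both call styles; the method name and payload are made up, and Meteor.callPromise is assumed to come from a promise wrapper package such as deanius:promise:

    import { Meteor } from 'meteor/meteor';

    const payload = { /* method arguments */ };

    // Hypothetical method name - substitute one of the affected methods.
    Meteor.call('orders.submit', payload, (err, res) => {
      // With this bug the server finishes the method and the subscription
      // re-publishes the change, but the DDP messages that complete the call
      // never reach the client, so this callback is never invoked.
      console.log('method finished:', err, res);
    });

    // The promise form hangs in exactly the same way:
    (async () => {
      const result = await Meteor.callPromise('orders.submit', payload); // never resolves
      console.log('promise form finished:', result);
    })();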

Some points:

  • I could not reproduce this in development or in our staging environment.
  • It takes roughly one hour after a deploy for the error to appear.
  • Running on AWS (Atlas)
  • Using default oplog tailing for all subscriptions (see the polling sketch after this list)
  • Using the following MONGO_DB_URL: mongodb+srv://<user>:<password>@<address>.mongodb.net/<db>?retryWrites=true&w=majority
  • Using the following MONGO_OPLOG_URL: mongodb+srv://<oplog-user>:<password>@<address>.mongodb.net/local?retryWrites=true&w=majority
  • Meteor version: 1.9
  • Mongo version: 3.6.17
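
On the oplog-tailing point: Meteor does let you force an individual query back to the polling driver, which is one way to check whether oplog tailing itself is involved. A minimal sketch, with a made-up collection and publication name:

    import { Meteor } from 'meteor/meteor';
    import { Mongo } from 'meteor/mongo';

    const Orders = new Mongo.Collection('orders'); // hypothetical collection

    Meteor.publish('orders.recent', function () {
      // disableOplog and pollingIntervalMs are standard server-side find()
      // options: this observer uses the polling driver instead of oplog
      // tailing, so you can compare behavior under the same load.
      return Orders.find({}, { disableOplog: true, pollingIntervalMs: 10000 });
    });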

I switched to redis-oplog, and the problems went away after this - not sure if it is directly related.

Has anyone solved this issue? We're having it intermittently in production.

We have the same problem; any help would be highly appreciated.

I wonder if this issue is related in any way to:

  • number of registered users
  • the same issue in this thread: Slow mongo query on createUser
  • whether this happens on login with email, with a username, or with a certain social provider. Should we do a poll?

This still happens in the respective apps of both @andregoldstein and @waldgeist. We have found no resolution quite yet.

  • In both cases the error occurs exclusively in production and only after several days of server uptime.
  • It doesn’t seem to be related to login in either of these apps.
  • There is some indication in one of the apps that switching to redis-oplog solved the problem, but it's too early to tell for sure.

This is a difficult issue to trace since many factors are at play in production. I have a deployment that has been running for 11 months now with pub/sub and constant use and not a single restart, so I'm wondering what the difference is with those apps.

  1. Are you sure both of your apps are hitting the same error path?
  2. Is it specific to a Meteor version/update?
  3. Is it specific to a deployment environment?
  4. How would someone test and isolate this systematically?
  5. Why do method calls seem to stop working after considerable uptime? Is it RAM or CPU exhaustion? All threads getting blocked? A certain code path being triggered? A third-party provider getting stuck? Does the server heartbeat still work? It is really hard to tell.

I think the problem statement in this thread is vague. It is hard to tell whether these reports are even caused by the same underlying problem, and it seems no one is able to reproduce it systematically.

I think the best way to isolate this is to stress-test one of those apps, let it hang when the error hits, and then try to narrow down and isolate the underlying cause.
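
Something like the following client-side watchdog might help catch the moment it happens; this is only a sketch, and 'ping' is a made-up trivial method you would add on the server for the test:

    import { Meteor } from 'meteor/meteor';

    // Server side (hypothetical): Meteor.methods({ ping: () => true });
    // keeps the round trip down to just the DDP method pipeline.

    // Client side: call it every 10 seconds and flag the moment the
    // callback stops coming back, which is the symptom in this thread.
    Meteor.setInterval(() => {
      const started = Date.now();
      let answered = false;

      Meteor.call('ping', (err) => {
        answered = true;
        console.log('ping round trip:', Date.now() - started, 'ms', err ? err.message : 'ok');
      });

      Meteor.setTimeout(() => {
        if (!answered) {
          console.warn('ping callback still pending after 30s - methods look stuck');
        }
      }, 30 * 1000);
    }, 10 * 1000);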

BTW I think the same issue is discussed on GitHub: Meteor logged in but Meteor.loggingIn() stuck on true / initial login method works but subsequent methods never complete · Issue #11323 · meteor/meteor · GitHub

We solved our issue by converting to RedisOplog. My working theory is that there is a bug in Meteor's oplog handling that leads to dirty containers under heavy load, or something else going on in the DB. It kept happening to us randomly… sometimes quickly and sometimes after a few days of running with no issues. If anyone gets stuck on this, I recommend converting to RedisOplog; it wasn't too painful to get going.
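
For anyone considering the same switch, the conversion is roughly the following (a sketch based on the cultofcoders:redis-oplog README; the Redis host and port are placeholders for your own instance):

    meteor add cultofcoders:redis-oplog disable-oplog

and then point the package at your Redis server in settings.json, for example:

    {
      "redisOplog": {
        "redis": {
          "host": "127.0.0.1",
          "port": 6379
        }
      }
    }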

That’s good to hear.

But how did you rule out CPU exhaustion under heavy load? Also, why do you think this bug doesn't show up consistently for all apps using the oplog under heavy load? What is the difference here?

We had the problem sometimes with no CPU load. We tried scaling up the containers and it still happened, maybe even worse. It’s some kind of bug in the mongo-oplog code I think, but we only ever saw it in our prod environment on AWS.

But if it's a bug in the MongoDB oplog code, then surely many more people would have seen it, so something is not adding up.