Meteor “login” method never invokes callback, blocks all future RPC calls

Is anyone still experiencing this? I just had it happen again today in production on one of our Galaxy containers. I see the following error over and over for each query on that container; it looks like it affects all our polling queries. We don’t use the oplog. We use redis-oplog, MongoDB Atlas, and a few polling queries:

9vrwg - Exception while polling query ... MongoError: connection 97 to <server_address> timed out
9vrwg - 2019-02-27 11:15:43-08:00 at Object.Future.wait (/app/bundle/programs/server/node_modules/fibers/future.js:449:15)
9vrwg - 2019-02-27 11:15:43-08:00 at [object Object]._.extend._nextObject (packages/mongo/mongo_driver.js:1053:47)
9vrwg - 2019-02-27 11:15:43-08:00 at [object Object]._.extend.forEach (packages/mongo/mongo_driver.js:1087:22)
9vrwg - 2019-02-27 11:15:43-08:00 at [object Object]._.extend.getRawObjects (packages/mongo/mongo_driver.js:1136:12)
9vrwg - 2019-02-27 11:15:43-08:00 at [object Object]._.extend._pollMongo (packages/mongo/polling_observe_driver.js:153:44)
9vrwg - 2019-02-27 11:15:43-08:00 at Object.task (packages/mongo/polling_observe_driver.js:90:12)
9vrwg - 2019-02-27 11:15:43-08:00 at [object Object]._.extend._run (packages/meteor.js:811:18)
9vrwg - 2019-02-27 11:15:43-08:00 at packages/meteor.js:789:14
9vrwg - 2019-02-27 11:15:43-08:00 - - - - -
9vrwg - 2019-02-27 11:15:43-08:00 at Function.MongoError.create (/app/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/error.js:29:11)
9vrwg - 2019-02-27 11:15:43-08:00 at TLSSocket.<anonymous> (/app/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/connection/connection.js:186:20)
9vrwg - 2019-02-27 11:15:43-08:00 at TLSSocket.g (events.js:260:16)
9vrwg - 2019-02-27 11:15:43-08:00 at emitNone (events.js:67:13)
9vrwg - 2019-02-27 11:15:43-08:00 at TLSSocket.emit (events.js:166:7)
9vrwg - 2019-02-27 11:15:43-08:00 at TLSSocket.Socket._onTimeout (net.js:336:8)
9vrwg - 2019-02-27 11:15:43-08:00 at _runOnTimeout (timers.js:537:11)
9vrwg - 2019-02-27 11:15:43-08:00 at _makeTimerTimeout (timers.js:528:3)
9vrwg - 2019-02-27 11:15:43-08:00 at Timer.unrefTimeout (timers.js:597:5)

Does anyone have any leads on this?

We just had this happen to one user twice in production: once last week and once this week. Last week we weren’t using the oplog, but this week we are, so it probably isn’t related to the oplog.

I am using MontiAPM and there are no errors or any indications whatsoever.

Are there any updates on why this is happening? I’m getting this behavior for a couple of methods in our production environment. It seems you must have an active subscription to the collection to make it happen.

I’ve spent the entire past week trying to figure this out, without success. As everyone pointed out, it’s just the “updated” message from DDP that never gets received. My subscription reloads properly, but the callback never fires.

We use Meteor.callPromise all over the place. I changed one of the problematic methods to a raw Meteor.call, but that made no difference.
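
For what it’s worth, one way to at least surface these stuck calls in client logs is to race the method against a timer. This is purely a diagnostic sketch, not a fix; the helper name, the 30-second limit, and the hypothetical method name in the usage comment are my own assumptions:

```js
import { Meteor } from 'meteor/meteor';

// Diagnostic sketch (assumed helper, not part of Meteor): wrap Meteor.call in a
// promise with a watchdog timer so methods whose callback never fires at least
// show up as rejections in client-side logs.
export function callWithTimeout(name, args = [], timeoutMs = 30000) {
  return new Promise((resolve, reject) => {
    let settled = false;
    const timer = setTimeout(() => {
      if (settled) return;
      settled = true;
      reject(new Error(`Method '${name}' got no callback within ${timeoutMs} ms`));
    }, timeoutMs);

    Meteor.call(name, ...args, (error, result) => {
      if (settled) return; // the watchdog already fired; ignore the late callback
      settled = true;
      clearTimeout(timer);
      if (error) reject(error); else resolve(result);
    });
  });
}

// Usage (hypothetical method name):
// callWithTimeout('orders.create', [payload]).catch((e) => console.error(e));
```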

Some points:

  • I could not reproduce this in either the development or staging environment.
  • It takes roughly 1 hour after a deploy to see the error.
  • Running on AWS (Atlas)
  • Using default oplog tailing for all subscriptions
  • Using the following MONGO_DB_URL: mongodb+srv://<user>:<password>@<address>.mongodb.net/<db>?retryWrites=true&w=majority
  • Using the following MONGO_OPLOG_URL: mongodb+srv://<oplog-user>:<password>@<address>.mongodb.net/local?retryWrites=true&w=majority
  • Meteor version: 1.9
  • Mongo version: 3.6.17

I switched to redis oplog, and the problems went away after this - not sure if it is directly related.

Has anyone solved this issue? We’re having it intermittently in production.

We have the same problem; any help would be highly appreciated.

I wonder if this issue is related in any way to:

  • number of registered users
  • the same issue in this thread: Slow mongo query on createUser
  • whether this happens on login with email, with username, or with a certain social provider. Should we switch to polling? (See the polling sketch just below.)
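
If anyone wants to test the polling angle, Meteor lets you force the polling observe driver per cursor with the server-side `disableOplog` find option (plus `pollingIntervalMs` to tune frequency). A minimal sketch, with a hypothetical collection and publication name:

```js
import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';

// Hypothetical collection, used only for illustration.
const Orders = new Mongo.Collection('orders');

if (Meteor.isServer) {
  Meteor.publish('orders.mine', function () {
    // disableOplog forces this cursor onto the polling observe driver;
    // pollingIntervalMs controls how often the query is re-run (server-only options).
    return Orders.find(
      { userId: this.userId },
      { disableOplog: true, pollingIntervalMs: 10 * 1000 }
    );
  });
}
```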

This still happens in the respective apps of both @andregoldstein and @waldgeist. We have found no resolution quite yet.

  • In both cases the error occurs exclusively in production and only after several days of server uptime.
  • It doesn’t seem to be related to login in either of these apps.
  • There is some indication in one of the apps that having switched to redis oplog solved the problem, but it’s too early to tell for sure.

This is a difficult issue to trace since many factors are at play in production. I have a deployment running for 11 months now with pub/sub and constant use and not a single restart. So I’m wondering what the difference is with those apps.

  1. Are you sure both of your apps are hitting the same error path?
  2. Is it specific to a Meteor version/update?
  3. Is it specific to a deployment environment?
  4. How would someone test and isolate this systematically?
  5. Why does the method call seem to stop working after considerable uptime? Is it RAM or CPU exhaustion? All threads getting blocked? A certain code path being triggered? A third-party provider getting stuck? Does the server heartbeat still work? It is really hard to tell.

I think the problem statements in this thread are vague. It is hard to tell whether they are even caused by the same underlying problem. And it seems no one is able to reproduce it systematically.

I think the best way to isolate this is to stress-test one of those apps, let it hang when the error hits, and then try to narrow down and isolate the underlying cause.


BTW I think the same issue is discussed on GitHub: Meteor logged in but Meteor.loggingIn() stuck on true / initial login method works but subsequent methods never complete · Issue #11323 · meteor/meteor · GitHub

We solved our issue by converting to RedisOplog. My working theory is that there is a bug in Meteor’s oplog handling that leads to dirty containers under heavy load, or something else going on in the DB. It kept happening to us randomly… sometimes quickly and sometimes after a few days of running with no issues. If anyone gets stuck on this, I recommend converting to RedisOplog; it wasn’t too painful to get going.
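
For anyone considering the same switch: assuming the community cultofcoders:redis-oplog package (which is what most people in this thread appear to mean by “redis oplog”), the conversion is roughly `meteor add cultofcoders:redis-oplog disable-oplog` plus a Redis block in your settings file. The host, port, and file name below are placeholders:

```json
{
  "redisOplog": {
    "redis": {
      "host": "127.0.0.1",
      "port": 6379
    }
  }
}
```

Then start the app with that settings file (e.g. `meteor run --settings settings.json`) so the package picks up the `redisOplog` key at startup.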

That’s good to hear.

But how did you rule out CPU exhaustion under heavy load? Also, why do you think this bug doesn’t show up consistently for all apps using the oplog under heavy load? What is the difference here?

We had the problem sometimes with no CPU load. We tried scaling up the containers and it still happened, maybe even worse. It’s some kind of bug in the mongo-oplog code I think, but we only ever saw it in our prod environment on AWS.

But if it is a bug in the MongoDB oplog code, then surely many others would have seen it, so something is not adding up.

It hasn’t been figured out yet what makes this bug appear in some apps and not in others. High CPU has been under suspicion, but it could not be confirmed.

It was reported by some that introducing redis oplog solved the issue for them.

@andregoldstein’s app was initially heavily affected by the bug (only in production and only after many days of uninterrupted use). He switched to redis oplog, which seemed to solve the issue at first sight. However, he told me on Friday that some of it seems to have persisted: although the entire app doesn’t freeze up anymore, some of his subscriptions still get stuck sometimes and never return at all.

Hi all, saw this is getting some activity again both here and on Github. I’m the original poster and figured I’d provide an update.

We are still experiencing this (rarely, but it happens) despite a lot of changes and optimizations to our setup. A few notes on our setup:

  1. We are not using standard Meteor oplog. We’re using redis-oplog and have been for over 2 years now. While it may relate to multiplexer code, the original oplog driver code isn’t in use for our codebase.

  2. We are not using a second server-side DDP connection like TheRealNate mentioned here.

  3. We most frequently see it happen when compute times ramp up on the server. Think of something that might be blocking the event loop. For us, we found that certain rare operations would block the event loop long enough that many other DDP invocations sent from clients connected to that server would get backed up. Eventually the server “recovers” in that its CPU and memory return to normal, but any new or existing DDP connections to that server stay stuck in the dead state I originally described. (A small event-loop lag monitor sketch follows.)
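
A cheap way to confirm the “blocked event loop” theory in production is a lag monitor that logs whenever a timer fires much later than scheduled. This is a generic Node pattern, not something Meteor ships; the interval and threshold below are arbitrary:

```js
import { Meteor } from 'meteor/meteor';

// Event-loop lag monitor (assumed thresholds): if a 500 ms interval fires more than
// 200 ms late, something synchronous was hogging the event loop for roughly that long.
Meteor.startup(() => {
  const INTERVAL_MS = 500;
  const WARN_ABOVE_MS = 200;
  let last = Date.now();

  setInterval(() => {
    const now = Date.now();
    const lag = now - last - INTERVAL_MS;
    if (lag > WARN_ABOVE_MS) {
      console.warn(`[lag-monitor] event loop was blocked for ~${lag} ms`);
    }
    last = now;
  }, INTERVAL_MS);
});
```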

We also have seen this happen if our Mongo cluster gets backed up - similar to DDP invocations getting backed up when the event loop is blocked, servers can sometimes run into the same issue with Mongo slowness or large data sets being returned.

Filipe’s comment on the Github issue about a Fiber running but not returning is close to the conclusion we came to as well, but the issue is very difficult to catch in real-time or replicate. At this point we’re just very cognizant and cautious about how we handle large datasets to avoid this.


Sorry, I should mention that we’ve also seen this happen quite often in the event of a Mongo failover or Mongo configuration change. Sometimes our Meteor servers can handle this, but in the unfortunate event that a failover or configuration change happens while we’re under decently heavy load, we can see anywhere from 20-50% of our Meteor servers experience this.


Thanks @typ for the insightful update.

Actually, now that I recall, I’ve seen something like that happen when using an older version of MongoDB during a stress test. The issue there was that MongoDB was locking, and I guess so many requests got queued that it caused the Meteor server to hang.

My guess is that either an API/MongoDB call or some other I/O is not returning, or not returning fast enough under traffic, which eventually exhausts the Meteor threads and causes the server to get stuck. And if the load persists, the issue gets worse as more requests get queued. For those seeing improvement after switching to redis-oplog, I think the DB could be the issue, but it could be something else, it seems, like another DDP server, an external API, etc. I don’t know whether the Meteor/Node server has any defence, timeout, or recovery logic for scenarios like this; I doubt it, unless someone knows more. It seems that once the server reaches this state, it is dead in the water and requires a manual restart. So I guess for now one can only hope for protective measures, i.e. decreasing the chance of hitting this state by ensuring there is enough CPU on the DB and the app servers, and being careful when making I/O calls.
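
One speculative protective measure, sketched under the assumption that a wedged container should simply be restarted (nothing official; the method name, env var, and timeouts below are all made up):

```js
import { Meteor } from 'meteor/meteor';
import { DDP } from 'meteor/ddp-client';

// Hypothetical self-watchdog: the server calls a trivial method on itself over DDP.
// If the callback never fires within PING_TIMEOUT_MS, the process exits so the
// hosting platform (Galaxy, ECS, ...) replaces the presumably-stuck container.
// Note: ROOT_URL may point at a load balancer rather than this specific container;
// in practice you would want a per-container address.
Meteor.methods({
  'watchdog.ping'() {
    return 'pong';
  },
});

Meteor.startup(() => {
  const PING_EVERY_MS = 2 * 60 * 1000;
  const PING_TIMEOUT_MS = 60 * 1000;
  const selfUrl = process.env.WATCHDOG_URL || process.env.ROOT_URL;
  if (!selfUrl) return;

  const connection = DDP.connect(selfUrl);
  Meteor.setInterval(() => {
    let answered = false;
    connection.call('watchdog.ping', () => { answered = true; });
    Meteor.setTimeout(() => {
      if (!answered) {
        console.error('[watchdog] watchdog.ping never returned; exiting for a restart');
        process.exit(1);
      }
    }, PING_TIMEOUT_MS);
  }, PING_EVERY_MS);
});
```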


@alawi yeah, protective measures have been the only thing which has helped us. Redis oplog has been huge for that and I highly recommend trying it for anyone who hasn’t.

We have some load tests which simulate our common user interactions, so maybe we’ll try setting up the load tests against a staging environment and manually triggering a Mongo failover or configuration change to see if we can force the “stuck state”.

I’ll update here if we go that route as it might lead to a consistent replication of the issue, which I know the Meteor team will 100% need if they’re to even attempt solving this.


That sounds promising, but perhaps there is an easier way to get to that “stuck state”: something like reducing the fibers thread pool size and force-locking the DB under stress load, making it easier to reach that state systematically. Like you said, getting a systematic reproduction solves half of the problem. I guess the solution would be cleanup and recovery logic added to the server once this state is detected. Also, it is not clear (at least to me) where in the stack things are potentially getting stuck (Node? Fibers? Meteor code?); perhaps someone more familiar with the code can provide more insight here.
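
Along those lines, a crude reproduction aid might be a debug-only method that synchronously busy-loops, blocking the event loop while a load test hammers the app. Purely a sketch under those assumptions (the method name is made up; not something to ship to production):

```js
import { Meteor } from 'meteor/meteor';
import { check } from 'meteor/check';

// Reproduction aid only: synchronously spin for `ms` milliseconds, blocking the Node
// event loop so other DDP invocations on this container back up, similar to the
// "rare expensive operations" described earlier in the thread.
Meteor.methods({
  'debug.blockEventLoop'(ms) {
    check(ms, Number);
    const end = Date.now() + ms;
    while (Date.now() < end) {
      // busy-wait: nothing here yields, so nothing else on this process makes progress
    }
    return ms;
  },
});
```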
