Meteor “login” method never invokes callback, blocks all future RPC calls

I’m not using the oplog URL; I use redis-oplog. I haven’t seen this problem in a while, but I’m fairly sure I experienced it while using redis-oplog, so I’m not sure whether it’s related to the oplog URL.

We experience the problem even when using redis-oplog, so it’s not URL-related.

We do not specify any read preference in our Mongo URLs, which means it should default to primary.
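
For reference, this is roughly what an explicit read preference looks like in the connection string. The credentials and hostnames below are placeholders, not a real URL, and readPreference=primary only spells out the driver default rather than changing behavior:

```js
// Placeholder connection string for illustration only.
const EXAMPLE_MONGO_URL =
  'mongodb+srv://user:password@cluster0.example.mongodb.net/mydb' +
  '?retryWrites=true&w=majority&readPreference=primary';

console.log(EXAMPLE_MONGO_URL);
```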

Is anyone still experiencing this? It just happened again today in production on one of our Galaxy containers. We see the following error over and over for each query on that container; it looks like it affects all our polling queries. We don’t use the oplog. We use redis-oplog, MongoDB Atlas, and a few polling queries:

9vrwg - Exception while polling query ... MongoError: connection 97 to <server_address> timed out
9vrwg - 2019-02-27 11:15:43-08:00 at Object.Future.wait (/app/bundle/programs/server/node_modules/fibers/future.js:449:15)
9vrwg - 2019-02-27 11:15:43-08:00 at [object Object]._.extend._nextObject (packages/mongo/mongo_driver.js:1053:47)
9vrwg - 2019-02-27 11:15:43-08:00 at [object Object]._.extend.forEach (packages/mongo/mongo_driver.js:1087:22)
9vrwg - 2019-02-27 11:15:43-08:00 at [object Object]._.extend.getRawObjects (packages/mongo/mongo_driver.js:1136:12)
9vrwg - 2019-02-27 11:15:43-08:00 at [object Object]._.extend._pollMongo (packages/mongo/polling_observe_driver.js:153:44)
9vrwg - 2019-02-27 11:15:43-08:00 at Object.task (packages/mongo/polling_observe_driver.js:90:12)
9vrwg - 2019-02-27 11:15:43-08:00 at [object Object]._.extend._run (packages/meteor.js:811:18)
9vrwg - 2019-02-27 11:15:43-08:00 at packages/meteor.js:789:14
9vrwg - 2019-02-27 11:15:43-08:00 - - - - -
9vrwg - 2019-02-27 11:15:43-08:00 at Function.MongoError.create (/app/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/error.js:29:11)
9vrwg - 2019-02-27 11:15:43-08:00 at TLSSocket.<anonymous> (/app/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/connection/connection.js:186:20)
9vrwg - 2019-02-27 11:15:43-08:00 at TLSSocket.g (events.js:260:16)
9vrwg - 2019-02-27 11:15:43-08:00 at emitNone (events.js:67:13)
9vrwg - 2019-02-27 11:15:43-08:00 at TLSSocket.emit (events.js:166:7)
9vrwg - 2019-02-27 11:15:43-08:00 at TLSSocket.Socket._onTimeout (net.js:336:8)
9vrwg - 2019-02-27 11:15:43-08:00 at _runOnTimeout (timers.js:537:11)
9vrwg - 2019-02-27 11:15:43-08:00 at _makeTimerTimeout (timers.js:528:3)
9vrwg - 2019-02-27 11:15:43-08:00 at Timer.unrefTimeout (timers.js:597:5)

Does anyone have any leads on this?

We just had this happen to one user twice in production: once last week and once this week. Last week we weren’t using the oplog but this week we are, so it probably isn’t oplog-related.

I am using MontiAPM and there are no errors or any indications whatsoever.

Are there any updates on why this is happening? I’m getting this behavior for a couple of methods in our production environment. It seems you must have an active subscription to the collection to trigger it.

I’ve spent the entire past week trying to figure this out, without any success. As others have pointed out, it’s just the “updated” message from DDP that never gets received. My subscription reloads properly, but the callback never fires.
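
For anyone trying to confirm this on the client, a rough debugging sketch like the one below logs the raw DDP frames so you can see whether “result” and “updated” ever arrive for a stuck call. It leans on Meteor’s undocumented _stream internals, so treat the event name and payload shape as assumptions that may vary between versions:

```js
import { Meteor } from 'meteor/meteor';

// Client-side only. Meteor.connection._stream is an internal API; the
// assumption here is that it emits 'message' with the raw DDP frame as JSON.
Meteor.connection._stream.on('message', (raw) => {
  const msg = JSON.parse(raw);
  if (msg.msg === 'result' || msg.msg === 'updated') {
    // 'result' carries the method return value (keyed by msg.id);
    // 'updated' lists the method ids whose writes have been applied (msg.methods).
    console.log('DDP', msg.msg, msg.id || msg.methods);
  }
});
```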

We use Meteor.callPromise all over the place. I changed one of the problematic methods to a raw Meteor.call, but it made no difference.
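
For context, the two call styles we compared look roughly like this. The method name is made up, and Meteor.callPromise here stands in for a promise wrapper package rather than anything in core Meteor:

```js
import { Meteor } from 'meteor/meteor';

// Promise-style call via the callPromise wrapper used app-wide.
async function shipOrderViaPromise(orderId) {
  return Meteor.callPromise('orders.markShipped', orderId);
}

// Plain core Meteor.call with a callback. When the bug hits, this callback
// simply never fires: no error, no result.
function shipOrderViaCallback(orderId) {
  Meteor.call('orders.markShipped', orderId, (error, result) => {
    console.log('orders.markShipped returned', error, result);
  });
}
```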

Some points:

  • I could not reproduce this in development or staging environments.
  • It takes roughly 1 hour after a deploy for the error to appear.
  • Running on AWS (Atlas)
  • Using default oplog tailing for all subscriptions (a startup check for this is sketched after this list)
  • Using the following MONGO_URL: mongodb+srv://<user>:<password>@<address>.mongodb.net/<db>?retryWrites=true&w=majority
  • Using the following MONGO_OPLOG_URL: mongodb+srv://<oplog-user>:<password>@<address>.mongodb.net/local?retryWrites=true&w=majority
  • Meteor version: 1.9
  • Mongo version: 3.6.17
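
As a sanity check that oplog tailing is actually active with that MONGO_OPLOG_URL, something along these lines can be run on the server at startup. It pokes at Meteor internals (_oplogHandle), so treat the property name as an assumption:

```js
import { Meteor } from 'meteor/meteor';
import { MongoInternals } from 'meteor/mongo';

Meteor.startup(() => {
  // defaultRemoteCollectionDriver().mongo is Meteor's MongoConnection;
  // _oplogHandle (internal) is only set when MONGO_OPLOG_URL was usable.
  const mongoConnection = MongoInternals.defaultRemoteCollectionDriver().mongo;
  console.log('Oplog tailing active:', Boolean(mongoConnection._oplogHandle));
});
```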

I switched to redis oplog, and the problems went away after this - not sure if it is directly related.

Has anyone solved this issue? We’re having it intermittently in production.

We have the same problem; any help would be highly appreciated.

I wonder if this issue is related in any way to:

  • number of registered users
  • the same issue in this thread: Slow mongo query on createUser
  • whether this happens on login with email, with username, or with a certain social provider. Should we fall back to polling? (A per-publication polling fallback is sketched after this list.)
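
On the polling question: Meteor does let you opt a single publication’s cursor out of oplog tailing and force polling. A minimal sketch, with a made-up publication name and arbitrary interval:

```js
import { Meteor } from 'meteor/meteor';

Meteor.publish('currentUserProfile', function () {
  if (!this.userId) {
    return this.ready();
  }
  // disableOplog forces this observer onto the polling driver;
  // pollingIntervalMs controls how often the query is re-run.
  return Meteor.users.find(
    { _id: this.userId },
    { fields: { profile: 1 }, disableOplog: true, pollingIntervalMs: 10000 }
  );
});
```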

This still happens in the respective apps of both @andregoldstein and @waldgeist. We have found no resolution quite yet.

  • In both cases the error occurs exclusively in production and only after several days of server uptime.
  • It doesn’t seem to be related to login in either of these apps.
  • There is some indication in one of the apps that having switched to redis oplog solved the problem, but it’s too early to tell for sure.

This is a difficult issue to trace since many factors are at play in production. I have a deployment running for 11 months now with pub/sub and constant use and not a single restart, so I’m wondering what the difference is with those apps.

  1. Are you sure both of your apps are hitting the same error path?
  2. Is it specific to a Meteor version/update?
  3. Is it specific to a deployment environment?
  4. How would someone test and isolate this systematically?
  5. Why does the method call seem to stop working after considerable uptime? Is it RAM or CPU exhaustion? All threads getting blocked? A certain code path being triggered? A third-party provider getting stuck? Does the server heartbeat work? It is really hard to tell.

I think the problem statement in this thread is vague. It is hard to tell whether these reports are even caused by the same underlying problem, and it seems no one is able to reproduce it reliably.

I think the best way to isolate this is to stress-test one of those apps, let it hang when this error hits, and then try to narrow down and isolate the underlying cause.

BTW I think the same issue is discussed on GitHub: Meteor logged in but Meteor.loggingIn() stuck on true / initial login method works but subsequent methods never complete · Issue #11323 · meteor/meteor · GitHub

We solved our issue by converting to RedisOplog. My working theory is that there is a bug in Meteor’s oplog polling that leads to dirty containers under heavy load, or something else going on in the DB. It kept happening to us randomly… sometimes quickly and sometimes after a few days of running with no issues. If anyone gets stuck on this I recommend converting to RedisOplog; it wasn’t too painful to get going.
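
For anyone attempting the same switch, the rough shape of it is below. The package names follow the cultofcoders:redis-oplog README, and the Redis host/port are placeholders for your own instance:

```js
// 1) meteor add cultofcoders:redis-oplog disable-oplog
//    (the README also suggests listing cultofcoders:redis-oplog near the top
//     of .meteor/packages so it loads before other packages)
// 2) point it at your Redis instance via Meteor settings, e.g. settings.json:
const exampleSettings = {
  redisOplog: {
    redis: { host: '127.0.0.1', port: 6379 },
  },
};

// 3) start the app with: meteor run --settings settings.json
console.log(JSON.stringify(exampleSettings, null, 2));
```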

That’s good to hear.

But how did you rule out CPU exhaustion under heavy load? Also, why do you think this bug doesn’t show up consistently for all apps using the oplog under heavy load? What is the difference here?

We had the problem sometimes with no CPU load. We tried scaling up the containers and it still happened, maybe even worse. It’s some kind of bug in the mongo-oplog code I think, but we only ever saw it in our prod environment on AWS.

But if it were a bug in the MongoDB oplog code then surely many would have seen it, so something is not adding up.

It hasn’t been figured out yet what makes this bug appear in some apps and not in others. High CPU has been under suspicion, but it could not be confirmed.

It was reported by some that introducing redis oplog solved the issue for them.

@andregoldstein’s app was initially heavily affected by the bug (only in production and only after many days of uninterrupted use). He switched to redis oplog, which seemed to solve the issue at first sight. However, he told me on Friday that some of it seems to have persisted: although the entire app doesn’t freeze up anymore, some of his subscriptions still get stuck sometimes and never return at all.
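
To put a number on “subscriptions getting stuck”, a simple client-side watchdog like the one below could confirm whether a handle ever becomes ready. The publication name and timeout are made up:

```js
import { Meteor } from 'meteor/meteor';

// Subscribe, then warn if the subscription is still not ready after 15 seconds.
const handle = Meteor.subscribe('ordersForCurrentUser');

setTimeout(() => {
  if (!handle.ready()) {
    console.warn('Subscription "ordersForCurrentUser" still not ready after 15s; possibly stuck');
  }
}, 15000);
```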

Hi all, saw this is getting some activity again both here and on Github. I’m the original poster and figured I’d provide an update.

We are still experiencing this (rarely, but it happens) despite a lot of changes and optimizations to our setup. A few notes on our setup:

  1. We are not using standard Meteor oplog. We’re using redis-oplog and have been for over 2 years now. While it may relate to multiplexer code, the original oplog driver code isn’t in use for our codebase.

  2. We are not using a 2nd server side DDP connection like TheRealNate mentioned here

  3. We most frequently see it happen when compute times ramp up on the server, i.e. when something is blocking the event loop. We found that certain rare operations would block the event loop long enough that many other DDP invocations sent from clients connected to that server would get backed up. Eventually the server “recovers” in that its CPU and memory return to normal, but any new or existing DDP connections to that server stay stuck in the dead state I originally described. (A crude event-loop lag monitor is sketched after this list.)
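
A rough server-side lag monitor along these lines can help spot that situation; the interval and warning threshold here are arbitrary:

```js
// Plain Node timers: if the callback fires much later than scheduled,
// something synchronous is hogging the event loop.
const INTERVAL_MS = 500;
const WARN_THRESHOLD_MS = 200;
let last = Date.now();

setInterval(() => {
  const now = Date.now();
  const lag = now - last - INTERVAL_MS;
  if (lag > WARN_THRESHOLD_MS) {
    console.warn(`Event loop lag of ~${lag}ms detected; DDP processing is likely backed up`);
  }
  last = now;
}, INTERVAL_MS);
```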

We also have seen this happen if our Mongo cluster gets backed up - similar to DDP invocations getting backed up when the event loop is blocked, servers can sometimes run into the same issue with Mongo slowness or large data sets being returned.
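
One general mitigation for the “large data sets” side is capping what a publication can return. A sketch with invented collection and field names:

```js
import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';

// Hypothetical collection, used only for illustration.
const Orders = new Mongo.Collection('orders');

Meteor.publish('orders.recent', function () {
  if (!this.userId) {
    return this.ready();
  }
  // Project only the fields the client needs and hard-cap the result size,
  // so a hot query cannot hand the server an unexpectedly large dataset.
  return Orders.find(
    { userId: this.userId },
    {
      fields: { status: 1, total: 1, createdAt: 1 },
      sort: { createdAt: -1 },
      limit: 200,
    }
  );
});
```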

Filipe’s comment on the Github issue about a Fiber running but not returning is close to the conclusion we came to as well, but the issue is very difficult to catch in real-time or replicate. At this point we’re just very cognizant and cautious about how we handle large datasets to avoid this.

Sorry, I should mention that we’ve also seen this happen quite often in the event of a Mongo failover or Mongo configuration change. Sometimes our Meteor servers can handle this, but in the unfortunate event that a failover or configuration change happens while we’re under decently heavy load, we can see anywhere from 20-50% of our Meteor servers experience this.
