Meteor loggingIn stuck on true even though user is logged in

Hi @waldgeist

What hosting provider is your app running on? Mine is hosted on Galaxy. Are you also seeing that Meteor.loggingIn() remains stuck on true?

Another strange thing: I deployed an identical app to Galaxy before, and it has been running for months. I’ve only started seeing these problems with a new instance of this codebase.

Hi @therealnate. No, I am running my instances on AWS. I did not check the loggingIn() state, so I cannot tell for sure. But the user was successfully logged in.

Apart from that: I was running two almost identical instances on AWS. One was affected by the problem (twice), while the other was not.

The problem went away as soon as I restarted the server. Yet it came back a couple of weeks later. There were theories that blocking methods caused this. But I am not really sure if this is the case, because all sessions were affected, not just the session of one particular user.

The problem also appeared cross-platform. We have two clients: one is based on Unity (so it’s basically a native app communicating via a custom DDP package), and the other is a regular Meteor web frontend based on React. When the problem showed up, all clients and all users were affected.

I have also seen that the issue stops upon restarting the server. In my case, however, I kill the container completely and create a new one.

I was running on Meteor 1.x but the error persisted after updating to 2.0. I’m also using Node 12.20.1 and NPM 6.14.8. Any overlap with you?

The first time the problem occurred, I stopped and restarted the AWS instance, since I suspected some hardware failure. But the second time, I just did mup restart, and the problem went away, too. This is also when I started to analyze it further, since it didn’t seem to be a coincidence anymore. My server runs on Meteor 1.10.2, with the Node version / Docker container recommended by mup for this version.

I think we should take advantage of the opportunity of the moment: if the bug is reproducible in your app in the development environment, it should also be possible to debug the server, thus pinpointing where things go wrong.

The assumption is that one of these two things occurs:

  1. One of the methods makes an API call (HTTP or similar) that never actually returns, so the corresponding Future keeps blocking that method. If this.unblock() was not called, the callback on the client never fires, and any subsequent method calls will be stuck too.
  2. There is a bug in the DDP handling code that sometimes causes the Future to not get cleared.

Both situations should be detectable with remote debugging, I guess.

No, it wasn’t reproducible in the dev environment. Not even in the staging environment. It only appeared in the prod environment. So I had to restart the server soon after it happened, but I did some analysis at that time.

Either this, or some exception is not being caught. I read somewhere else that an uncaught Promise rejection might get a whole Node server into a non-recoverable state.

The thing is: I am pretty sure that most of my methods are simple enough to not make this happen, especially those called on initial browser load. However, there might be some third-party package causing this. Dunno.

EDIT: Ah, sorry Peter, I didn’t see it was you responding to @therealnate. I thought he responded to me. :slight_smile:

No problem :wink:

I don’t think that a Promise can cause the problem. If there were an uncaught Promise error, it would be logged on the server console or, in the worst case, it would crash the server, but neither is happening.

Another scenario I tested out was to return a Promise in a method that never clears:

  return new Promise(() => {});

With the above code, the callback on the client for that method invocation never fires. Yet it does not affect any subsequent method calls, nor other users, and the server remains fully operable, except for that very method (which is broken by design).

I am also unable to reproduce it in dev or staging.

Correct me if I’m wrong, but the way Meteor works is that unless you call this.unblock(), the client won’t be able to call any other methods until the first one in the queue resolves.

@waldgeist, from the client side of a user experiencing the issue, have you tried looking at the websocket message log in the dev tools? This may be able to tell you if it’s a method call that’s not completing. In my case, I saw that the login method was indeed receiving a result, but nothing was posted to the websocket from the client after that (aside from the usual ping pong).
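
For anyone who wants to check this more systematically, here is a minimal sketch that scans frames copied out of the websocket inspector and lists method calls that are still open. It assumes the frames have already been unwrapped from the SockJS envelope and parsed into objects, and findPendingMethods is a hypothetical helper name, not part of Meteor. As far as I understand the client, a method callback only fires once both the result and the matching updated message have arrived, so the sketch tracks both.

```javascript
// Hypothetical helper: list DDP method calls that are still waiting for
// a "result" and/or an "updated" message from the server.
// `frames` is an array of already-parsed DDP messages in log order.
function findPendingMethods(frames) {
  const calls = new Map(); // method call id -> bookkeeping record

  for (const frame of frames) {
    if (frame.msg === 'method') {
      // Client -> server: a method invocation with a unique id.
      calls.set(frame.id, { method: frame.method, result: false, updated: false });
    } else if (frame.msg === 'result' && calls.has(frame.id)) {
      // Server -> client: return value (or error) for that id.
      calls.get(frame.id).result = true;
    } else if (frame.msg === 'updated') {
      // Server -> client: all data writes of these methods have been sent.
      for (const id of frame.methods || []) {
        if (calls.has(id)) calls.get(id).updated = true;
      }
    }
  }

  // Keep only the calls that are missing at least one of the two messages.
  return [...calls]
    .filter(([, call]) => !call.result || !call.updated)
    .map(([id, call]) => ({ id, ...call }));
}
```

An empty array would mean every method in the log was answered, which points away from a blocked method on the server and toward something on the client side.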

To confirm: I believe that switching out oplog for redis-oplog has made my problem disappear, even if we’re not entirely sure what the bug was.

@andregoldstein, in my case I don’t have oplog turned on (hadn’t gotten around to it). @waldgeist, what about you?

Oplog is on, though I just learned that it doesn’t work with all kinds of queries (e.g. it doesn’t work with geospatial queries).

Isn’t it on by default? In any case, swapping it out may be worth a go, as it seems to have solved a few issues like this before, according to the GitHub issue @peterfkruger linked to.

After upgrading my database instance on Atlas (which presumably reboots the whole cluster), I haven’t seen any issues in the last 48 hours. Maybe it was related to my specific cluster, maybe it was related to the size/RAM/network capacity of the DB, or maybe it was a coincidence.

I don’t have the MONGO_OPLOG_URL env variable set up, and Meteor APM confirms it’s not on right now.

Unless this.unblock() has been called, DDP processes messages in strict sequence order.

Methods and subscriptions almost always end up making some sort of API call via an underlying TCP connection, usually a MongoDB operation (Meteor Collection), but it can be any other API call, such as one made through the Email or HTTP packages or similar.

Meteor uses Fibers to make most of these API calls synchronous. (This is in fact very convenient, although from today’s point of view it would be just as good to use async/await, as opposed to the obscure and non-standard Fiber stuff.)

Now, if that API call just never finishes, meaning the remote service just fails to deliver data and also to close the connection: that’s the recipe for disaster in a Meteor application. I’m not sure how plausible it is that MongoDB requests get stuck for an indefinite time, but we have at least some testimonies about Atlas occasionally acting up pretty badly.

The Fiber in place that made the call synchronous never gets cleared, hence the method (or subscription) gets stuck indefinitely. Consequently, barring this.unblock() the entire sequence of DDP messages will be stuck too, and there’s simply no mechanism in place to get out of that calamity.

The result is what we predictably see in some unfortunate apps: method callbacks aren’t called, and the app becomes unresponsive. The only way to get it working again is to restart the Meteor instance.

The above is at least a possible scenario to explain what’s happening. But it may also be that there are multiple unrelated scenarios that all lead to blocking the DDP messages, and ultimately to freezing up the app.
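
To make the blocking scenario concrete, here is a toy model of a queue that handles one message at a time. It is not Meteor’s implementation, and createSessionQueue, enqueue and done are names made up for this illustration; only unblock is meant to mirror what this.unblock() does.

```javascript
// Toy model (NOT Meteor's real code): handlers run strictly one after
// another unless the running handler releases the queue early.
function createSessionQueue() {
  const pending = [];
  let busy = false;

  function runNext() {
    if (busy || pending.length === 0) return;
    busy = true;
    const handler = pending.shift();

    let released = false;
    const release = () => {
      if (released) return; // releasing twice has no further effect
      released = true;
      busy = false;
      runNext();
    };

    // `done` signals normal completion; `unblock` lets later handlers
    // start even though this one has not finished yet.
    handler({ done: release, unblock: release });
  }

  return {
    enqueue(handler) {
      pending.push(handler);
      runNext();
    },
  };
}

// A handler that never completes (e.g. waiting forever on a remote call)
// and never unblocks: everything queued behind it is stuck.
const log = [];
const queue = createSessionQueue();
queue.enqueue(() => { log.push('stuck handler started'); });
queue.enqueue(({ done }) => { log.push('second handler ran'); done(); });
// At this point log contains only 'stuck handler started'.
```

If the first handler calls unblock() before it starts waiting, the second handler runs even though the first one never finishes.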

It’s now been 72 hours since I last saw the issue. It seems to have stopped after upgrading my Atlas cluster. Could be a coincidence, or could be the following:

  • Caused by an issue with the Cluster that upgrading fixed by recreating the Cluster
  • Caused by an issue due to limited RAM or CPU
  • Caused by an issue due to throttled/limited network

Again, this is all speculation, but after previously seeing the issue on a daily basis, I haven’t seen it once since upgrading my Atlas cluster.

@peterfkruger, when I was having the issue, I looked inside the websocket message logs. All the methods and subscriptions that were called by the client received a result/ready response. Maybe something is going wrong on the client side?

I’m using this code to detect the issue on the client side and then send a push via REST to a monitoring server:

  Accounts.onLogin(() => {
    if(Meteor.loggingIn()) {
      window.setTimeout(() => {
        if(Meteor.loggingIn()) {
          // Report
        }
      }, 2000);
    }
  });

Thanks to this, I have confirmed that the issue seems to still be present.

I have opened an issue on the Meteor GitHub repo: “Meteor logged in but Meteor.loggingIn() stuck on true / initial login method works but subsequent methods never complete” (Issue #11323, meteor/meteor).

Then we’re possibly having different issues, after all.

In @andregoldstein’s app we’ve built in a wrapper around every Meteor.call similar to your code monitoring unsuccessful logins, also using window.setTimeout.

With that, we were able to confirm that callbacks of Meteor.call aren’t called anymore on the client once the phenomenon starts appearing. My understanding is that the same is happening in @waldgeist’s app, though I myself did not take part in formally confirming it.
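
For illustration, a wrapper of this kind could look roughly like the sketch below. It is not the actual code from that app. callWithWatchdog and its options (timeoutMs, onStuck, callback) are hypothetical names, and the function that performs the call is passed in, so on a Meteor client you would hand it something like (...a) => Meteor.call(...a).

```javascript
// Hypothetical wrapper: invoke a method and report if its callback has
// not fired within `timeoutMs`. `call` is injected by the caller.
function callWithWatchdog(call, name, args, options = {}) {
  const { timeoutMs = 10000, onStuck, callback } = options;
  let settled = false;

  // Arm the watchdog before the call goes out.
  const timer = setTimeout(() => {
    if (!settled && onStuck) onStuck({ name, timeoutMs });
  }, timeoutMs);

  // The method callback disarms the watchdog and forwards the outcome.
  call(name, ...args, (error, result) => {
    settled = true;
    clearTimeout(timer);
    if (callback) callback(error, result);
  });
}

// Possible usage on a Meteor client (assumed, not run here):
// callWithWatchdog((...a) => Meteor.call(...a), 'someMethod', [arg], {
//   timeoutMs: 5000,
//   onStuck: (info) => { /* report to a monitoring server */ },
//   callback: (error, result) => { /* normal handling */ },
// });
```

The watchdog only reports. It does not cancel or retry the call, so a late callback still goes through to the normal handler.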

Massive thanks again to @peterfkruger for all his help. Incredibly generous with his time.

Hi All,

I have found the cause of the problem in my case.

I was connecting to a second Meteor server simultaneously. The connection was crucial to the app, so it was established at startup.

I found that when you use DDP.connect, if the timing works out just right, DDP.onReconnect can be fired as the login method is resolving. This stops the login method in its tracks, because the connection passed to DDP.onReconnect is not the same as the connection the login method was called on. This in turn prevents anything else from executing.

If you are not using DDP.connect in your app, then your issue may be unrelated to mine.
