Meteor loggingIn stuck on true even though user is logged in

therealnate · February 14, 2021, 6:24pm

What hosting provider is your app running on? Mine is hosted on Galaxy. Are you also seeing that Meteor.loggingIn() remains stuck on true?

Another strange thing is that I have deployed an identical app to Galaxy before, and it has been running for months. I’ve only started seeing these problems with a new instance of this codebase.

waldgeist · February 14, 2021, 9:49pm

Hi @therealnate. No, I am running my instances on AWS. I did not check the loggingIn() state, so I cannot tell for sure. But the user was successfully logged in.

Apart from that: I was running two almost identical instances on AWS: one was affected by the problem (2x), while the other was not.

The problem went away as soon as I restarted the server. Yet it came back a couple of weeks later. There were theories that blocking methods caused this. But I am not really sure if this is the case, because all sessions were affected, not just the session of one particular user.

The problem also appeared cross-platform. We have two clients: one is based on Unity (so it’s basically a native app communicating via a custom DDP package), and the other is a regular Meteor web frontend based on React. When the problem showed up, all clients and all users were affected.

therealnate · February 14, 2021, 10:06pm

I have also seen that the issue stops upon restarting the server. In my case, however, I am killing the container completely and creating a new one.

I was running on Meteor 1.x but the error persisted after updating to 2.0. I’m also using Node 12.20.1 and NPM 6.14.8. Any overlap with you?

waldgeist · February 14, 2021, 10:08pm

The first time the problem occurred, I stopped and restarted the AWS instance, since I suspected some hardware failure. But the second time, I just did mup restart, and the problem went away, too. This is also when I started to analyze it further, since it didn’t seem to be a coincidence anymore. My server runs on Meteor 1.10.2, with the Node version / Docker container recommended by mup for this version.

peterfkruger · February 14, 2021, 10:10pm

I think we should take advantage of the opportunity of the moment: if the bug is reproducible in your app in the developer version, it should also be possible to debug the server, thus pinpointing where things go wrong.

The assumption is that one of these two things occurs:

1. In one of the methods there is an API call, HTTP or similar, that never actually returns, and therefore the corresponding Future keeps blocking that method. If this.unblock() was not called, the callback on the client never fires, and any subsequent method calls will be stuck too.
2. There is a bug in the DDP handling code that sometimes causes the Future to not get cleared.

Both situations should be detectable with remote debugging, I guess.

waldgeist · February 14, 2021, 10:13pm

No, it wasn’t reproducible in the dev environment. Not even in the staging environment. It only appeared in the prod environment. So I had to restart the server soon after it happened, but I did some analysis at that time.

Either this, or some Exception is not being caught. I read somewhere else that an uncaught Promise rejection might get a whole Node server into a non-recoverable state.

The thing is: I am pretty sure that most of my methods are simple enough to not make this happen, especially those called on initial browser load. However, there might be some third-party package causing this. Dunno.

EDIT: Ah, sorry Peter, I didn’t see it was you responding to @therealnate. I thought he responded to me.

peterfkruger · February 14, 2021, 10:20pm

No problem

I don’t think that a Promise can cause a problem. If there is an uncaught Promise error, it would be logged on the server console or, in the worst case, it would crash the server, but neither is happening.

Another scenario I tested out was to return a Promise in a method that never clears:

return new Promise(()=> {})

What the above code does is to never fire the callback on the client pertaining to that method invocation, yet it does not affect any subsequent method calls, nor other users, and the server remains fully operable, except for that very method (which is broken by design).

therealnate · February 14, 2021, 10:50pm

I am also unable to reproduce it in dev or staging.

Correct me if I’m wrong, but the way Meteor works is that, unless you call this.unblock(), the client won’t be able to run any other methods until the first one in the queue resolves.

@waldgeist, from the client side of a user experiencing the issue, have you tried looking in the websocket message log from the dev tools? This may be able to tell you if it’s a method call that’s not completing. In my case, I saw that the login method was indeed receiving a result, but nothing was posted to the websocket from the client after that (aside from the usual ping pong).
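
For anyone who wants to check a captured websocket log the same way, here is a minimal sketch of how the frames could be paired up. It assumes the frames were copied out of the dev tools and parsed into objects, and the `dir` field ("out" for client to server, "in" for server to client) and the `getProfile` method name are made up for the example. The `method`, `result`, `updated` and `ping` message shapes follow the DDP protocol. As far as I know, the client only fires a method callback once both the `result` and the matching `updated` message have arrived, so the sketch treats a call as pending until it has seen both.

```javascript
// Sketch: list method calls in a captured DDP log that never fully completed.
// Assumption: each frame is { dir, data }, where `dir` is a hypothetical tag
// ("out" = sent by the client, "in" = received) and `data` is the parsed message.
function findPendingMethods(frames) {
  const pending = new Map(); // method id -> { method, gotResult, gotUpdated }

  for (const { dir, data } of frames) {
    if (dir === 'out' && data.msg === 'method') {
      pending.set(data.id, { method: data.method, gotResult: false, gotUpdated: false });
    } else if (dir === 'in' && data.msg === 'result' && pending.has(data.id)) {
      pending.get(data.id).gotResult = true;
    } else if (dir === 'in' && data.msg === 'updated') {
      // "updated" carries a list of method ids whose writes are now visible.
      for (const id of data.methods || []) {
        if (pending.has(id)) pending.get(id).gotUpdated = true;
      }
    }
  }

  // Keep only the calls that are still missing "result" or "updated".
  return [...pending.entries()]
    .filter(([, state]) => !(state.gotResult && state.gotUpdated))
    .map(([id, state]) => ({ id, ...state }));
}

// Example log: login completes, a later (hypothetical) call gets no answer.
const sampleFrames = [
  { dir: 'out', data: { msg: 'method', method: 'login', params: [], id: '1' } },
  { dir: 'in', data: { msg: 'result', id: '1', result: {} } },
  { dir: 'in', data: { msg: 'updated', methods: ['1'] } },
  { dir: 'out', data: { msg: 'method', method: 'getProfile', params: [], id: '2' } },
  { dir: 'in', data: { msg: 'ping' } },
];

console.log(findPendingMethods(sampleFrames));
```

If the list comes back empty while the app is frozen, every call the client sent was answered, which would point away from a blocked method on the server.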

andregoldstein · February 15, 2021, 4:49pm

To confirm, I believe that switching out oplog for redis oplog has made my problem disappear. Even if we’re not entirely sure what was bugging…

therealnate · February 15, 2021, 5:17pm

@andregoldstein, in my case I don’t have oplog turned on (hadn’t gotten around to it). @waldgeist, what about you?

waldgeist · February 15, 2021, 5:54pm

Oplog is on, though I just learned that it doesn’t work with all kinds of queries (e.g. it doesn’t work with geospatial queries).

andregoldstein · February 15, 2021, 6:26pm

Isn’t it on by default? In any case, swapping it out may be worth a go, as it seems to have solved a few issues like this before, as per the GitHub link @peterfkruger posted.

therealnate · February 15, 2021, 7:14pm

After upgrading my database instance (which presumably reboots the whole cluster) on Atlas, I haven’t seen any issues in the last 48 hours. Maybe it was related to my specific cluster, maybe it was related to the size/ram/network capacity of the DB, or maybe it was a coincidence.

I don’t have the MONGO_OPLOG_URL env variable set up, and Meteor APM confirms it’s not on right now.

peterfkruger · February 15, 2021, 8:08pm

Without having issued this.unblock(), the way DDP works is to follow the strict sequence order.

Methods and subscriptions almost always end up making some sort of API call via an underlying TCP connection, usually a MongoDB operation (Meteor Collection), but it can be any other API call, such as one made through the Email or HTTP packages or similar.

Meteor uses Fibers to make most of these API calls synchronous. (This is in fact very convenient, although from today’s point of view it would be just as good to use async/await, as opposed to the obscure and non-standard Fiber stuff.)

Now, if that API call just never finishes, meaning the remote service just fails to deliver data and also to close the connection: that’s the recipe for disaster in a Meteor application. I’m not sure how plausible it is that MongoDB requests get stuck for an indefinite time, but we have at least some testimonies about Atlas occasionally acting up pretty badly.

The Fiber in place that made the call synchronous never gets cleared, hence the method (or subscription) gets stuck indefinitely. Consequently, barring this.unblock() the entire sequence of DDP messages will be stuck too, and there’s simply no mechanism in place to get out of that calamity.

The result is what we predictably see in some unfortunate apps: method callbacks aren’t called, and the app becomes unresponsive. The only way to get it working again is to restart the Meteor instance.

The above is at least a possible scenario to explain what’s happening. But it may also be that there are multiple unrelated scenarios that all lead to blocking the DDP messages, and ultimately to freezing up the app.
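
To make the blocking scenario concrete, here is a toy model of one client's method queue. It is only an illustration of the ordering rule described in this post, not Meteor's actual DDP code, and the `hangs` and `unblocks` flags as well as the method names are invented for the example.

```javascript
// Toy model of one client's method queue on the server (not Meteor's real code).
// Each entry is { name, hangs, unblocks }:
//   hangs    - the method's underlying API call never returns
//   unblocks - the method calls this.unblock() before making that call
function simulateQueue(methods) {
  const completed = [];
  for (const m of methods) {
    if (!m.hangs) {
      completed.push(m.name);
      continue;
    }
    // A hanging method never completes. If it did not unblock first,
    // nothing queued behind it is ever started.
    if (!m.unblocks) break;
  }
  return completed;
}

const queue = [
  { name: 'login', hangs: false, unblocks: false },
  { name: 'fetchReport', hangs: true, unblocks: false },
  { name: 'saveSettings', hangs: false, unblocks: false },
];

console.log(simulateQueue(queue));
// Same queue, but the hanging method called this.unblock() first:
console.log(simulateQueue(queue.map((m) => ({ ...m, unblocks: true }))));
```

In the first run only `login` completes. In the second run `saveSettings` completes as well, while `fetchReport` stays stuck on its own.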

therealnate · February 16, 2021, 10:50pm

It’s now been 72 hours since I last saw the issue. It seems to have stopped after upgrading my Atlas cluster. Could be a coincidence, or could be the following:

Caused by an issue with the Cluster that upgrading fixed by recreating the Cluster
Caused by an issue due to limited RAM or CPU
Caused by an issue due to throttled/limited network

Again, this is all speculation, but after previously seeing the issue on a daily basis, I haven’t seen it once since upgrading my Atlas cluster.

therealnate · February 16, 2021, 11:23pm

@peterfkruger, when I was having the issue, I looked inside the websocket message logs. All the methods and subscriptions that were called by the client received a result/ready response. Maybe something is going wrong on the client side?

therealnate · February 17, 2021, 3:15am

I’m using this code to detect the issue on the client side and then send a report via REST to a monitoring server:

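  import { Meteor } from 'meteor/meteor';
  import { Accounts } from 'meteor/accounts-base';

  // onLogin fires after a successful login, so loggingIn() should turn false soon after.
  Accounts.onLogin(() => {
    if (Meteor.loggingIn()) {
      // Allow a 2-second grace period before treating the flag as stuck.
      window.setTimeout(() => {
        if (Meteor.loggingIn()) {
          // Report: loggingIn() is still true 2 seconds after the login succeeded
        }
      }, 2000);
    }
  });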

Thanks to this, I have confirmed that the issue seems to still be present.

I have opened an issue on the Meteor GitHub repo: Meteor logged in but Meteor.loggingIn() stuck on true / initial login method works but subsequent methods never complete · Issue #11323 · meteor/meteor · GitHub

peterfkruger · February 17, 2021, 8:31am

Then we’re possibly having different issues, after all.

In @andregoldstein’s app we’ve built in a wrapper around every Meteor.call similar to your code monitoring unsuccessful logins, also using window.setTimeout.

By that we were able to confirm that callbacks of Meteor.call aren’t called anymore on the client once the phenomenon starts appearing. My understanding is that the same is happening in @waldgeist’s app, though I myself did not take part in formally confirming it.
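
For illustration, a wrapper of that kind could look roughly like the sketch below. This is not the code from @andregoldstein’s app. The names `callWithWatchdog`, `onStuck` and `timeoutMs` are invented here, and `callFn` stands in for Meteor.call so the sketch can also run outside of Meteor.

```javascript
// Sketch of a watchdog around a Meteor.call-style function.
// Assumptions: `callFn(name, ...args, callback)` behaves like Meteor.call,
// and `onStuck(name, timeoutMs)` is a hypothetical reporting hook.
function callWithWatchdog(callFn, onStuck, timeoutMs, name, ...args) {
  // If the caller passed a callback as the last argument, peel it off.
  const last = args[args.length - 1];
  const userCallback = typeof last === 'function' ? args.pop() : undefined;

  // Fires only if the method callback has not been called in time.
  const timer = setTimeout(() => onStuck(name, timeoutMs), timeoutMs);

  callFn(name, ...args, (error, result) => {
    clearTimeout(timer); // the callback fired, so the call is not stuck
    if (userCallback) userCallback(error, result);
  });
}
```

In an app it might be used as `callWithWatchdog((...a) => Meteor.call(...a), reportStuck, 10000, 'someMethod', arg, callback)`, where `reportStuck` and `'someMethod'` are placeholders. The watchdog only reports a late callback. It does not cancel or retry the call.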

andregoldstein · February 17, 2021, 4:28pm

Massive thanks again to @peterfkruger for all his help. Incredibly generous with his time

therealnate · February 24, 2021, 5:38pm

Hi All,

I have found the cause of the problem in my case.

I was connecting to a second Meteor server simultaneously. The connection was crucial to the app, so it was established at startup.

I found that when you use DDP.connect, if the timing works out just right, DDP.onReconnect can be fired as the login method is resolving. This stops the login method in its tracks, because the DDP.onReconnect connection is not the same as the connection the login method was called on. This in turn prevents anything else from executing.

If you are not using DDP.connect in your app, then your issue may be unrelated to mine.