Meteor loggingIn stuck on true even though user is logged in

Hi everyone, seeing this weird issue on a new Galaxy deployment of a working codebase.

Meteor.loggingIn() is stuck on true, even though the Accounts.onLogin callback executes and Meteor.user() is defined. Meteor.logout() has no effect, and the user remains logged in even after reloading the page. Meteor.status() reports connected.

I looked at the raw messages in the websocket and everything looks normal:

  • Client calls connect, calls login with resume token, and subs to loginServiceConfiguration and autoupdate client versions
  • Server replies with server id, session, user object, result for login call, and marks the login service and autoupdate version subs as ready

I am seeing no logs on the server side

Update: It looks like method calls remain blocked. I am not able to call any methods. I’ve checked the websocket logs, and when Meteor.call is run, nothing is sent to the server.

Update #2: Restarting the server (with nothing changing on the client) fixes the issue, but I’ve seen the issue multiple times in the past few days so it keeps coming up. Should also note that the issue is not temporary, the client experiences it even after reloading the page, reopening the browser, etc.

Update #3: Noticed this issue in the console. Didn’t happen at the time of trying to login, but a bit earlier. Possibly related?

Exception in defer callback: TypeError: connection.close is not a function
at packages/accounts-base/accounts_server.js:968:22
at Meteor.EnvironmentVariable.EVp.withValue (packages/meteor.js:1234:12)
at packages/meteor.js:550:25
at runWithEnvironment (packages/meteor.js:1286:24)

Update #4: A similar thread (Strange phenomenon: Subsequent methods or subs won't succeed - #64 by waldgeist) mentions that this could be related to Atlas. I am also using an Atlas M2 cluster. I am trying some fixes on Atlas and will see if the issue happens again.

Solution:

I was connecting to a second Meteor server simultaneously. The connection was crucial to the app, so it was established at startup.

I found that when you use DDP.connect, if the timing works out just right, DDP.onReconnect can be fired as the login method is resolving. This stops the login method in its tracks, because the DDP.onReconnect connection is not the same as the connection the login method was called on. This in turn prevents anything else from executing.

If you are not using DDP.connect in your app, then your issue may be unrelated to mine.

Chances are that these problems are related to what has just recently been discussed in this thread (without a resolution): Strange phenomenon: Subsequent methods or subs won’t succeed

Also see: Method Callbacks not being called intermittently (github discussion)

Hi @peterfkruger, thanks for your response. Skimming through the thread I found this post mentioning that the issue could be tied to Atlas: Strange phenomenon: Subsequent methods or subs won't succeed - #64 by waldgeist.

I’m going to try restarting/updating my Atlas cluster, but this is somewhat alarming as I only created the cluster about a week ago.

This was actually referring to another problem, which caused a huge delay in pub response times. It is not related to the original question mentioned in the thread, sorry that I mixed things up. I should have created a separate thread for this. That’s also the reason why I did not mark the original thread as “resolved”. We might have fixed the problem by adding this.unblock() to all methods. But as I mentioned in the thread, adding this.unblock() to pubs as well caused other unwanted (and in my case serious) side-effects. So beware of doing this unless you know exactly what you are doing.

Your problem indeed looks very similar to what @peterfkruger and I are experiencing. It happens every once in a while, sometimes after weeks of normal operation. I learned that it affects each connection, but only after the initial method / sub calls. The login works fine, and the connection is up and running. But all further method calls and subs never return. Haven’t seen the connection.close exception, though. In my case, the methods and subs just die silently.

@peterfkruger Do I remember correctly that your customer wasn’t using ATLAS? Otherwise, this might be a hint.

It’s @andregoldstein’s app that shows seemingly exactly the same phenomenon as yours (the original one with the hanging methods). He tried three different db providers, Atlas being one of them, to find out whether the bug was related to the db hosting, but the error seems unrelated to it.

Thanks for the clarification. I also don’t think that it is related to ATLAS.

Hi @waldgeist

What hosting provider is your app running on? Mine is hosted on Galaxy. Are you also seeing that Meteor.loggingIn() remains stuck on true?

Another strange thing is that I deployed an identical app to Galaxy before, and it has been running for months. I’ve only started seeing these problems with a new instance of this codebase.

Hi @therealnate No, I am running my instances on AWS. I did not check the loggingIn() state, so I cannot tell for sure. But the user was successfully logged in.

Apart from that: I was running two almost identical instances on AWS: one was affected by the problem (2x), while the other was not.

The problem went away as soon as I restarted the server. Yet it came back a couple of weeks later. There were theories that blocking methods caused this. But I am not really sure if this is the case, because all sessions were affected, not just the session of one particular user.

The problem also appeared cross-platform. We have two clients: one is based on Unity (so it’s basically a native app communicating via a custom DDP package), and the other is a regular Meteor web frontend based on React. When the problem showed up, all clients and all users were affected.

I have also seen that the issue stops upon restarting the server. In my case, however, I am completely killing the container and creating a new one.

I was running on Meteor 1.x but the error persisted after updating to 2.0. I’m also using Node 12.20.1 and NPM 6.14.8. Any overlap with you?

The first time the problem occurred, I stopped and restarted the AWS instance, since I suspected some hardware failure. But the second time, I just did mup restart, and the problem went away, too. This is also when I started to analyze it further, since it didn’t seem to be a coincidence anymore. My server runs on Meteor 1.10.2, with the Node version / Docker container recommended by mup for this version.

I think we should take advantage of the opportunity of the moment: if the bug is reproducible in your app in the developer version, it should also be possible to debug the server, thus pinpointing where things go wrong.

The assumption is that one of these two things occurs:

  1. In one of the methods there is an API call, HTTP or similar, that never actually returns, and therefore the corresponding Future keeps blocking that method; if this.unblock() was not called, the callback on the client never fires, and any subsequent method calls will be stuck too.
  2. There is a bug in the DDP handling code that sometimes causes the Future to not get cleared.

Both situations should be detectable with remote debugging, I guess.

No, it wasn’t reproducible in the dev environment. Not even in the staging environment. It only appeared in the prod environment. So I had to restart the server soon after it happened, but I did some analysis at that time.

Either this, or some Exception is not being caught. I read somewhere else that an uncaught Promise rejection might get a whole Node server into a non-recoverable state.

The thing is: I am pretty sure that most of my methods are simple enough to not make this happen, especially those called on initial browser load. However, there might be some third-party package causing this. Dunno.

EDIT: Ah, sorry Peter, I didn’t see it was you responding to @therealnate I thought he responded to me. :slight_smile:

No problem :wink:

I don’t think that a Promise can cause the problem. If there were an uncaught Promise error, it would be logged on the server console, or, in the worst case, it would crash the server, but neither is happening.

Another scenario I tested out was to return a Promise in a method that never clears:

return new Promise(()=> {})

With the above code, the client callback for that method invocation never fires, yet it affects neither subsequent method calls nor other users, and the server remains fully operable, except for that very method (which is broken by design).
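
For anyone who wants to see the underlying promise behavior outside of Meteor, here is a minimal sketch in plain Node. It only shows standard JavaScript semantics: a promise that never settles does not hold up an unrelated one. It says nothing about Meteor's method handling itself, and settlesWithin and demo are hypothetical helper names, not part of any library.

```javascript
// Resolves to true if `promise` settles (fulfills or rejects) within `ms`
// milliseconds, otherwise resolves to false. Hypothetical helper.
function settlesWithin(promise, ms) {
  return new Promise((resolve) => {
    const timer = setTimeout(() => resolve(false), ms);
    const done = () => {
      clearTimeout(timer);
      resolve(true);
    };
    promise.then(done, done);
  });
}

async function demo() {
  // Same shape as the method body quoted above: it never settles.
  const neverSettles = new Promise(() => {});
  // Stands in for some other, unrelated call.
  const unrelated = Promise.resolve("ok");

  return {
    neverSettles: await settlesWithin(neverSettles, 50),
    unrelated: await settlesWithin(unrelated, 50),
  };
}

demo().then((results) => console.log(results));
```

The pending promise simply stays pending; the unrelated one still settles within the time limit.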

I am also unable to reproduce it in dev or staging.

Correct me if I’m wrong, but the way Meteor works is that unless you call this.unblock(), the client won’t be able to call any other methods until the first one in the queue resolves.

@waldgeist, from the client side of a user experiencing the issue, have you tried looking at the websocket message log in the dev tools? This may be able to tell you if it’s a method call that’s not completing. In my case, I saw that the login method was indeed receiving a result, but nothing was posted to the websocket from the client after that (aside from the usual ping pong).
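
If the websocket log is long, a small script can do the matching for you. The following is a rough sketch, not an official tool: it assumes you have copied the frames out of the dev tools as bare DDP JSON strings (if your capture shows a transport wrapper around each message, strip that first), and it relies only on the DDP "method" and "result" messages being paired by id. The function name and the sample frames are made up for illustration.

```javascript
// Given raw DDP frames (JSON strings), list the method calls that were sent
// but never received a matching "result" message.
function findPendingMethods(frames) {
  const pending = new Map(); // method id -> method name
  for (const frame of frames) {
    let message;
    try {
      message = JSON.parse(frame);
    } catch (err) {
      continue; // skip anything that is not plain JSON
    }
    if (message === null || typeof message !== "object") continue;
    if (message.msg === "method") {
      pending.set(message.id, message.method);
    } else if (message.msg === "result") {
      pending.delete(message.id);
    }
  }
  return [...pending].map(([id, method]) => ({ id, method }));
}

// Hypothetical capture: "login" gets a result, "saveProfile" does not.
const sampleFrames = [
  '{"msg":"connect","version":"1","support":["1","pre2","pre1"]}',
  '{"msg":"connected","session":"abc123"}',
  '{"msg":"method","method":"login","params":[{"resume":"token"}],"id":"1"}',
  '{"msg":"result","id":"1","result":{"id":"user1"}}',
  '{"msg":"method","method":"saveProfile","params":[],"id":"2"}',
  '{"msg":"ping"}',
  '{"msg":"pong"}',
];

console.log(findPendingMethods(sampleFrames));
```

A non-empty list points at a method the server never answered. An empty list while calls still appear stuck suggests the calls never left the client, which is what I saw in my case.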

To confirm: I believe that switching out oplog for redis oplog has made my problem disappear, even if we’re not entirely sure what the bug was…

@andregoldstein, in my case I don’t have oplog turned on (hadn’t gotten around to it). @waldgeist, what about you?

Oplog is on, though I just learned that it doesn’t work with all kinds of queries (e.g. it doesn’t work with geospatial queries).

Isn’t it on by default? In any case, swapping it out may be worth a go, as it seems to have solved a few issues like this before, as per the GitHub discussion @peterfkruger linked to.

After upgrading my database instance (which presumably reboots the whole cluster) on Atlas, I haven’t seen any issues in the last 48 hours. Maybe it was related to my specific cluster, maybe it was related to the size/ram/network capacity of the DB, or maybe it was a coincidence.

I don’t have the MONGO_OPLOG_URL env variable set up, and Meteor APM confirms it’s not on right now.

Unless this.unblock() has been called, DDP processes a client’s messages in strict sequence.

Methods and subscriptions almost always end up making some sort of API call via an underlying tcp connection, usually a mongo operation (Meteor Collection) — but it can be any other API call, such as using the packages Email, HTTP or similar.

Meteor uses Fibers to make most of these API calls synchronous. (This is in fact very convenient, although from today’s point of view it would be just as good to use async/await, as opposed to the obscure and non-standard Fiber stuff.)

Now, if that API call just never finishes, meaning the remote service just fails to deliver data and also to close the connection: that’s the recipe for disaster in a Meteor application. I’m not sure how plausible it is that MongoDB requests get stuck for an indefinite time, but we have at least some testimonies about Atlas occasionally acting up pretty badly.

The Fiber in place that made the call synchronous never gets cleared, hence the method (or subscription) gets stuck indefinitely. Consequently, barring this.unblock() the entire sequence of DDP messages will be stuck too, and there’s simply no mechanism in place to get out of that calamity.

The result is what we predictably see in some unfortunate apps: method callbacks aren’t called, and the app becomes non-responding. The only way to get it working again is to restart the Meteor instance.
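
The sequencing argument above can be sketched as a toy model of a single client's message queue. This is not Meteor's code, just an illustration of the reasoning: the function name and the fields hangs and unblocksFirst are invented here, with unblocksFirst standing in for "this.unblock() was called before the call got stuck".

```javascript
// Toy model of one client's DDP message queue. NOT Meteor's implementation;
// `hangs` and `unblocksFirst` are hypothetical fields for illustration.
function simulateSession(calls) {
  const completed = [];
  const stuck = [];
  let queueBlocked = false;

  for (const call of calls) {
    if (queueBlocked) {
      stuck.push(call.name); // never starts: an earlier call holds the queue
      continue;
    }
    if (call.hangs) {
      stuck.push(call.name); // its own callback never fires
      // Without an earlier unblock, everything queued behind it waits forever.
      if (!call.unblocksFirst) queueBlocked = true;
    } else {
      completed.push(call.name);
    }
  }
  return { completed, stuck };
}

const calls = [
  { name: "login", hangs: false, unblocksFirst: false },
  { name: "fetchReport", hangs: true, unblocksFirst: false },
  { name: "saveProfile", hangs: false, unblocksFirst: false },
];

// The hanging call never released the queue: saveProfile is stuck behind it.
console.log(simulateSession(calls));

// Same calls, but each one released the queue before doing its work:
// only the hanging call itself is lost.
console.log(simulateSession(calls.map((c) => ({ ...c, unblocksFirst: true }))));
```

Note that this only models one session. It does not explain reports where all users and all clients were affected at the same time.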

The above is at least a possible scenario to explain what’s happening. But it may also be that there are multiple unrelated scenarios that all lead to blocking the DDP messages, and ultimately to freezing up the app.