Strange phenomenon: Subsequent methods or subs won't succeed

waldgeist · January 20, 2021, 8:25pm

I noticed a strange phenomenon that happens every once in a while on our production server: If the users open up the app for the first time, they can execute a Meteor method as normal, everything works fine. But when they try to execute the same Meteor method again or if a subscription is updated, the server won’t respond anymore. The UI gets stuck because of this.

If I reload the browser page, I can execute one method again. Then the same problem repeats. It feels as if the DDP connection doesn’t work properly anymore and neither methods nor subscriptions return their values. Meteor.connection.status tells me I’m still connected.

If I restart the whole server, though, everything returns back to normal, the app is super-fast again and the problems go away. Then, after a couple of days or weeks, the same problem occurs.

On MontiAPM, I can see no issues. In the server logs, I see some failed Facebook login attempts, but besides this nothing special. If I run my app locally (or even on my staging system which has the same configuration as the prod system), I don’t see these problems.

Has anyone faced similar issues? Or any idea how to track these issues down to their root cause? I’m on Meteor 1.10.2, hosted on AWS and MongoDB Atlas.

peterfkruger · January 21, 2021, 1:18am

Refer to the thread Method Callbacks not being called intermittently on github. Several apps seem to have been affected over a long period of time: the issue was opened in 2016. No one so far has a definitive answer to what may cause this intermittent phenomenon in certain apps, and why it doesn’t affect others, but there are at least theories, none of which was so far confirmed.

There is one claim at the end of the thread that the problem goes away if redis-oplog is added to the app. Someone else claimed that an update of the package npm-mongo fixes the issue, but it didn’t seem to work for others.

And then there is this comment:

Just chiming in because I made a method one time that caused something very similar but wasn’t related. The problem ended up being that I had used a promise that never resolved. It was supposed to process all items in an array at once but an edge case made the array empty, causing my bad code to never resolve the promise. That blocked the fiber on the connection, making it look like this bug was occuring when it was actually my code.

I currently collaborate with someone else on this forum to help solving the very same phenomenon in their app; given that it is not reproducible, and only happens intermittently, we didn’t get very far yet. We should maybe coordinate our efforts!

rjdavid · January 21, 2021, 2:51am

Are you catching “unhandled rejection” errors in your app (server)? We had a similar issue and it turned out an unhandled rejection error causes the app to freeze and we were not aware until we restart the server then eventually it did happen again without any trace in our logs.

waldgeist · January 21, 2021, 9:02am

Thanks a lot for these explanations and pointing me to the GitHub issue. It looks as if my prod system might be affected by exactly that problem. I’m not using Redis Oplog right now. Regarding npm-mongo, I read in the thread that MDG has updated that in the meantime, so I think this root cause should be gone.

The comment about unresolved promises is also very interesting. I’m pretty sure that at least in the desktop browser version of our app there are no promises being used on the server-side. But I will double-check.

BTW: once it occurs, I am seeing this problem on all the platforms we support: Desktop, iOS, and Android. So it is not limited to iOS Safari, as some in the issue thread suggested.

waldgeist · January 21, 2021, 9:02am

What do you mean by “unhandled rejections”? Promise rejections?

waldgeist · January 21, 2021, 9:05am

Just added to the title that not only methods are affected, but also subs. Once the problem occurs, only the first subs succeed. Any other method calls or subs fail. They are initiated, but never receive their data.

So, it is not related to method callbacks on the client. Also sub data won’t get back to the client, so withTracker() code won’t re-run, causing my loading spinners to spin forever.

peterfkruger · January 21, 2021, 9:37am

Introducing handlers for uncaughtException and unhandledRejection is our best bet, thanks a lot @rjdavid. The latter has an unusual angle to it, given the inherently asynchronous nature of Promises. But anyway, at least a warning to the console won’t hurt.

Handling uncaughtException seems to be a totally different animal. I wasn’t aware of this part, for example:

Attempting to resume normally after an uncaught exception can be similar to pulling out the power cord when upgrading a computer. Nine out of ten times, nothing happens. But the tenth time, the system becomes corrupted.

The correct use of 'uncaughtException' is to perform synchronous cleanup of allocated resources (e.g. file descriptors, handles, etc) before shutting down the process. It is not safe to resume normal operation after 'uncaughtException' .

To restart a crashed application in a more reliable way, whether 'uncaughtException' is emitted or not, an external monitor should be employed in a separate process to detect application failures and recover or restart as needed.

rjdavid · January 21, 2021, 9:37am

Yes. We are catching those errors with something like this

// catch unhandledRejection error in the server
process.on(
  'unhandledRejection',
  Meteor.bindEnvironment(error => {
    Meteor._debug('Unhandled Rejection Error', error);
  })
);

We have a handler for debug to log messages

peterfkruger · January 21, 2021, 9:52am

@rjdavid what’s the advantage of using Meteor._debug as opposed to a simple console.error? I find it a little unsettling the the former is an undocumented internal function.

rjdavid · January 21, 2021, 10:28am

@peterfkruger we did function overriding for the default meteor debug function so that it logs the error messages in our error logger. I already forgot the history of this function (this survived a number of meteor projects)

waldgeist · January 21, 2021, 11:07am

This is interesting, indeed. I already noticed that unhandled Promise exceptions can bring a whole server to halt. I was facing this when I was implementing “Sign In With Apple”.

The existing meteor packages didn’t handle exceptions properly, so I had to work-around this via try-catch IIRC. I don’t think this is related to the problem above, but I am wondering if try-catch was a good idea in this case?

waldgeist · January 21, 2021, 11:09am

Thanks a lot for hinting me to process.on('unhandledRejection'). I wasn’t even aware this was possible.

waldgeist · January 21, 2021, 11:17am

One thing I’d like to add: I am pretty sure that we’re not using any promises on server-side if the user access our app via the web-frontend. But the problem appears here as well. Since the first method / sub calls succeeds (including auto-login), it seems to be related to the user connection / session.

peterfkruger · January 21, 2021, 12:05pm

Do you have any callbacks in your server code? If so, you would normally deal with them using Meteor.wrapAsync(). However, if there is an error thrown in the callback code itself would be a case for an uncaughtException, which, according to the description, may leave the server in limbo. I will definitely add a handler accordingly, not just a handler for unhandledRejection.

Actually it would be great if these handlers could be added to Meteor core one day. Apparently it’s easy to forget them, and then nothing indicates a problem that can disable the entire server.

peterfkruger · January 21, 2021, 12:12pm

I try-catch is always a good idea, even as a mere precaution.

I’ve just checked, I have 51 dedicated try-catch blocks in my app, 27 Promise.catch() functions, 55 callbacks wrapped in a try-catch wrapper function, and countless more code blocks indirectly wrapped in try-catch using other mechanisms such as Meteor method mixins and similar.

zodern · January 21, 2021, 6:38pm

I had a similar problem while running tests, which was caused by using sinon.useFakeTimers(). If you are using Meteor Up, you can run mup meteor debug to use node dev tools to inspect what is happening in the server. You might want to start by looking at https://github.com/meteor/meteor/blob/46e00a875726dff078c59eaf7fdc7b956cf0a996/packages/ddp-server/livedata_server.js#L536-L564. The problem is probably that at some point the protocol handler for a message never returns, so the unblock function is never called and it stops processing new messages.

peterfkruger · January 21, 2021, 7:04pm

If we were to intentionally provoke the protocol handler’s call to never return, what could we possibly do in a method?

I’m asking this because this bug seems to affect certain apps and does not at all affect others. There must be something in certain apps that allows this bug to happen. What on Earth could this be?

waldgeist · January 21, 2021, 7:36pm

Thanks @zodern for the info. Is it possible to run mup meteor debug while the server is still running, so I can attach the debugger to the running process? I am asking because the problem only occurs after a while (typically days or even weeks), so I can’t run the prod system in debug mode all the time.

waldgeist · January 21, 2021, 7:52pm

I’m not 100% sure if I understood the code, but it seems as if you mean that line 559 won’t return and thus line 562 is never being called?

I don’t actually see any method in my code where this might happen (but I’ll investigate this further). Could it be that this happens if the handler crashes with an exception and thus the Fiber’s code won’t continue?

Would it help if we called this.unblock() wherever possible? I am using this sparsely, just because I was too lazy to care about it. If I know that parts of a method take longer, I typically call them via Meteor.defer() instead, after the “important” tasks of the method have been performed and the desired result is available.

Besides, I switched from Compose to Mongo DB Atlas some months ago. I noticed that the response times of Atlas are way worse than they are on Compose, even if I access the DB using 3T Studio. Could it be that the connection to Mongo sometimes gets unstable somehow and thus database operations won’t return? (I personally doubt this theory, because the problem only affects the current user session. Once I reload the browser, I can run a method again. Until I try to call a method or sub again, then things get stuck again. The only cure is to restart the whole server.)

Just for curiosity: What’s the reason why methods block by default? Is this because Meteor wants to assure that subsequent operations run in order? Because otherwise I’d say that blocking is only needed if methods rely upon each other, so an explicit block might also make sense.

zodern · January 21, 2021, 8:23pm

Yes, it doesn’t restart the app. It enables the inspector in the running Node process. I added this feature because I had similar problems in some projects - bugs that only happened in production after the server was running a while.

Assuming the method or publication works correctly (it returns, and if it returns a promise, the promise is eventually resolved), there are two things I can think of:

A number of packages wrap or overwrite the functions involved in processing messages and running methods and publications. There could a bug in one of those packages, or a conflict between some of them.
If the method updates a collection, it seems Meteor waits until all publications for that collection have been updated. I doubt there is a bug with this, but it is a possibility.

I’m not very familiar with this part of Meteor, so there probably are many more possibilities.

Meteor handles errors in Methods correctly. As long as the methods return, they shouldn’t be causing this problem. To debug it, I would probably start by trying this:

Add a logpoint to log when a message starts to be processed, and when it is finished
If the logs show specific methods that never completed, add a break point at the end of those methods. When the breakpoint is hit, it will pause the server until you resume it, so usually for apps running in production you would prefer logpoints.
When the break point is hit, step through the code to see what Meteor or packages are doing afterwards, and at what point it gets stuck.