I’m trying to get to the bottom of an ongoing issue we have with one of our apps.
At seemingly irregular intervals (sometimes every few weeks, other times up to a couple of times a day), a Meteor method will not complete.
To the end user, this normally shows up when they click a “submit” button (for an insert or update, for example): a spinner displays, but the method just seems to “hang”, i.e. no error is returned and no return value either.
If the page is manually refreshed it appears that the operation has in fact been successful but the end users are having to manually refresh the page each time.
I should say that restarting my containers (I’m currently using Scalingo PaaS) immediately solves this problem, and yet there are no metrics indicating any memory or CPU spikes or issues. In fact, usage levels are fairly comfortable.
Has anyone experienced anything similar or could help determine if this lack of reaction is most likely at the code, server or DB level?
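One way to narrow down where the call stalls is to log both stages of a method call on the client. The sketch below assumes Meteor's documented `onResultReceived` option of `Meteor.apply`, which fires when the server's result arrives, while the regular callback fires only once the method's writes are also visible in the client's local cache. `traceMethod` and the method name in the usage note are hypothetical; `apply` is passed in so the helper can be tried outside a Meteor app.

```javascript
// Hypothetical helper: call a method and log how far the call got.
// `apply` is assumed to have the shape of Meteor.apply(name, args, options, callback).
function traceMethod(apply, name, args, log) {
  const startedAt = Date.now();
  const elapsed = () => Date.now() - startedAt;
  const suffix = (err) => (err ? ' (error)' : '');

  apply(
    name,
    args,
    {
      // Fires as soon as the server's result (or error) arrives, possibly
      // before the client's local cache reflects the method's writes.
      onResultReceived: (err) =>
        log(`${name}: result received after ${elapsed()} ms${suffix(err)}`),
    },
    // Fires only once the result has arrived and the writes are visible locally.
    (err) => log(`${name}: callback fired after ${elapsed()} ms${suffix(err)}`)
  );
}
```

In the app this might be called as `traceMethod(Meteor.apply.bind(Meteor), 'orders.insert', [doc], console.log)`. If "result received" is logged but "callback fired" never is, the method itself finished and the stall is more likely in delivering the data updates to the client than in the method code.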
Assuming you know which single method is affected, can you give us an outline of what it does, technically, and how? Are there any peculiarities, outgoing API calls or anything else beyond ordinary, bread-and-butter MongoDB CRUD? Could it be, for example, that it in fact returns a Promise which sometimes, maybe due to a bug, neither resolves nor rejects?
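To check for a Promise that never settles, one option is to race it against a timer so that a hang turns into a visible error. This is only a minimal sketch in plain JavaScript; `withTimeout` and the names in the usage note are hypothetical, not part of Meteor.

```javascript
// Hypothetical helper: reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} did not settle within ${ms} ms`)),
      ms
    );
  });
  // Whichever settles first wins; the timer is always cleared afterwards
  // so it cannot fire late or keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Inside a method that returns a Promise, the body could be wrapped as `return withTimeout(doTheWork(), 10000, 'orders.insert')`. A hang would then surface as an error on both server and client instead of an endless spinner. Note that the timeout only reports the problem; the underlying operation is not cancelled.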
Thanks for your reply. You raise some good questions and I think I could have been clearer. This is happening with seemingly all methods that involve database operations, so not just one particular method.
Also, there are no third-party APIs involved, just, like you say, regular MongoDB CRUD stuff.
Is there anything in the logs about errors thrown either on the server or on the clients? Do you use any error logger service on the client in the first place?
I urge you to introduce error logging on your app’s client side. Right now you simply can’t know what’s going on on the client in terms of stability, which is often thwarted by the strangest errors on various devices and browsers. I also highly recommend reviewing how React error boundaries are used in your app (assuming you use React). If they are not set up correctly, sometimes even a trivial error can disable your entire application.
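As a starting point before choosing a logging service, the browser's own `error` and `unhandledrejection` events can be forwarded to wherever you collect logs. The following is a minimal sketch, not a full solution; `installErrorReporting` and the `report` callback are hypothetical names, and in a browser `target` would be `window`.

```javascript
// Hypothetical helper: forward uncaught errors and unhandled promise
// rejections to `report`, a function that sends the entry to your logs.
function installErrorReporting(target, report) {
  // Uncaught exceptions thrown by scripts.
  target.addEventListener('error', (event) => {
    report({
      kind: 'error',
      message: event.message || String(event.error),
      stack: event.error && event.error.stack,
    });
  });
  // Promises that rejected without any handler attached.
  target.addEventListener('unhandledrejection', (event) => {
    const reason = event.reason;
    report({
      kind: 'unhandledrejection',
      message: reason && reason.message ? reason.message : String(reason),
      stack: reason && reason.stack,
    });
  });
}
```

On the client this could be installed once at startup, e.g. `installErrorReporting(window, (entry) => console.error(entry))`. Since Meteor methods are the thing that hangs here, it is probably safer for `report` to send entries over a plain HTTP request than through another method call.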
This suggests that the error is on the client side; when you restart your containers, your clients get restarted too. My guess is that your client apps keep running for a while after the restart until more and more of them run into the undetected error and stop working again.
EDIT: sorry, I misread your last answer. Are you saying that you have client-side error logging and nothing is logged there either?
Thanks again for a really helpful answer. I’m not using any particular client-side logging; would you be able to advise on any solution in particular? So far, whenever the problem has arisen, I have been informed and have been able to replicate it on my own connection, with no errors appearing in the browser console.
It’s also triggering a problem for all users; my understanding is that if it were a client-side issue, it would only affect the user(s) who triggered that particular bug?
Thanks for your continued, helpful advice
EDIT: I’ve now set up an instance of this app on a different provider (NodeChef) to see if the configuration makes a difference. This setup also includes Meteor APM; not sure if that will reveal any potential problems.
That’s good; if I’m not mistaken, Meteor APM also comes with a built-in error logging facility. Let’s hope that this setup turns up errors either on the server or on the client.
Other than Meteor APM’s own error logging there are plenty of frameworks available; just google “javascript error logging”. Some are free, others are paid but have a free tier, others are paid only. Unfortunately, I can’t advise you on which one is good.
I have my own ideas about error flow, which go far beyond merely catching and reporting all errors and include collecting relevant context data about the respective error situation, whereas all frameworks I am aware of only ever deal with catching, transporting and displaying the errors themselves.
We don’t know yet. It may just as well be that at some point one of the methods returns some data, either correctly or in error, that causes an error on the client, even a very trivial one, and that this knocks out all clients one by one as they reach it.
Couldn’t it be that a piece of data that all or most of your clients load sooner or later via one of your methods got corrupted somehow, and this corrupt data is knocking out the clients? Then, when you restart your containers, it takes a while for each client to load and process that piece of data again.
Thanks Peter, hopefully the APM will turn something up. You’ve given me some really good ideas for tracking this down, so I’ll keep debugging!
I’m not aware of much shared local state between multiple users that would be reset on page refresh and could be causing this issue, which is why I was wondering whether it could be a database CPU/RAM issue.