Meteor “login” method never invokes callback, blocks all future RPC calls

Thanks @typ for the insightful reply.

Actually, now that I recall, I’ve seen something like this happen when using an older version of MongoDB during a stress test. The issue there was that MongoDB was locking, and so many requests got queued that the Meteor server hung.

My guess is that either an API/MongoDB call or some other I/O is not returning, or not returning fast enough under traffic, which eventually exhausts the Meteor fibers and causes the server to get stuck. If the load persists, the issue gets worse as more requests queue up. For those seeing improvement after switching to redis-oplog, the DB is likely the bottleneck, but it could also be something else, like another DDP server or an external API. I don’t know whether the Meteor/Node server has any defence, timeout, or recovery logic for scenarios like this. I doubt it, unless someone knows more… It seems that once the server reaches this state, it’s dead in the water and requires a manual restart. So for now one can only apply protective measures, i.e. reduce the chance of hitting this state by ensuring there is enough CPU for the DB and app servers, and by being careful when making I/O calls (something like the timeout guard sketched below)…
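
To illustrate what I mean by being careful with I/O, here is a minimal sketch of a timeout guard around an external call inside a method. The method name, the endpoint, and the `callExternalApi` helper are all hypothetical, and it assumes a Node runtime with a global `fetch`; the point is that a hung upstream service fails that one method call instead of holding it open indefinitely.

```ts
import { Meteor } from 'meteor/meteor';

// Hypothetical upstream call -- stands in for whatever external API/DB I/O the app makes.
async function callExternalApi(userId: string): Promise<unknown> {
  const res = await fetch(`https://api.example.com/profiles/${userId}`);
  return res.json();
}

// Reject if the wrapped promise doesn't settle within `ms`, so one slow
// upstream service can't keep a method call pending forever.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Meteor.Error('timeout', `${label} exceeded ${ms}ms`)),
      ms
    );
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

Meteor.methods({
  // Hypothetical method name; Meteor resolves the returned promise before replying to the client.
  async 'profile.refresh'() {
    if (!this.userId) throw new Meteor.Error('not-authorized');
    return withTimeout(callExternalApi(this.userId), 5000, 'profile.refresh');
  },
});
```

Of course this only bounds individual calls; it doesn’t recover the server once it’s already wedged, which is the part that seems to be missing.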


@alawi yeah, protective measures have been the only thing that has helped us. Redis oplog has been huge for that, and I highly recommend trying it if you haven’t.

We have some load tests that simulate our common user interactions, so maybe we’ll try running them against a staging environment while manually triggering a Mongo failover or a configuration change, to see if we can force the “stuck state” (see the failover sketch below).
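
A minimal sketch of triggering that failover from a test script, assuming the official `mongodb` Node driver and a staging replica set reachable through `MONGO_URL` (the URL is a placeholder; never point this at production):

```ts
import { MongoClient } from 'mongodb';

// Staging replica set only -- adjust host/credentials for your environment.
const url = process.env.MONGO_URL ?? 'mongodb://localhost:27017/?replicaSet=rs0';

async function triggerFailover(): Promise<void> {
  const client = new MongoClient(url);
  await client.connect();
  try {
    // Ask the current primary to step down for 60s; the secondaries then hold an election,
    // which is exactly the window in which the Meteor app might wedge.
    await client.db('admin').command({ replSetStepDown: 60 });
  } catch (err) {
    // Older MongoDB versions drop client connections on stepDown, so a network error here can be normal.
    console.log('stepDown issued:', (err as Error).message);
  } finally {
    await client.close();
  }
}

triggerFailover().catch(console.error);
```

The idea would be to run this midway through the load test and watch whether method calls ever resume once a new primary is elected.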

I’ll update here if we go that route, as it might lead to a consistent reproduction of the issue, which I know the Meteor team will absolutely need if they’re to even attempt solving this.


That sounds promising, but perhaps there is an easier way to reach that “stuck state”: something like reducing the fibers pool size and force-locking the DB under stress load, which would make it easier to reach the state systematically (a sketch of the locking part is below). Like you said, getting a systematic reproduction solves half of the problem. I guess the fix would be cleanup and recovery logic in the server once this state is detected, but it is also not clear (at least to me) where in the stack things are getting stuck (Node? fibers? Meteor code?); perhaps someone more familiar with the code can provide more insight here.
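
For the force-locking part, here is a minimal sketch that holds a write lock on a staging mongod while a load test runs, again using the official `mongodb` Node driver (the duration and URL are placeholders). For shrinking the pool, appending `maxPoolSize=5` (or similar) to `MONGO_URL` limits the driver’s connection pool and should make saturation easier to hit; I’m not aware of a supported knob for the fibers pool itself.

```ts
import { MongoClient } from 'mongodb';

// Staging database only -- the lock blocks writes (and reads queued behind them) on the whole mongod.
const url = process.env.MONGO_URL ?? 'mongodb://localhost:27017';

async function lockDatabaseFor(seconds: number): Promise<void> {
  const client = new MongoClient(url);
  await client.connect();
  const admin = client.db('admin');
  try {
    await admin.command({ fsync: 1, lock: true }); // equivalent to db.fsyncLock() in the shell
    await new Promise((resolve) => setTimeout(resolve, seconds * 1000));
  } finally {
    await admin.command({ fsyncUnlock: 1 });       // equivalent to db.fsyncUnlock() in the shell
    await client.close();
  }
}

// Hold the lock for 30s while the load test is hammering the Meteor server.
lockDatabaseFor(30).catch(console.error);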
