We have an issue where, a couple of times a month, one of our production Meteor server instances starts “spinning”. That is, CPU usage spikes to 100% and stays there, leaving the server unable to accept new connections or even DDP data. The issue persists until the server is manually restarted, at which point users lose any data they entered while it was happening.
See the attached graphs for an example (in this case, a single instance running on one 2.6 GHz core). The problem starts at 15:54 and persists until we restart the server at 16:12. We have tried logging every method, publish, and observeChanges call, as well as active sessions and subscriptions. Nothing seems out of the ordinary, and we can’t find any pattern from one occurrence to the next (e.g. name of the last method called on the server, users/agents connected, RAM usage, etc.).
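To make the kind of logging described above concrete, here is a minimal sketch in plain JavaScript. It is not our exact code: `wrapWithLogging` and `wrapAll` are hypothetical helper names, and the idea is only that each handler logs when it starts and when it finishes, so a call that logs "start" but never "done" would be a candidate for the code that is spinning.

```javascript
// Hypothetical helper: wraps a synchronous handler so that every call
// logs a "start" line and a "done" line with the elapsed time.
function wrapWithLogging(name, handler, log = console.log) {
  return function wrapped(...args) {
    const startedAt = Date.now();
    log(`start ${name}`);
    try {
      // Preserve `this` so the handler still sees its invocation context.
      return handler.apply(this, args);
    } finally {
      // Runs whether the handler returns or throws.
      log(`done ${name} ${Date.now() - startedAt}ms`);
    }
  };
}

// Hypothetical helper: wraps every function in a name -> handler map.
function wrapAll(handlers, log = console.log) {
  const wrapped = {};
  for (const [name, handler] of Object.entries(handlers)) {
    wrapped[name] = wrapWithLogging(name, handler, log);
  }
  return wrapped;
}
```

Under that assumption, the map of handlers would be passed through `wrapAll` before being registered, for example `Meteor.methods(wrapAll({ ... }))`. Note that this only times the synchronous part of a handler, so it would not reveal work that continues in callbacks afterwards.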
How would you find out which part of the code suddenly starts consuming all resources, given that the problem is so infrequent and only happens on production servers? Any pointers would be much appreciated. Thank you.