Unhealthy app when memory is at 35%

When memory usage in our app goes above 350 MB, it shows the container as unhealthy, even though we're on a 1 GB RAM plan. This kind of memory allocation is normal for our application.

Memory usage details when the error message appears: rss: 388 MB, heapTotal: 301 MB, heapUsed: 276 MB, external: 26 MB
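
Those field names match what Node’s process.memoryUsage() returns. For context, here is a minimal sketch of how numbers like these can be logged, converted to MB (the 60-second interval and the rounding are just illustrative):

    // Log Node's memory stats in MB; the interval is arbitrary.
    const toMB = (bytes) => Math.round(bytes / 1024 / 1024);

    setInterval(() => {
      const { rss, heapTotal, heapUsed, external } = process.memoryUsage();
      console.log(
        `rss: ${toMB(rss)} MB, heapTotal: ${toMB(heapTotal)} MB, ` +
        `heapUsed: ${toMB(heapUsed)} MB, external: ${toMB(external)} MB`
      );
    }, 60 * 1000);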

Is there anything else I should be looking for?

I’ve been seeing this quite a bit myself lately. Not sure how to troubleshoot.

This has been happening to me as well, a few times a day lately. It just happened a few minutes ago, and I quickly jumped into Galaxy to see if I could spot anything.

When I look at the “1 hour” view of the CPU graph in Galaxy, I don’t see any issue and the CPU never goes past 40%, but when I switch to the “5 min” view of the same graph, I can see that a 2nd DDP connection came in and then the CPU spiked to 100% for a short while. So I think the “5 min” view is the only one with fine enough granularity to catch a short spike (if you can get a look before it’s gone from the graph). In my case, it seems like something that 2nd user did on the site triggered it.

A few things you could check:

  • If you can see the spike in the 5 min CPU graph, you can check the other graphs (Connections and Memory) to look for a cause. Perhaps a spike in users clicking a link in a scheduled email around the same time or something like that. Or something is causing memory to spike which is in turn driving the CPU up.
  • Check the Galaxy logs for anything unusual
  • Try using APM or Kadira to investigate further
  • If a whole bunch of data got written to the DB in a short period of time, the Meteor server might be struggling to catch up with oplog tailing and pegging the CPU
  • Perhaps something like collection-hooks or publish-composite is triggering a bunch of queries you’re not aware of or forgot existed (see the sketch after this list)
  • If you recently upgraded to a new version of Meteor or upgraded some packages, check whether there are any reported bugs (or bug-fix releases) for those versions that address CPU spikes
  • Check for cron jobs that might be doing a lot of processing
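
To illustrate the collection-hooks point, here’s a hypothetical example (the collections, field names, and query are made up, and it assumes the matb33:collection-hooks package). A hook like this runs on every insert, so a burst of writes quietly multiplies the query load:

    import { Mongo } from 'meteor/mongo';

    // Hypothetical collections, just for illustration.
    const Orders = new Mongo.Collection('orders');
    const Customers = new Mongo.Collection('customers');

    // With matb33:collection-hooks, this runs on every insert into Orders,
    // adding an extra write per document; easy to forget it exists.
    Orders.after.insert(function (userId, doc) {
      Customers.update({ _id: doc.customerId }, { $inc: { orderCount: 1 } });
    });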

@andyj In my case it seemed like only a very short CPU spike caused the unhealthy alert, maybe 10-15 seconds at 100%, so is it possible your memory reading didn’t quite catch the right reporting window, or was averaged over, say, 30 seconds and therefore didn’t register? I’m thinking of Activity Monitor on Mac, for example, which only updates every 8 seconds.
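
If it helps, here’s a rough sketch of sampling CPU time over a shorter window so a 10-15 second spike doesn’t get averaged away (the 5-second window is arbitrary; it just uses Node’s process.cpuUsage()):

    // Measure CPU time used in each 5-second window so short spikes
    // show up instead of being averaged out over a longer period.
    let last = process.cpuUsage();

    setInterval(() => {
      const delta = process.cpuUsage(last); // user/system microseconds since `last`
      last = process.cpuUsage();
      const windowMicros = 5 * 1000 * 1000; // 5 s of wall time (per core)
      const pct = ((delta.user + delta.system) / windowMicros) * 100;
      console.log(`CPU over last 5s: ${pct.toFixed(1)}%`);
    }, 5 * 1000);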