CPU jumping up like crazy and crashing production server

Which OS are you using? CentOS or Ubuntu? Which kernel?

Ubuntu

Kernel info: Linux Draft 3.13.0-24-generic #46-Ubuntu SMP Thu Apr 10 19:11:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Ubuntu 14.04?

The CPU is too damn weak. Your 5-minute load average is at 94%. If you want, I can host you on my cloud and see what happens. I have a server that's almost completely empty, and I'm writing a tutorial on this, so I'm interested in troubleshooting real cases.
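To check that number on your own box, here is a minimal Node sketch. It assumes the load average is read as a share of the available cores, so a load of 0.94 on one core is 94%. `loadPercent` is a helper name made up for this example; `os.loadavg()` and `os.cpus()` are standard Node calls.

```javascript
const os = require("os");

// Express a load average as a percentage of total CPU capacity.
// A load equal to the core count means every core is fully busy.
function loadPercent(loadAvg, coreCount) {
  if (coreCount <= 0) throw new RangeError("coreCount must be positive");
  return (loadAvg / coreCount) * 100;
}

// os.loadavg() returns the 1, 5 and 15 minute averages (all zeros on Windows).
const fiveMinute = os.loadavg()[1];
const cores = Math.max(os.cpus().length, 1);
console.log(
  `5-minute load: ${loadPercent(fiveMinute, cores).toFixed(0)}% of ${cores} core(s)`
);
```

Values above 100% mean processes are queueing for CPU time, which would fit the latency described in this thread.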

Anyhow, there are two possibilities here:

  1. There is some bad logic that causes the CPU spikes. Check where. Most probably, making it asynchronous would help.

  2. It's a memory starvation issue. The CPU usually spikes when there's no memory left, so the garbage collector runs like crazy.

The funny thing? It might not be your fault: someone else's VPS may be starving the IO.

Try running htop and iotop during the spikes, so you can see which is bigger, the IO queue or the CPU queue. Don't rely on memory usage alone, as it can't tell the full story.

If it's the CPU queue, then it's your code's fault. If it's the IO queue, then it's the server's fault.

Yeah, Ubuntu 14.04 x64.

No idea what logic in the app causes the spikes. I haven't worked it out with Kadira yet, and I'm trying to use unblock on method calls and even on publications, using the meteorhacks:unblock package.

The load on the server gets big when a lot of people are visiting it.

This is the bad current situation:

Changing servers looks like a viable option. 20 seconds of latency for a pub/sub is TOO much. Something funny is going on.

The massive latency only happens when my CPU hits 100%.

Wouldn't just deploying more instances solve all the problems?

Unfortunately, that's not guaranteed. If MongoDB is making you wait, or there's some poor code, even 100 instances will still lag.

It is under heavy load at the moment. I'm getting 300 signups a day, and at certain times of day it's obviously a lot busier than at others. The week before it was averaging 180 a day.

Are you running the SEO package with PhantomJS? I've heard of that bogging down servers to the point of crashing when Google starts indexing.

Using spiderable and PhantomJS.

The load on the server is definitely coming from heavy traffic that I'm just not able to deal with. The 100% CPU happens when the most users are online. It's late for my users now, so not many are online and the server is fine for the moment.

Are you using clusters for multi-core support?

I think it's hard to tell without seeing the publication and method code. Maybe you are doing batch inserts, running a blocking operation, and so on.

Thanks for all the comments so far.

This is what I've done, and the server seems to be coping with the stress much better now. I'll see in a few hours whether things really worked.

I upgraded the server to a 2GB RAM, 2-core DO instance. That by itself did nothing for performance, since Meteor won't make use of the second core.

What I did next was:

1 - Added MongoDB indexes for all publications. I had almost no indexes before this (apart from the standard _id indexes).
2 - Improved the publications code. I use the publish-composite package and was calling this.ready in it, which I don't think is supposed to be used with that package (although things were working before, so maybe I'm wrong).
3 - This is the big one: I refactored the code so that I could make use of multiple cores. Certain tasks happen a lot on the site. It's a draft fantasy football game and a lot of timeouts are set on the site. This didn't work well on two instances with how the initial code was set up. After the refactor I used the meteorhacks:cluster package and ran the timeout tasks on only one of the instances. The traffic is routed to both instances though.
4 - I also replaced the mizzao:user-status package with another package from Atmosphere, since user-status apparently doesn't work with multiple instances yet. (I'm not really sure how the mizzao package is so popular if it doesn't support multiple instances. Maybe I misunderstood something, or it's only used in small apps.)
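A rough sketch of the idea in point 3, assuming the instance that owns the timeouts is picked by an environment variable. `RUN_TIMERS`, `isTimerInstance` and `startTimersIfOwner` are names invented for this example. This is not the meteorhacks:cluster API, and the post doesn't say how the instance was actually selected.

```javascript
// Hypothetical convention: exactly one instance is started with RUN_TIMERS=1.
function isTimerInstance(env) {
  return env.RUN_TIMERS === "1";
}

// Run the timer setup only on the owning instance.
// Every instance still serves web traffic; only the scheduling is gated.
function startTimersIfOwner(env, scheduleTimers) {
  if (!isTimerInstance(env)) return false;
  scheduleTimers();
  return true;
}

// Usage: the callback stands in for whatever sets the draft timeouts.
const started = startTimersIfOwner(process.env, () => {
  console.log("scheduling draft timeouts on this instance");
});
if (!started) console.log("timers are handled by another instance");
```

Gating by configuration keeps the code identical on every instance, so the timers can't fire twice for the same draft.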

This is what the Kadira graphs look like now. You can see the switchover in the middle:

Happy to hear people's thoughts on any of the above.

I ended up writing a post on this here:

Thanks for the help
