Help troubleshooting server crash

Hi,

We’ve had 2 weird incidents in production this morning: one at 8:30 and another around 10:45.

We run 4 processes on each of 2 servers using Docker.
Out of those 8 processes, 6 are user-facing and 2 are just runners (for long background tasks).

In our Linux/Docker monitoring tool, we saw 4 of the user-facing processes start to consume more and more memory at the same time, going from around 600 MB to around 10 GB over a 5-10 minute period.

Then the garbage collector kicks in very frequently and the Node processes crash.

The problem is that we have no logs for that period. The processes stop sending logs to AWS CloudWatch (Docker does the sending).

And in MontiAPM, we see nothing strange except that the memory starts to increase.

So we have no idea what’s causing the issue. Is it a user who triggered a very CPU-intensive method or subscription (i.e. a bug)? There’s no way to tell since we have no logs for that period.

Our DB is hosted with Atlas and we see nothing strange on the charts.

In the meantime, we’ve beefed up all of our servers, booted 2 more, and increased the number of processes per server. But we feel the need to understand what happened.

So it just seems like 4 processes going crazy in terms of memory and CPU :wink:

Any idea where to look?

What tools could help us troubleshoot it when it happens?

Best regards,

Burni

Hi, you could use the `--heap-prof` option to extract a heap profile when your server goes down.
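
If passing Node flags is awkward in your setup, here is a rough sketch of the same idea done programmatically with Node’s built-in `inspector` module (the output path and the SIGUSR2 trigger are just examples, adapt them to your app):

```js
const inspector = require('inspector');
const fs = require('fs');

const session = new inspector.Session();
session.connect();

// Start sampling heap allocations (the same kind of data --heap-prof collects).
session.post('HeapProfiler.startSampling');

// Dump the sampling profile whenever the process receives SIGUSR2.
process.on('SIGUSR2', () => {
  session.post('HeapProfiler.stopSampling', (err, result) => {
    if (err) {
      console.error('failed to stop heap sampling', err);
      return;
    }
    // /tmp is just an example; write somewhere your Docker setup persists.
    fs.writeFileSync('/tmp/heap.heapprofile', JSON.stringify(result.profile));
    session.post('HeapProfiler.startSampling'); // resume sampling
  });
});
```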

You could also check your logs for errors, maybe something like “query mismatch”. A single query parameter that is undefined could cause it.
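
For example (hypothetical collection, publication, and field names), a selector that ends up with `undefined` in it can match far more documents than intended and make a publication or method pull a huge result set:

```js
// Hypothetical publication: userId comes from the client.
Meteor.publish('tasksForUser', function (userId) {
  // If userId is undefined, this selector no longer filters by user the way
  // you expect, and the publication may try to observe and send far more
  // documents than intended.
  return Tasks.find({ ownerId: userId });
});

// Validating the parameter avoids that class of problem.
Meteor.publish('tasksForUserChecked', function (userId) {
  check(userId, String); // from the 'check' package; throws if undefined
  return Tasks.find({ ownerId: userId });
});
```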

Feel free to contact us if you need more help: filipe@quave.dev

Thanks!

Hi @filipenevola

Thanks a lot for the tip. We’re putting it in place.

As for the console logs, we don’t see anything. On our server, we have a debug function that outputs the CPU usage and heap size to the console every 5 seconds.
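
For reference, it’s roughly along these lines (simplified, numbers and names are illustrative):

```js
// Periodic debug monitor: log CPU usage and heap size every 5 seconds.
const INTERVAL_MS = 5000;
let lastCpu = process.cpuUsage();

setInterval(() => {
  const cpu = process.cpuUsage(lastCpu); // delta since last sample, in microseconds
  lastCpu = process.cpuUsage();

  // Percentage of one core over the interval (can exceed 100% with extra threads).
  const cpuPercent = ((cpu.user + cpu.system) / 1000 / INTERVAL_MS) * 100;
  const heapUsedMb = process.memoryUsage().heapUsed / 1024 / 1024;

  console.log(`cpu=${cpuPercent.toFixed(0)}% heapUsed=${heapUsedMb.toFixed(0)}MB`);
}, INTERVAL_MS);
```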

What we see is that the CPU suddenly goes to 200% without any change to the heap. We see that for 15-20 seconds (the line is printed 3-4 times), and then nothing else is output to the console until the process crashes a couple of minutes later because the heap starts climbing to 10 GB.

So the initial symptom we see is a CPU usage increase.

Is there anything we could add that would print a console log for every method called?
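
Something roughly like this is what I have in mind (just a sketch; it relies on Meteor’s internal `method_handlers`, so I’m not sure it’s a good idea):

```js
// Rough idea (server only): wrap every registered method handler so each
// call logs the method name before running the original handler.
// Meteor.server.method_handlers is an internal API, so this may break.
const handlers = Meteor.server.method_handlers;

Object.keys(handlers).forEach((name) => {
  const original = handlers[name];
  handlers[name] = function (...args) {
    console.log(`[method] ${name} called by ${this.userId || 'anonymous'}`);
    return original.apply(this, args);
  };
});
```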

Thanks a lot!

Burni

Filipe,

The server has not crashed yet, but we’re simulating the issue in a test system with a loop that just adds objects to an array.
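
The simulation is roughly this (simplified):

```js
// Simplified version of our test loop: keep pushing objects into an array
// that is never released, so the heap grows until the process dies.
const leak = [];

setInterval(() => {
  for (let i = 0; i < 100000; i++) {
    leak.push({ createdAt: new Date(), payload: 'x'.repeat(1024) });
  }
}, 10);
```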

1- So far, we have not had any luck getting heap profiles to work.

`--heap-prof` does not save its output when the process crashes like that.

We’ve also tried `--heapsnapshot-signal=SIGUSR2` to manually trigger a heap snapshot, and it works as long as the CPU is not at 200% (we’re also considering a threshold-based fallback, sketched at the end of this post).

2- What is strange is that when we hit the real issue, the CPU goes to 200% and then memory starts increasing. In our simulation, the CPU maxes out at 100% (1 thread). That could be a hint toward finding what would be using 2 threads (I guess?).
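
For point 1, the fallback we’re considering is to write the snapshot ourselves as soon as the heap crosses a threshold, instead of waiting for a signal at crash time (assumption: the event loop still gets a turn while memory is climbing; the threshold is arbitrary):

```js
const v8 = require('v8');

const THRESHOLD_BYTES = 2 * 1024 * 1024 * 1024; // 2 GB, arbitrary
let snapshotTaken = false;

setInterval(() => {
  const { heapUsed } = process.memoryUsage();
  if (!snapshotTaken && heapUsed > THRESHOLD_BYTES) {
    snapshotTaken = true;
    // Blocks the event loop while writing, and the file can be large.
    const file = v8.writeHeapSnapshot();
    console.log(`heap snapshot written to ${file}`);
  }
}, 5000);
```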

Regards,

Burni