Mystery - App suddenly stopped working in production (mostly) then as suddenly started working ok again. Why?

No updates to the code, yet all of a sudden our Meteor app has stopped working in production.

Running the same code on our local dev machine, connected to the production DB, works fine.

I’m at a loss on how to debug this/what has gone wrong.

Have this on the console, but have figured out that the message is being caused by the Helpscout widget (probably because the rest of the site is not loading).

The only clue I get is from the console:

GET https://www.app.com.au/efcf50a74ca8288ef635696abfd391cff690b3e2.js?meteor_js_resource=true net::ERR_HTTP2_PROTOCOL_ERROR 200

Anyone have any pointers on what may be the issue or how I should go about debugging this?

Thanks in advance! Hard to keep calm when something like this happens.

Update - Console message not helpful

Looks like there are multiple reasons why it can happen:

My guys at Astraload can take a look at it for you if you want.

@eluck

Thanks, but we dug in and realised that error is being generated by the Helpscout widget :confused:

Just to clarify: the problem is still there and our app is still down; it's just that the console error that may have been a clue is being generated by a third party.

We’ve been trying to debug (unsuccessfully)

Updates on some weird things we have discovered:

  • Number of connections has dropped to about 25% of normal load
  • Open connections are able to continue using the application successfully
  • New connections have sporadically (in the last few minutes) started to work, but very slowly
  • RAM and CPU usage according to Galaxy has been very low (to the point where it keeps trying to scale down while we manually scaled up in case it was a load issue)

Still at a complete loss as to what is causing this.

I have contacted Galaxy support in case this is something on their end but I think there would have been other threads here by this point if that was the case.

Another update:

App is suddenly back to its normal performance levels, i.e. it loads instantly vs. waiting about 60 seconds (after which it may or may not load).

Again nothing has changed?

The only two theories I have are:

  1. There was some kind of bottleneck that Galaxy CPU/RAM stats don’t show
  2. There was something malicious going on?? Database events were flat until the incident started, after which they dropped (because a lot of users couldn't connect, I assume)

I’m still lost for any concrete information/ideas on how to figure out what may have happened.

Hi @hemalr87, I suggest you send an email to our support team at support@meteor.com

Hi,

Any new information on the cause?

How long did it last?

Nope. I’m stumped.

Went for 5ish hours.

We’ve had a similar problem once. Ours was not hardware related.

It was difficult to detect because it was a single client calling thousands of methods per minute non-stop, and that method did not touch the DB, so it had no impact on the DB whatsoever. It only impacted the CPU and RAM of the server.

Our only bandaid was to call the customer and ask them to close their browser. How shameful that was. :wink:

The long-term solution was to fix the bug and also implement DDPRateLimiter in case there were other similar bugs in the code.
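
In case it helps anyone reading along, a minimal sketch of what such a rule looks like on the server (the method name and limits below are placeholders for illustration, not our real values):

```js
import { DDPRateLimiter } from 'meteor/ddp-rate-limiter';

// Server-side rule: limit each connection to 20 calls per second of this method.
// 'orders.update' is a made-up method name used purely for illustration.
DDPRateLimiter.addRule({
  type: 'method',
  name: 'orders.update',
  connectionId() { return true; }, // bucket calls per connection
}, 20, 1000); // allow 20 calls per 1000 ms
```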

If you find anything, let us know. It's always good to share those.

It was difficult to detect because it was a single client calling thousands of methods per minute non-stop, and that method did not touch the DB, so it had no impact on the DB whatsoever. It only impacted the CPU and RAM of the server.

Interesting. We have rate limits on all our methods, so I don't think it's that, but I'm not 100% sure. Did you rate limit your method calls?

Our only bandaid was to call the customer and ask them to close their browser. How shameful that was. :wink:

Couldn’t you have logged them out?

If you find anything, let us know. It's always good to share those.

Definitely.

My latest thinking is that it isn't something malicious but more likely something to do with subscription count functions causing a memory leak. It's just a hunch though, and it doesn't fully make sense, because the memory of the container in question should then climb towards the 100% mark, which none of the running containers did while the issue was going on.
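
To show the kind of thing I'm suspicious of, here is a rough sketch (the collection and names are made up, not our actual code) of a count publication where forgetting the onStop cleanup would leave one live observer per subscription on the server:

```js
// Hypothetical count publication, loosely based on Meteor's counts-by-room example.
// 'Orders' and the publication name are placeholders for illustration.
Meteor.publish('orders.count', function () {
  let initializing = true;
  let count = 0;

  const handle = Orders.find({ status: 'open' }).observeChanges({
    added: () => {
      count += 1;
      if (!initializing) this.changed('counts', 'orders', { count });
    },
    removed: () => {
      count -= 1;
      this.changed('counts', 'orders', { count });
    },
  });

  // Send the initial count once the observer has processed the existing documents.
  initializing = false;
  this.added('counts', 'orders', { count });
  this.ready();

  // The easy part to forget: without this, the observer (and its memory)
  // lives on after the client unsubscribes.
  this.onStop(() => handle.stop());
});
```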

We did implement method limiting after that incident.

And we could not log the user out because we're not using Meteor's user management, for historical reasons. We did create a way to log them out afterwards :wink:

Another interesting event we had this year was memory related. We do not use Galaxy but our own instances on AWS, with multiple processes running on the same machine, each for a different traffic type. What happened is that we moved too many users to the same process at the same time. That process's memory increased but was still way below the instance's memory.

What we understood later is that Node sets a memory limit for each process, and as you approach it, starting at something like 70% of that limit, the garbage collector runs more frequently and takes more time to run. Towards 95%, a GC run can take 2 minutes and 500% CPU, and then Node crashes with an out-of-memory error.
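
If you want to keep an eye on this yourself, here is a rough sketch (the threshold is arbitrary) using Node's built-in v8 module to log how close the process is to its heap limit; the limit itself can be raised with the --max-old-space-size flag when starting Node:

```js
const v8 = require('v8');

// Periodically check heap usage against the V8 heap limit.
// Warn above 70% (an arbitrary threshold), where heavy GC activity tends to start.
setInterval(() => {
  const { used_heap_size, heap_size_limit } = v8.getHeapStatistics();
  const ratio = used_heap_size / heap_size_limit;
  if (ratio > 0.7) {
    console.warn(
      `Heap at ${(ratio * 100).toFixed(1)}% of limit (` +
      `${Math.round(used_heap_size / 1e6)} MB of ${Math.round(heap_size_limit / 1e6)} MB)`
    );
  }
}, 60 * 1000);
```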

That does not seem your case as you would have seen that from the CPU/Memory charts in Galaxy.

Do you use an APM to monitor more precise metrics (time per method, time per subscription, how many subscriptions, etc.)?

We use Monti APM, which works really well for identifying bottlenecks.

Regards,
Burni

  1. How long was the app running?
  2. What version of Meteor are you using?

I asked because there was a thread related to a memory leak in Meteor 2.8.

  1. How long was the app running?

You mean how long it was running properly before this incident happened? It had been 2+ weeks since the previous deploy, and years of stable running in general. No major code changes in about 18 months.

  1. What version of Meteor are you using?

2.6.1 - so not related to the memory leak in question but thanks for the heads up, I’ll keep that in mind for when we upgrade.

Interesting, and thanks for sharing! This kind of stuff can be very helpful.

I have had the memory issue you mention before, in another application, and remember setting the garbage collector manually to resolve it. And you are right, that showed very clearly on the Galaxy charts, so it's not the case here for this incident.

Do you use an APM to monitor more precise metrics (time per method, time per subscription, how many subscriptions, etc.)?

We use Monti APM, which works really well for identifying bottlenecks.

We are subscribed to Meteor's APM, although I haven't found anything by analysing the data from there. Admittedly, that is a weakness of mine and something I need to work on.

On that note, if any Meteor guns are reading this, I would happily pay for some detailed tutorials on using APM. I know Arunoda had some articles from when he originally started Kadira, but I'm not sure how up to date those still are, and I think this kind of thing, covered in more depth, would be very helpful.

The guys from Meteor can look at your APM and give you guidance.

We (the Astraload team) have a ton of experience with Monti APM (previously Kadira).

When it was open-sourced by Arunoda, we maintained our own version of it for a couple of years, adding new features: GitHub - astraload/meteor-apm-server: Source code to Meteor APM. Eventually we dropped it in favour of Monti APM.

So please send direct messages if you need help with Monti APM or Meteor webapp performance, which is our speciality. (We even created a load-testing SaaS for Meteor and GraphQL: https://app.astraload.com)

Yup, spoke to them as well. No insights so far, but I'll update here if anything comes up. :+1: