[SOLVED] Meteor server memory leaks / profiling

bratelefant · June 9, 2022, 2:09pm

Hi all,

I have a Meteor 2.7.2 app deployed on an AWS instance. It is used constantly during the day, but simultaneous sessions rarely exceed 40-50. However, server memory increases steadily over the day from ~230 MB up to ~512 MB or even more (currently I’m using the --max-old-space-size=512 --gc_interval=100 node options for testing purposes). When the 512 MB limit is exceeded, gc kicks in, tons of fibers are activated (~4k-5k), CPU usage goes through the roof, and everything lags.

Frontend is react + mui; I assume that subscriptions get automatically stopped when components with useTracker are unmounted.

I’m using Monti APM for monitoring. Only thing correlated to rising memory consumption are “Active Handles” (System), rest seems normal to me.

I really don’t understand how all these things could be related, due to my lack of knowledge of Node.js. Questions I have:

1.) What could cause sudden massive fiber generation? I use this.unblock() on most methods and tried it temporarily on subscriptions, but I wouldn’t suspect that user actions could have caused this.
2.) Is gc related to fibers?
3.) What does “Active Handles” (in Monti APM → System) really mean? Is this the same as handles coming from subscriptions or is this a node <-> mongo thing?
4.) Is there another option for server / nodejs profiling different from Monti APMs server profiling?

Sorry, many noob questions, but any hint is highly appreciated.

EDIT: Found the tooltip in Monti APM explaining “active handles” a bit (…actually, in Safari the tooltip’s not showing xD), so I’ll have to look at file handles, connections, etc. that were opened but never closed. Are there any “usual suspects” to look at?
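
For the “usual suspects” question, here is a rough sketch of how the handle count could be broken down by type from inside the server process. It relies on process._getActiveHandles(), which is an undocumented Node.js internal and may change between versions. summarizeProcess is a made-up helper name, not part of Meteor or Monti APM.

```javascript
// Sketch: group the process's active handles by constructor name.
// Assumption: process._getActiveHandles() is an undocumented Node.js internal,
// so the code falls back to an empty list if it is missing.
function summarizeProcess() {
  const handles = typeof process._getActiveHandles === 'function'
    ? process._getActiveHandles()
    : [];

  const handlesByType = {};
  for (const handle of handles) {
    const type = (handle && handle.constructor && handle.constructor.name) || 'unknown';
    handlesByType[type] = (handlesByType[type] || 0) + 1;
  }

  const { rss, heapUsed } = process.memoryUsage();
  const toMb = (bytes) => Math.round(bytes / 1024 / 1024);

  return {
    rssMb: toMb(rss),
    heapUsedMb: toMb(heapUsed),
    handleCount: handles.length,
    handlesByType
  };
}

// Possible usage: log a snapshot once per minute and watch which type keeps growing.
// const timer = setInterval(() => console.log(summarizeProcess()), 60 * 1000)
```

If one type (for example sockets) grows over the day while the number of sessions stays flat, that is probably the place to start looking.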

bratelefant · June 10, 2022, 3:31am

OK, I guess I found the leak: some LDAP clients on the server were never destroyed and kept adding up. However, questions 1, 2 and 4 still remain.
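
A minimal sketch of one way to make sure such clients are always released, assuming the client is only needed for the duration of a single task. withClient, createClient and close are hypothetical names; the real calls depend on the LDAP library in use, which the post does not name.

```javascript
// Sketch: run a task with a short-lived client and always release it afterwards.
// `createClient` and `close` are hypothetical placeholders for whatever the
// LDAP library provides to open and tear down a client.
async function withClient(createClient, close, task) {
  const client = await createClient();
  try {
    return await task(client);
  } finally {
    // Runs on success and on error, so the socket never outlives the task.
    await close(client);
  }
}
```

The point of the try/finally is that an exception inside the task (a failed bind, a timeout) still closes the client instead of leaving one more open handle behind.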

bratelefant · June 11, 2022, 7:08am

For later reference: Uncleared Intervals in custom publications are also a thing…
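
A small sketch of what this looks like in a custom publication, assuming a handler that pushes a timestamp every second. The publication name serverTime and the handler name are made up for the example. Plain setInterval is used so the handler can be tried outside Meteor; inside a real app Meteor.setInterval / Meteor.clearInterval would be the usual choice.

```javascript
// Sketch of a custom publication handler that pushes a timestamp every second.
// In a Meteor app it would be registered with:
//   Meteor.publish('serverTime', publishServerTime)
function publishServerTime() {
  const sub = this; // Meteor calls publish handlers with the subscription as `this`

  sub.added('serverTime', 'now', { at: new Date() });

  const timer = setInterval(() => {
    sub.changed('serverTime', 'now', { at: new Date() });
  }, 1000);

  // Without this, the interval (and everything it closes over) keeps running
  // after the client has unsubscribed or disconnected.
  sub.onStop(() => clearInterval(timer));

  sub.ready();
}
```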

bratelefant · August 13, 2022, 7:37am

Although I thought this was solved, I’d like to reopen this thread. With what I posted above I was able to reduce the memory leak quite a bit.

Anyway, I’m still facing the phenomenon that during a day (several sessions by users) the number of “Active handles” stays at a higher level (~80-90 sometimes), along with memory usage. It will not drop until I reboot my containers (after which active handles are at around 27-28).

wtfnode shows that most of the open handles are Server connections, cf. below. The xxxed out server is an LDAP, the rest seems to be local mongodb connections.

My questions are:

What are possible causes for meteor to keep these connections open?
Is this normal behaviour by meteor?
If yes: What’s the limit? And:
Can I somehow manually keep track / close unneeded connections to free some memory?

I’d be really grateful for any hint!

 Servers:
2022-08-13 09:03:49.477339507 +0200 CEST [web-2] - 172.17.0.111:26762 -> 10.0.0.128:35636
2022-08-13 09:03:49.475801282 +0200 CEST [web-2] - 172.17.0.111:45206 -> x.x.x.x:636
2022-08-13 09:03:49.474914402 +0200 CEST [web-2] - 172.17.0.111:44206 -> 10.0.0.25:37999
2022-08-13 09:03:49.474124152 +0200 CEST [web-2] - 172.17.0.111:44202 -> 10.0.0.25:37999
2022-08-13 09:03:49.473322010 +0200 CEST [web-2] - 172.17.0.111:44194 -> 10.0.0.25:37999
2022-08-13 09:03:49.472609090 +0200 CEST [web-2] - 172.17.0.111:26762 -> 10.0.0.48:56456
2022-08-13 09:03:49.471857712 +0200 CEST [web-2] - 172.17.0.111:26762 -> 10.0.0.141:58640
2022-08-13 09:03:49.471168109 +0200 CEST [web-2] - 172.17.0.111:39080 -> 10.0.0.25:37999
2022-08-13 09:03:49.470335315 +0200 CEST [web-2] - 172.17.0.111:39046 -> 10.0.0.25:37999
2022-08-13 09:03:49.469494542 +0200 CEST [web-2] - 172.17.0.111:33400 -> 10.0.0.25:37999
2022-08-13 09:03:49.467976080 +0200 CEST [web-2] - 172.17.0.111:33394 -> 10.0.0.25:37999
2022-08-13 09:03:49.467260267 +0200 CEST [web-2] - 172.17.0.111:33370 -> 10.0.0.25:37999
2022-08-13 09:03:49.466410955 +0200 CEST [web-2] - 172.17.0.111:33352 -> 10.0.0.25:37999
2022-08-13 09:03:49.465636330 +0200 CEST [web-2] - 172.17.0.111:33340 -> 10.0.0.25:37999
2022-08-13 09:03:49.464789173 +0200 CEST [web-2] - 172.17.0.111:50222 -> 10.0.0.25:37999
2022-08-13 09:03:49.464004508 +0200 CEST [web-2] - 172.17.0.111:50210 -> 10.0.0.25:37999
2022-08-13 09:03:49.463136698 +0200 CEST [web-2] - 172.17.0.111:50206 -> 10.0.0.25:37999
2022-08-13 09:03:49.462268680 +0200 CEST [web-2] - 172.17.0.111:50196 -> 10.0.0.25:37999
2022-08-13 09:03:49.460055115 +0200 CEST [web-2] - 172.17.0.111:55848 -> 10.0.0.25:37999
2022-08-13 09:03:49.459172414 +0200 CEST [web-2] - 172.17.0.111:55840 -> 10.0.0.25:37999
2022-08-13 09:03:49.457973604 +0200 CEST [web-2] - 172.17.0.111:55832 -> 10.0.0.25:37999
2022-08-13 09:03:49.456964781 +0200 CEST [web-2] - 172.17.0.111:55826 -> 10.0.0.25:37999
2022-08-13 09:03:49.456258990 +0200 CEST [web-2] - 172.17.0.111:55810 -> 10.0.0.25:37999
2022-08-13 09:03:49.455482942 +0200 CEST [web-2] - 172.17.0.111:55798 -> 10.0.0.25:37999
2022-08-13 09:03:49.454680597 +0200 CEST [web-2] - 172.17.0.111:55790 -> 10.0.0.25:37999
2022-08-13 09:03:49.453914027 +0200 CEST [web-2] - 172.17.0.111:55782 -> 10.0.0.25:37999
2022-08-13 09:03:49.452069079 +0200 CEST [web-2] - 172.17.0.111:55774 -> 10.0.0.25:37999
2022-08-13 09:03:49.451211030 +0200 CEST [web-2] - 172.17.0.111:55764 -> 10.0.0.25:37999
2022-08-13 09:03:49.450540856 +0200 CEST [web-2] - 172.17.0.111:55752 -> 10.0.0.25:37999
2022-08-13 09:03:49.449727345 +0200 CEST [web-2] - 172.17.0.111:55746 -> 10.0.0.25:37999
2022-08-13 09:03:49.449010676 +0200 CEST [web-2] - 172.17.0.111:55732 -> 10.0.0.25:37999
2022-08-13 09:03:49.448217063 +0200 CEST [web-2] - 172.17.0.111:55722 -> 10.0.0.25:37999
2022-08-13 09:03:49.447414125 +0200 CEST [web-2] - 172.17.0.111:53352 -> 10.0.0.25:37999
2022-08-13 09:03:49.438293346 +0200 CEST [web-2] - Sockets:
2022-08-13 09:03:49.442220879 +0200 CEST [web-2] - 172.17.0.111:53314 -> 10.0.0.25:37999
2022-08-13 09:03:49.443952701 +0200 CEST [web-2] - 172.17.0.111:53336 -> 10.0.0.25:37999
2022-08-13 09:03:49.443163270 +0200 CEST [web-2] - 172.17.0.111:53320 -> 10.0.0.25:37999
2022-08-13 09:03:49.444757112 +0200 CEST [web-2] - 172.17.0.111:53344 -> 10.0.0.25:37999
2022-08-13 09:03:49.446666906 +0200 CEST [web-2] - 172.17.0.111:53350 -> 10.0.0.25:37999
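
On the question of keeping track of connections: a minimal sketch of a registry of open DDP client connections, built on Meteor.onConnection and connection.onClose. It only tracks client sessions, not the MongoDB driver’s own sockets, so it can help correlate the handle count with connected clients but does not close database connections by itself. trackConnections is a hypothetical helper, and meteor is passed in as a parameter so the function can be tried outside an app.

```javascript
// Sketch: keep a registry of currently open DDP client connections.
// In an app, call trackConnections(Meteor) once on the server at startup.
function trackConnections(meteor) {
  const open = new Map(); // connection id -> { address, since, close }

  meteor.onConnection((connection) => {
    open.set(connection.id, {
      address: connection.clientAddress,
      since: new Date(),
      close: () => connection.close() // lets server code drop a session manually
    });
    // Remove the entry again when the client disconnects or is closed.
    connection.onClose(() => open.delete(connection.id));
  });

  return open;
}
```

Logging open.size next to the “Active handles” number over a day should show whether the two actually move together.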

rjdavid · August 15, 2022, 4:17am

One possible cause is idle clients still holding the connection (modern browsers should have disconnected after 5 minutes, but old browsers still exist). Once we detected that the browser tab was no longer displayed, we started a 60-second timer and then explicitly called Meteor.disconnect() from the client. We call Meteor.reconnect() once the tab is visible again.

bratelefant · August 17, 2022, 7:24am

Sounds like a promising approach, although I assume that most of our users are on apps. @rjdavid How do you detect a non-displayed tab? Via document.hidden?

rjdavid · August 17, 2022, 7:41am

Use the events resume and pause for mobile apps to disconnect and reconnect. Apps, if I remember correctly, hold connections even in the background.
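
A minimal sketch of that idea for Cordova, assuming the listeners are attached after Cordova’s deviceready event has fired. disconnectWhileBackgrounded is a hypothetical helper; doc would be the page’s document and connection would be Meteor, both passed in so the function can be tried in isolation.

```javascript
// Sketch: drop the DDP connection while a Cordova app is in the background.
// `doc` is expected to be `document`, `connection` is expected to be `Meteor`.
function disconnectWhileBackgrounded(doc, connection) {
  const onPause = () => connection.disconnect();  // app moved to the background
  const onResume = () => connection.reconnect();  // app is in the foreground again

  doc.addEventListener('pause', onPause, false);
  doc.addEventListener('resume', onResume, false);

  // Return a function that detaches the listeners again.
  return () => {
    doc.removeEventListener('pause', onPause, false);
    doc.removeEventListener('resume', onResume, false);
  };
}

// Possible usage on the client:
// disconnectWhileBackgrounded(document, Meteor)
```

The sketch disconnects immediately on pause rather than after a delay, because timers are not guaranteed to keep running once the app is backgrounded.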

paulishca · August 17, 2022, 1:36pm

Hi @bratelefant,

this is the relevant code for “smart disconnect”. At the commented link you also have the listeners for Cordova.

import { Meteor } from 'meteor/meteor'
let disconnectTimer = null

// TODO update from https://github.com/mixmaxhq/meteor-smart-disconnect/blob/master/disconnect-when-backgrounded.js for Cordova

// 60 seconds by default
const disconnectTime = ((Meteor.settings?.public?.disconnectTimeSec) || 60) * 1000

const removeDisconnectTimeout = () => {
  if (disconnectTimer) {
    clearTimeout(disconnectTimer)
  }
}

const createDisconnectTimeout = () => {
  removeDisconnectTimeout()

  disconnectTimer = setTimeout(() => {
    Meteor.disconnect()
  }, disconnectTime)
}

const disconnectIfHidden = () => {
  removeDisconnectTimeout()

  if (document.hidden) {
    createDisconnectTimeout()
  } else {
    Meteor.reconnect()
  }
}

Meteor.startup(disconnectIfHidden)

document.addEventListener('visibilitychange', disconnectIfHidden)

bratelefant · August 17, 2022, 3:50pm

@paulishca that’s really great, thank you! I’ll test this and check Monti APM to see if the behavior changes. Indeed, on my iOS devices my apps are listed with a noticeable amount of background activity, which may be related.

rjdavid · August 17, 2022, 8:35pm

That old smart-disconnect code no longer works because of the updates in the Page Visibility API. That code will not reconnect successfully if the hidden tab was frozen.

Read my post here about that package

bratelefant · August 19, 2022, 8:23am

Thanks for all the great replies to my questions. I can report that in my production env the above package had a measurable impact on limiting open handles and memory usage. I assume that, even if Chrome might not be supported, mobile devices seem to disconnect after the timeout, resulting in fewer open MongoDB connections.

paulishca · August 19, 2022, 9:35am

Do not worry, it is supported. I’ve been using it for a couple of years with no issues.

Also if you want to display a bar or something in the UI for the users, I do this for mobile (Cordova). This is React but can be adapted to any frontend for sure. It tells the user … stop pressing all those buttons cause you are not even online.

With a similar component, you can let the user know that the session has been disconnected (by the smart disconnect) and hey… you are back online. When you come back to the tab (your situation now is that you have been disconnected by smart disconnect), it may take 1-2 seconds to reconnect once the window is in focus. However, if it takes longer than CONNECTION_ISSUE_TIMEOUT, you can show in the UX that the user is disconnected. The point here being that you can “invisibly” disconnect without the user knowing and once in focus they will never know they’ve been disconnected, or you can let the user know that the session is now reconnected, or you can let them know … pretty much every state.

import React, { useEffect, useState } from 'react'
import { useSelector } from 'react-redux'
import { Meteor } from 'meteor/meteor'

const CONNECTION_ISSUE_TIMEOUT = 3000

export default function Connected () {
  const { connected } = useSelector(({ user }) => user) // this comes from Meteor.status().status
  const [showConnectionIssue, setShowConnectionIssue] = useState(false) // local view state

  useEffect(() => {  // similar to onRendered in blaze
    // only show the banner if we are still not connected after the grace period
    const timer = setTimeout(() => {
      setShowConnectionIssue(true)
    }, CONNECTION_ISSUE_TIMEOUT)

    const handleChange = () => {
      if (window.navigator.onLine) {
        Meteor.reconnect()
      } else {
        Meteor.disconnect()
      }
    }

    window.addEventListener('offline', handleChange) // add listeners
    window.addEventListener('online', handleChange)
    return () => {
      clearTimeout(timer) // avoid a state update after unmount
      window.removeEventListener('offline', handleChange) // remove listeners
      window.removeEventListener('online', handleChange)
    }
  }, [])

  return <div>
    {showConnectionIssue && !connected &&
      <div style={{ /* some style */ }} className='animated bounceUp'>connecting ...</div>}
  </div>
}

rjdavid · August 19, 2022, 9:36am

You just need to confirm the behavior. The worst-case scenario is a user trying to use your app again and being left with a non-functioning app because it is already disconnected. That will be a bad experience. One way to confirm is to put the app/browser in the background and return to it after 10 minutes. Check if the functionalities are still working.

rjdavid · August 19, 2022, 9:39am

@paulishca unfortunately, you have users who will not be able to reconnect after their tabs have been frozen. I have confirmed that for multiple apps using Meteor.

paulishca · August 19, 2022, 9:40am

you must be doing something wrong :).

rjdavid · August 19, 2022, 9:42am

Haha. Good luck to your users

paulishca · August 21, 2022, 5:57pm

Could your experience be related to this?! In fact, I try to keep my projects as publication-free as possible.

Initial implementation of DDP resumption by znewsham · Pull Request #11559 · meteor/meteor · GitHub

rjdavid · August 22, 2022, 5:11am

It is about the difference between the Page Visibility API and the Page Lifecycle API

The Page Visibility API only uses the event visibilitychange. The Page Lifecycle API added several other events: resume, freeze, pagehide, pageshow, etc.

And with the recent issues of browsers hogging memory, browsers started to freeze and discard tabs to save memory (freezing is also used with bfcache). A frozen page can go back to active status without going through a visibilitychange event. In this case, the app is already visible to the user, but Meteor.reconnect() is not called because the visibilitychange event is never fired.

The solution requires a combination of handling resume or pageshow plus checking the status of the page during these events.

Page Lifecycle as implemented by Chrome
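
A rough sketch of the combination described above: listen to visibilitychange, resume and pageshow, and on each of them check the page and connection status before reconnecting. reconnectOnReturn is a hypothetical helper, not part of any package mentioned in this thread, and it has not been verified against frozen tabs. doc, win and connection would be document, window and Meteor in an app.

```javascript
// Sketch: reconnect whenever the page becomes usable again, whichever event reports it.
// `doc` is expected to be `document`, `win` to be `window`, `connection` to be `Meteor`.
function reconnectOnReturn(doc, win, connection) {
  const ensureConnected = () => {
    if (doc.hidden) return; // still in the background, nothing to do
    if (!connection.status().connected) {
      connection.reconnect();
    }
  };

  doc.addEventListener('visibilitychange', ensureConnected); // Page Visibility API
  doc.addEventListener('resume', ensureConnected);           // Page Lifecycle: frozen -> active
  win.addEventListener('pageshow', ensureConnected);         // e.g. restored from bfcache

  // Return a function that detaches the listeners again.
  return () => {
    doc.removeEventListener('visibilitychange', ensureConnected);
    doc.removeEventListener('resume', ensureConnected);
    win.removeEventListener('pageshow', ensureConnected);
  };
}

// Possible usage on the client:
// reconnectOnReturn(document, window, Meteor)
```

This only covers the reconnect side; the timed disconnect for hidden tabs would still come from the smart-disconnect code posted earlier.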

paulishca · August 22, 2022, 11:46am

Ok, I thought you were referring to something more like data inconsistency or not reconnecting subscriptions etc. Ok, to be frank I don’t seem to have a problem with the browser API so far (nor Cordova).

bratelefant · August 22, 2022, 2:50pm

I did some testing. So far I could not produce any reconnection issues with revisited tabs in Chrome or Safari. Memory usage is now much better. However, on longer-running hosts, open DB connections still pile up a bit (up to ~40), which may be related to outdated clients from the App Store / Play Store that no longer receive hot code pushes.

EDIT: last but not least: thanks for the great support. All answers are really valuable (literally; I can potentially scale my containers down due to better memory management).

This type of connection management is one of several features that I’d really appreciate being part of Meteor core.