Scaling MeteorJS > 7,000 concurrent connections

pasharayan · June 23, 2020, 11:31am

Hi everyone, our startup (https://www.insidesherpa.com) is having issues scaling on Meteor at the moment. We’re on Meteor Galaxy with more than 7,000 concurrent users, but our app is becoming unusable even though we’re running 20 containers.

The Galaxy team isn’t being super responsive or helpful with us (update: read why here) - so we’d love to get some help to fix this problem. Has anyone faced this issue before?

Feel free to email us at pasha@insidesherpa.com, joe@insidesherpa.com and julius@insidesherpa.com

pasharayan · June 23, 2020, 11:36am

As an FYI, we’re happy to bring someone on as a contractor to help us solve this!

alawi · June 23, 2020, 12:46pm

Nice app, and congrats on the momentum. 7k concurrent sessions is a good crowd; it seems that you’re overwhelming the servers.

Since this is a free app, I think you’ll need to minimize the consumption of resources per session as much as possible.

I don’t see much need for real-time when I used the app, so perhaps you can start by minimizing pub/sub, are these subscriptions really needed?

virtualInternships/getIndex
seoByRouteName
publicProfile/getByUserId
notifications/get
enrolments/getOne

Just quick feedback.

captainn · June 23, 2020, 1:36pm

Echoing what @alawi said, you’ll want to minimize the use of pub/sub as much as possible. I tend to deliver almost all of my data over methods.

You might also look at redis oplog, and see if that can help.

pasharayan · June 23, 2020, 2:09pm

Thanks @alawi and @captainn! We’ve started using pub-sub-lite (https://github.com/adtribute/pub-sub-lite) as a way of reducing our use of full pub/sub, and we’re seeing some small wins there now. We’re already using redis oplog, so it looks like the next step for us is to migrate as many pub/subs as we can to pub-sub-lite.

Still looking for other ways to manage performance.

captainn · June 23, 2020, 2:55pm

There are more drastic things you can do too, like switching over to methods for data, then using something like simple:rest to convert your DDP requests to REST requests. Once you do that, you can use cloudfront or similar to cache those requests, and reduce the pressure on your node.js server.
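A minimal sketch of the caching half of that idea, assuming the REST responses pass through a handler you control. A CDN such as CloudFront can take its cache lifetime from the Cache-Control header on the origin response. The helper name `cacheHeadersFor` and the 60-second lifetime are hypothetical and are not part of simple:rest.

```javascript
// Hypothetical helper: builds response headers for a REST endpoint that
// sits behind a CDN. Not part of simple:rest or any Meteor package.
function cacheHeadersFor({ isUserSpecific, maxAgeSeconds }) {
  if (isUserSpecific) {
    // Per-user data must never be stored in a shared cache.
    return { 'Cache-Control': 'private, no-store' };
  }
  // Public data: any cache (CDN or browser) may reuse it for maxAgeSeconds.
  return { 'Cache-Control': `public, max-age=${maxAgeSeconds}` };
}

// Example: a public course index cached for one minute (illustrative value).
const publicHeaders = cacheHeadersFor({ isUserSpecific: false, maxAgeSeconds: 60 });
const privateHeaders = cacheHeadersFor({ isUserSpecific: true, maxAgeSeconds: 60 });
```

Only responses that are identical for every visitor are worth caching this way; anything tied to the logged-in user should stay private.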

Edit: maestroqadev:pub-sub-lite is a great find! I’m going to start using that immediately for one of my hobby projects!

waldgeist · June 23, 2020, 9:10pm

That’s interesting. I always thought redis oplog would be the mystery weapon to solve all Meteor scaling problems?

BTW: Cool concept. I just enrolled in the NSW startup class, maybe I can learn something

alawi · June 23, 2020, 9:41pm

I think redis-oplog still consumes memory from mergeBox.

There is no bulletproof solution at large scale; memory and CPU will be used at some point, unless the entire infrastructure is outsourced.

With that said, I think they need to convert the unnecessary pub/sub with pub-sub-lite to methods, which looks like a good start.

mullojo · June 24, 2020, 1:29am

@pasharayan it’s hard to give any specific advice without knowing how your app is designed & the choices you’ve made for the processing work your app needs to do. I can give general advice for things that would help everyone’s app. You’ve probably done many of these already, but I’m just listing them for everyone’s benefit and the chance that you might find some helpful.

Here is a list of some things that put load on your server cluster:

JS bundle sending on new client connections & client refreshes (improve by adding a ServiceWorker.js file to cache your JS assets, 1 hr of effort)
Heavy data over pub/sub (DDP) (make sure you are examining the data moving over pub/sub with Meteor APM, try to root cause the biggest resource drains to optimize your app, a typical Meteor app without issues can handle many more concurrent connections)
Multi-user update loops, data change by 1 user causes 7000+ other users to get the change (alter your internal app design if you have this type of thing going on, just be aware of how data & updates propagate through your app, map this out in a visual tool to really make sure you know what is going on)
Using your Server to do CDN type workloads, like hosting images & video files (make sure these bandwidth heavy tasks like image hosting, video hosting, etc. are moved to a CDN of choice)
Over-using the server when you could put some user specific processing on the client alone (the client’s browser can handle a lot of processing that you might consider doing on the server, make sure you balance the workload)
MongoDB queries where too much processing is done on the Server (remember MongoDB has many advanced query types where processing loads can be handled by your MongoDB cluster, make good design choices & optimize where you can)
Not using enough Async/Await code on the Server (make sure you don’t have functions waiting for other functions that are holding up your server processing, optimize this in your app)
Wasteful processing from too many timers driven by client actions (don’t use timing delay functions in your code if you have many clients, this is usually also fixed with proper use of async/await in your app)
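To make the multi-user update loop item concrete, here is a rough back-of-the-envelope sketch. It assumes every write to a published document is pushed to every subscriber; the function name and the write rate are illustrative, not measurements from any real app.

```javascript
// Rough estimate of DDP messages the server must send per second when
// each write to a published document reaches every subscriber.
function fanOutMessagesPerSecond(writesPerSecond, subscribers) {
  return writesPerSecond * subscribers;
}

// Illustrative numbers only: 2 writes per second seen by 7,000 subscribers.
const load = fanOutMessagesPerSecond(2, 7000);
console.log(`${load} outgoing messages per second`);
```

Even a modest write rate multiplies quickly, which is why it helps to map out which publications are shared by all connected users.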

These are just some things that come to mind. Little things that would not be noticeable with small numbers of users add up to cause issues. Make sure you find the root cause of each issue and work on them one at a time to build up your performance gains.

Meteor with DDP pub/sub is very scalable if it is used thoughtfully.

In my app, I only use DDP pub/sub where I need its features. Everywhere else I use async/await Meteor Methods, which also run over DDP but are only called when specific data is needed. I think this is an ideal approach.

I’m happy to elaborate if you have any specific questions that come to mind. Could you tell us a bit more about what you are using in your Meteor Stack? Blaze, React, Vue, etc.? What your app does when concurrent users are connected? etc.

paulishca · June 24, 2020, 4:40am

"JS bundle sending on new client connections & client refreshes (improve by adding a ServiceWorker.js file to cache your JS assets, 1 hr of effort)" - or just deliver from a CDN:

In your Meteor startup/server

if (Meteor.isProduction) {
  WebAppInternals.setBundledJsCssUrlRewriteHook(url => {
    return `https://your_cloudfront_cdn.com${url}&app_v_=${process.env.npm_package_version}`
  })
}

If you go this way, just mention my name so I can give you the Cloudfront configuration.
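A small variation on the hook above, as a sketch: the original appends `&app_v_=...`, which assumes the bundle URL already carries a query string. The hypothetical helper `toCdnUrl` below picks `?` or `&` as needed so it can be checked in isolation. The CDN origin and version values are placeholders.

```javascript
// Hypothetical helper mirroring the rewrite hook above, but choosing
// "?" or "&" depending on whether the bundle URL already has a query string.
function toCdnUrl(url, cdnOrigin, appVersion) {
  const separator = url.includes('?') ? '&' : '?';
  return `${cdnOrigin}${url}${separator}app_v_=${encodeURIComponent(appVersion)}`;
}

// In a Meteor server startup file this could be wired up as:
//   WebAppInternals.setBundledJsCssUrlRewriteHook(url =>
//     toCdnUrl(url, 'https://your_cloudfront_cdn.com', process.env.npm_package_version));

const withQuery = toCdnUrl('/app.js?hash=abc', 'https://cdn.example.com', '1.2.3');
const withoutQuery = toCdnUrl('/app.js', 'https://cdn.example.com', '1.2.3');
```

Note that `process.env.npm_package_version` is only set when the process is started through an npm script, so it is worth checking that it is defined in your deployment.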

juanpmd · June 24, 2020, 5:38am

About your concurrent connections, are you disconnecting them from the server after being idle for X amount of time?

npvn · June 24, 2020, 3:26pm

Hi @pasharayan, I’m the author of pub-sub-lite. It’s a new package so I’m very happy to see early adopters! Thanks for trying out the package and feel free to let me know if you encounter any issues.

Regarding your performance problem:

I would like to echo what has been mentioned by others here about reducing the use of pub/sub in favor of Methods. It seems that you’ve already started down this path with pub-sub-lite, which is great.
Are you currently sending a large amount of data to each client? If so, is it possible to reduce the size of that data (e.g. filtering only the necessary document fields, doing pagination to reduce data on initial load)?
You mentioned that your app has been “becoming unusable”. Does the app feel slow and laggy on the client side? Although most Meteor performance problems occur on the server, you may face them on the client as well. For example, if the client has to process too much data, the UI may become unresponsive. One more thing to look out for: if you store a large amount of data in Minimongo, performance may suffer because Minimongo doesn’t maintain any index. It can be more performant to store your documents in a normal array and use native JavaScript Array methods (find, filter,…) to access them. You will lose the benefit of reactive rendering, so this should only be seen as a workaround for edge cases where the amount of data is too large.
Did you notice anything unusual in your Meteor APM? It would be helpful if you can share your APM screenshots with us.
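As an illustration of the plain-JavaScript workaround mentioned in the list above, here is a minimal sketch that keeps documents outside Minimongo and adds a one-off `Map` index for lookups by `_id`. The document shape and the `indexById` helper are hypothetical.

```javascript
// Build a one-off index over documents kept outside Minimongo.
// `docs` is assumed to be an array of plain objects with a unique `_id`.
function indexById(docs) {
  const byId = new Map();
  for (const doc of docs) {
    byId.set(doc._id, doc);
  }
  return byId;
}

const docs = [
  { _id: 'a1', title: 'Intro' },
  { _id: 'b2', title: 'Module 1' },
];
const byId = indexById(docs);

// Array scan: walks the array until it finds a match.
const viaScan = docs.find(d => d._id === 'b2');
// Map lookup: constant time on average, regardless of array size.
const viaMap = byId.get('b2');
```

Neither structure is reactive, so the UI has to be told explicitly when the data changes.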

filipenevola · June 24, 2020, 3:46pm

Hi, on Galaxy we don’t provide code level support but besides that we try to help as much as possible, even providing insights at the code level and this has been the case also in your recent tickets.

About your issues, are you comparing your connection metric with Google Analytics or another tool? Maybe you are keeping many live queries open for idle clients; a package like mixmax:smart-disconnect could help.
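A minimal sketch of the idle-disconnect idea, written so it runs outside Meteor. This is not how mixmax:smart-disconnect is implemented; the class name `IdleConnection` and its options are hypothetical. In a Meteor client, the injected callbacks could be `Meteor.disconnect` and `Meteor.reconnect`.

```javascript
// Hypothetical idle tracker. `disconnect` and `reconnect` are injected so the
// sketch has no Meteor dependency; `now` is injectable to make it testable.
class IdleConnection {
  constructor({ idleMs, disconnect, reconnect, now = () => Date.now() }) {
    this.idleMs = idleMs;
    this.disconnect = disconnect;
    this.reconnect = reconnect;
    this.now = now;
    this.lastActivity = now();
    this.connected = true;
  }

  // Call on user activity (click, key press, tab becoming visible).
  activity() {
    this.lastActivity = this.now();
    if (!this.connected) {
      this.connected = true;
      this.reconnect();
    }
  }

  // Call periodically, for example from setInterval.
  check() {
    if (this.connected && this.now() - this.lastActivity >= this.idleMs) {
      this.connected = false;
      this.disconnect();
    }
  }
}
```

While disconnected, the client holds no DDP session, so the server can drop the subscriptions and observers it was maintaining for that client.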

As you are already using redis-oplog, you could also use the redis-oplog fine-tuning options (https://github.com/cult-of-coders/redis-oplog/blob/master/docs/finetuning.md), but as others have said, it’s hard to provide specific feedback without knowing your code.

copleykj · June 26, 2020, 2:21pm

Not knowing a whole lot about your application these would be my personal recommendations.

Implement cultofcoders:grapher making use of non-reactive queries where possible to benefit from the performance of their Hypernova engine.
Remove all publishing of reactive counters and replace with denormalized counts, grapher can help with this as well.
As stated by @filipenevola, fine tune redis-oplog by implementing custom channels.

waldgeist · June 26, 2020, 2:40pm

I love the concept of your pub-sub-lite package!

kschingiz · June 26, 2020, 2:47pm

Very good advice from @filipenevola and @copleykj
I would also add:

Analyze your db queries: create appropriate indexes and use projections, ideally so that queries can be answered from an index alone (covered queries). Read this article: https://docs.mongodb.com/manual/core/query-optimization/#covered-query
Use load testing tools with APM to understand why it’s unresponsive
APM tools (on local):
https://github.com/Meteor-Community-Packages/meteor-elastic-apm
monti APM: https://montiapm.com/
Load testing:
https://github.com/kschingiz/artillery-engine-meteor
There was also one load testing tool, but I cannot find it
I have also seen cases where the frontend was doing lots of re-subscribes/method calls on each data change. Use Meteor DevTools (https://chrome.google.com/webstore/detail/meteor-devtools/ippapidnnboiophakmmhkdlchoccbgje?hl=en) to see why and where you are refetching data
There are also cases where oplog cannot be used for a pub/sub, so Meteor falls back to its polling driver, which is very slow. Monti APM, which is based on Kadira, will show those pub/subs
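To illustrate the covered-query point from the list above, here is a simplified checker, offered as a sketch rather than a replacement for `explain()`. It only looks at top-level field names and ignores the other conditions MongoDB applies (for example, multikey indexes on array fields). The function name `isLikelyCovered` and the field names are hypothetical.

```javascript
// Simplified check: a query can be covered when every filtered field and
// every returned field is part of the index, and _id is excluded from the
// projection unless _id is itself in the index.
function isLikelyCovered(indexKeys, filter, projection) {
  const indexed = new Set(Object.keys(indexKeys));
  const filterOk = Object.keys(filter).every(field => indexed.has(field));
  const returned = Object.keys(projection).filter(field => projection[field] === 1);
  const returnsId = projection._id !== 0; // _id is returned unless excluded
  const projectionOk =
    returned.length > 0 &&
    returned.every(field => indexed.has(field)) &&
    (!returnsId || indexed.has('_id'));
  return filterOk && projectionOk;
}

const index = { userId: 1, status: 1 };
const covered = isLikelyCovered(index, { userId: 'u1' }, { status: 1, _id: 0 });
const notCovered = isLikelyCovered(index, { userId: 'u1' }, { status: 1 });
```

The second call is not covered because `_id` is returned by default and is not part of the index, which is an easy detail to miss.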

Good luck with optimization, I believe Meteor can handle even more connections than 7000+.

cormip · June 27, 2020, 2:59am

I’ve had performance problems on Galaxy which Galaxy Support never adequately addressed. Switched to NodeChef and problems were solved. BTW, lots of good performance optimization suggestions in this thread (many of which I had tried to no avail). NodeChef isn’t problem-free though either as I’ve experienced outages on my NodeChef hosted apps. However, when they run, they run well.

vblagomir · June 27, 2020, 10:29am

Our load tests showed roughly 300 concurrent connections per server before load times skyrocket. We usually use the smallest container size that keeps RAM usage at no more than 50% in the ‘idle’ state (in our case 512MB RAM containers, though 256MB containers are normally enough for a simple application). This is without Redis Oplog (which we want to try soon). Strangely, this roughly matches your 7000 connections on 20 servers (7000 / 20 = 350 connections per server). Now I am interested whether this is the maximum physical cap here, or whether larger servers may help.
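The arithmetic above as a small sketch. The function names are hypothetical, and the 300-per-container cap is the load-test figure quoted in this post, not a general Meteor limit.

```javascript
// Average connections each container carries.
function connectionsPerContainer(totalConnections, containers) {
  return totalConnections / containers;
}

// Containers needed to stay at or below a per-container cap.
function containersNeeded(totalConnections, perContainerCap) {
  return Math.ceil(totalConnections / perContainerCap);
}

const observed = connectionsPerContainer(7000, 20); // 350
const needed = containersNeeded(7000, 300); // 24
console.log({ observed, needed });
```

By this estimate, 20 containers at 350 connections each sit slightly above the point where load times started to climb in these tests.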

filipenevola · June 28, 2020, 10:46am

Hi @cormip, how are you doing? I believe you are talking about past events (before the Tiny acquisition), right? I would be happy to review the issues that you had with Galaxy and offer you a trial so you can check Galaxy again.

We have thousands of Meteor apps running on Galaxy, handling thousands of connections without issues.

Please reach out to me at filipe@meteor.com or support@meteor.com so I can understand your issues. If they are still happening, that is even better, so we can improve Galaxy even further

filipenevola · June 28, 2020, 10:52am

I have talked with @pasharayan by voice. He was using a different channel to communicate with the Galaxy team, so even simple requests, like increasing container limits, were not being received. He could not remember which channel it was, but I assume it was not one of the current valid ones. We did a test together, sending requests through the Meteor website, and it is working as expected. I don’t believe he is going to have these issues from now on; the problem was the channel used to reach us, not our support. And to be clear, the best channel is an email to support@meteor.com

At the same time, our support was replying to many messages from Julius (Insidesherpa CTO), but I understand that things were burning at their company, so maybe Pasha was not aware of that.

I just want to reinforce here that Galaxy is a very important piece of the Meteor ecosystem and we (Tiny) are doing our best to provide the best experience possible. Over the last 9 months we have received a lot of great feedback about our support and service.

I know we have things to improve, we always have, but I’m sure we are providing a very good service here. And if you are a Galaxy customer as well and you are not happy, please send me an email at filipe@meteor.com