[SOLVED] Poor Galaxy Meteor Performance Serving Small Bursts of Users Load Test

evolross · August 17, 2017, 2:11am

TL;DR

On Galaxy, Meteor seems to handle only about 40 simultaneous logins (i.e. a burst of users) before pegging the CPU at 100% and giving other logins huge, linearly growing delays in retrieving data and loading the initial page. Adding enough servers on Galaxy to handle a bursty load works… but it is extremely cost prohibitive and also a huge overkill of processing power once the burst has been processed. Wondering if this is reasonable performance or not, and what I can do to remedy it.

Full Details

I have a Meteor 1.5.1 production app deployed on Galaxy that is used at large live events with crowds of people. Consequently, it gets bursts of hundreds, sometimes thousands, of simultaneous users hitting the app (no Meteor login required). The app for the crowd is super simple, queries a small amount of data from Mongo, and uses Meteor Methods to retrieve the data (because I previously stripped out almost all reactivity trying to solve this problem). There’s also an admin app with very few users that doesn’t suffer the same problems.

Over time and with growing use of my app, I’ve noticed that Galaxy and/or Meteor do not handle very many users simultaneously hitting the app in a burst. The servers quickly get pegged at 100%, causing huge delays in the loading of the app, especially in data retrieval from the database in both pub/sub and Meteor Methods. The performance problems start at about forty simultaneous logins. Bursts smaller than forty work okay; the smaller the better.

I started investigating the problem thinking it was run-of-the-mill Oplog and reactivity problems first, then possibly Mongo as a whole (which I host on Compose using the old Mongo Classic). I was hunting through Kadira/Meteor APM metrics seeing really fast database performance with a few users, but then really slow database performance with about forty users. I started stripping out reactivity (pub/sub) and replacing it with Meteor Methods. I still had problems and was confused for a long time. There’s a very long thread here if you’re curious about me trying to diagnose the problems: https://forums.meteor.com/t/mongo-scaling-issues/27905/25?u=evolross.

I had a large event coming up, so I decided to experiment: I changed my containers to Quad size (4.1 ECU) and increased the container count to 12. Lo and behold, my app ran flawlessly, loading in seconds, for hundreds of simultaneous users. No Mongo/Compose problems, no Oplog problems. It was the “burst” of users logging in and overloading the small container that was causing all my problems. I never considered that this could be the problem because I’ve seen one Compact Container (0.5 ECU) on Galaxy handle 400+ connections and users at the same time in my app. If the users hit the app in a slow, ramped fashion, a Compact container works great. It’s the burst it chokes on, and I’m assuming that’s due to all the goodies that Meteor has to deliver to the browser upon first load.

What I finally discovered is that if your Meteor CPU gets pegged, all kinds of wacky stuff happens. Kadira/Meteor APM metrics can get misleading, Galaxy metrics can go bonkers (especially connection counts), and worst of all, response times go through the roof on a linear scale (see graphs below), especially for database queries from both Meteor Methods and pub/sub. The initial/static HTML still loads fairly quickly, but the population of data into the page, even with the use of non-reactive Meteor Methods, hangs for an unusually long period of time. I had users staring at my loading animation for up to sixty seconds, usually refreshing many times in between and causing more load. I’m not sure why the database data delivery is the hang-up and why it starts to run slow while the HTML, CSS, etc. still seem to load quickly even in the burst. You can also see this phenomenon in my graphs below.

So… I set out to load test my app to verify that this “bursting” was actually the problem. I needed to simulate hundreds (if not thousands) of realistic users hitting my app at the same time (read: not ramping up over five minutes, which is unfortunately what a lot of cloud load testers offer when you need to scale up users). This topic deserves an entire thread of its own. Over many days and hours I learned that it’s difficult to properly load test Meteor. To actually reproduce the problem, I had to have real (or at least headless) browsers hitting my app, and lots of them. So JMeter, Gatling, and a whole variety of web/cloud load testers (even a lot of major providers) were unusable because they only test HTTP traffic. They don’t simulate button clicks calling JS functions or the resource downloading, JavaScript execution, database calls, reactivity, Oplog work, etc. of actually loading your Meteor app. These load testers are usually old school. You can run them against your Meteor app, and Meteor performed quite well just serving the HTTP of my app: I ran a JMeter test with 1000 simultaneous hits to my app from my desktop and a Compact container performed great, but that’s only serving the HTTP of the page. Galaxy doesn’t even register these hits as “connections”, but it did show a minor CPU hit. So these types of HTTP tests work, but they don’t come close to actually reproducing the problem.

I found the only way to reproduce the problem was using a cloud service that actually launches browser instances across distributed machines. Headless browsers work fine (e.g. PhantomJS) as long as they process the JS of your page and trigger a proper Meteor “connection”. Headless is better because you can run more browser instances per test machine versus launching real Firefox, Chrome, etc. The problem is that this type of load testing takes a lot of horsepower and machines, and this gets expensive quickly. There are cloud load tester apps that will charge hundreds to thousands of dollars per month to perform tests like this. I looked at a lot of them. 95% of them are too expensive for my app. Amazon Mechanical Turk is also too expensive when you need hundreds/thousands of users. I can’t afford $999 per month and I also can’t afford $50 per test.

The very best solution I found for my use-case (which I know is kind of a weird edge-case) is www.redline13.com. Their service actually has a free tier that lets you connect your own AWS credentials and spin up your own EC2 instances, and they take care of firing off your tests for you and handling all the behind-the-scenes setup of your EC2 instances to start and run PhantomJS. You just pay for your EC2 usage. They can even load super-cheap Spot Instances and let you re-use them for a whole hour. This makes tests of hundreds/thousands of users cost pennies per test. They support a variety of tests, but their Node.js Webdriver integration works fairly well: it makes each user connect, and you can also script a use-case including buttons, forms, etc. Webdriver can do almost anything. And best of all, all the instances fire off almost simultaneously. Their PhantomJS reports back tons of useful metrics that Redline13 saves for free (see below). Redline13 has some paid plans that involve support and extra features (like test replay and cloning).

If you’re interested in load-testing this way, here’s a link to their tutorial for testing like this:
https://www.redline13.com/blog/2017/02/selenium-webdriver-cloud-performance-testing/ and also a YouTube walkthrough: https://www.youtube.com/watch?v=GWBrfucwBtI (both super helpful).

On a quick side-note, getting Redline13 to work properly took a lot of trial and error. When something is free, there’s usually a reason. So there are some quirks and gotchas (I may write a tutorial about creating a Meteor test and running it on Redline13):

You need to run a powerful enough server to run your PhantomJS instances or risk running into problems and anomalies with your testing machine not having enough CPU to run the test instances. There’s a metric on the Redline13 results under Agent Metrics stats that shows “Load Agent CPU Usage”. Make sure this never pegs at 100%. If it does, not all your instances will run. I recommend an M4.16XLarge (or several of them if you’re testing in the thousands of users). One of these boxes can handle hundreds of PhantomJS instances.
I also recommend having your use-case “do something” like adding a document. That way you can count how many documents were added and verify that the total number of documents matches your instance count to verify all your instances ran as expected. Otherwise it can be hard to tell if they just, for example, hit the URL of your app.
Tests can take a few minutes to spin up. Be patient. Check for errors at the bottom of the results page. You can run more tests using the same EC2 instances you started. They’re available for about an hour. And use Spot Instances, it’s way cheaper.
I could only get testing in PhantomJS to work. Firefox and Chrome are also offered, but both of them returned errors. I emailed Redline13 about this and they said they’re working on fixing it. As I mentioned above, PhantomJS is more efficient anyway. PhantomJS also occasionally fails to run; I never figured out why.
I also got errors with JMeter cloud testing on Redline13. However, as I mentioned above, this won’t test your app properly and you don’t really need a cloud to run a lot of users (threads) in JMeter anyway. You can do plenty just from a personal machine.

Note: I’m plugging Redline13 a lot because they offer a TON of value for no charge. Would love to help them get out there more.

So, on to the testing. I was indeed able to reproduce my problem. Very easily. The first image is a series of response graphs showing a linear growth in response time up to 60+ seconds of my app when running 50 users on a Compact container. This correlates exactly to real user feedback and my own results using my app along with a crowd of people. Each following test and graph, I increase the container size thus increasing the ECUs. Each test shows better response time and performance.

Here’s another set of graphs showing the Galaxy performance in each test:

You can see how easily a Compact Container gets pegged. Note the final graph that shows a response time of less than 1 second for 50 simultaneous users. So no problems with Oplog or the database. Or at least no problems with the database. Perhaps I’m throwing enough horsepower with a Quad Container to chew through any Oplog issues, but I’m confident I don’t have any Oplog issues because once my users log in after the burst, hundreds of users can use the app with no problem on a Compact Container with Oplog functionality working great. All of these tests are on a single container. I’ve found that these results duplicate in the same way when adding containers. So adding 4 Quad Containers will handle 200 simultaneous users with similar performance as 1 Quad Container handled 50 simultaneous users.
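One way to read the linear growth in these graphs is as a simple queue. The sketch below is a toy model, not a measurement: it assumes a single pegged Node process works through the burst one user at a time, and the 1.2 seconds per user is just 60 seconds divided by 50 users from the Compact test. `burstResponseTimes` is a hypothetical helper name.

```javascript
// Toy model: one CPU-bound process serves a burst of users one at a time.
// serviceSeconds is an assumed per-user cost, not a measured value.
function burstResponseTimes(userCount, serviceSeconds) {
  const times = [];
  for (let position = 1; position <= userCount; position++) {
    // The user at a given queue position waits for everyone ahead of them.
    times.push(position * serviceSeconds);
  }
  return times;
}

// 50 users on one container at an assumed 1.2 s of CPU work each.
const times = burstResponseTimes(50, 1.2);
console.log("first user:", times[0], "s");
console.log("last user:", times[times.length - 1], "s");
```

Under this model a faster container lowers the per-user cost, and adding containers splits the queue, which would be consistent with both the container-size tests and the 4-Quad observation above.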

So I’m now able to reproduce my problem. The next logical question I had was “is it something about my app?” Is it too complex, is there too much reactivity, am I doing something grossly negligent (which is very likely), etc.? I had a feeling my app was fairly efficient because it’s quite simple and I use a bare-minimum pub/sub that is highly re-used by the crowd using the app. So I thought I would test the latest Meteor todos app. I cloned it, uploaded the latest version to Galaxy, and put the database in the same Mongo Classic deployment as my app on compose.io. This test is slightly different from the above. For my app, I kept the simultaneous users the same and upped the container size just to prove my app could work fast and the database wasn’t the problem. For the todos test I wanted to see if its performance would bog down like my app, so I kept the container the same (1 Compact) and changed the simultaneous users, starting with a high user count and lowering it each test. Here are the results:

And the Galaxy results:

So you can see the todos app, quite similarly to my app, bogs down around 25 to 50 simultaneous users returning really slow performance. I could also witness this delay with my own eyes in my own browser. In the middle of the high-user tests, I tried to hit the todos app I was hosting and it would take 30+ seconds to load all the data into the HTML, exactly as the test results show. I did one last test you can see above in the todos app with 100 simultaneous users on a Quad Container and it chewed it up returning all users in less than one second response time per user! So Quad is awesome, but it still boggles me why a Meteor app needs so much power to deliver first page downloads.

Lastly, I thought it might be good to test a brand new, blank, Hello World Meteor application. One with no packages, frills, or even database connections at all. The good news is it has much better performance. I’ll spare you the graphics:

10 Simultaneous Users on 1 Galaxy Compact Container: Under .5 second response time for all users
25 Simultaneous Users on 1 Galaxy Compact Container: Up to 1.2 second response time on some users
50 Simultaneous Users on 1 Galaxy Compact Container: Up to 1.5 second response time on some users
100 Simultaneous Users on 1 Galaxy Compact Container: Up to 6 second response time on some users
200 Simultaneous Users on 1 Galaxy Compact Container: Up to 12 second response time on some users

This seems like great performance. I would love to get my app and the todos app performing like this. But both of those apps involve the database, pub/sub, Meteor Methods.

In closing, at the end of the day, I have the following questions:

Am I crazy to expect such performance in a simultaneous fashion? Is this performance good/typical/reasonable? Is more than 50 simultaneous users too much to expect on a Compact (0.5 ECU) container? Is 0.5 ECU very little? I don’t know much about ECUs. It seems low considering a Compact container can handle 400+ users in my app once they’ve logged in. In order to handle 1000 simultaneous users I’ll have to scale up to 5 Quad Containers at $317 per month each; that’s $1585 per month. Handling 5000 simultaneous users would cost $7925. And all of this would be a massive overkill of CPU once my users have logged in past the burst and are just using the app.
Are there any strategies for handling my use-case? Is the poor performance from all the initial goodies that Meteor delivers on first download? Is there a way to hook in some other kind of server to help with this? I thought about packages like simple:rest to rebuild the front of my app in a REST-driven format, but eventually my users will need to hit the real Meteor app. And they will still be doing so in huge waves. I could try to delay their querying of data or intelligently throttle it, as long as the assumption holds that simple, less database-query-oriented pages affect the CPU less than normal pages that query a lot. Maybe I could create a quick-loading landing page and then get them the data from there. I’ll run another load test in my app just loading a basic “Hello” page that doesn’t query data until the user clicks a button… just to get them “into” Meteor without querying data.
Should I rebuild this part of my app in a totally different stack? Something more capable of bursts of users? Perhaps stop using Galaxy and roll my own stack to deploy with? Would switching to AWS help versus using something like Galaxy?
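The arithmetic behind the cost figures in the first question can be sketched as below. The $317 price is from the post; the 200 users per Quad Container is an assumption, being what the "5 Quad Containers for 1000 users" figure implies rather than a measured limit. `estimateMonthlyCost` is a hypothetical helper.

```javascript
// Containers needed for a burst, and what they cost per month.
function estimateMonthlyCost(simultaneousUsers, usersPerContainer, pricePerContainer) {
  const containers = Math.ceil(simultaneousUsers / usersPerContainer);
  return { containers: containers, monthlyCost: containers * pricePerContainer };
}

const QUAD_PRICE_USD = 317; // per container per month, from the post
const USERS_PER_QUAD = 200; // assumption implied by "5 Quads for 1000 users"

console.log(estimateMonthlyCost(1000, USERS_PER_QUAD, QUAD_PRICE_USD));
console.log(estimateMonthlyCost(5000, USERS_PER_QUAD, QUAD_PRICE_USD));
```

The result is sensitive to the users-per-container assumption. Plugging in 50 users per Quad, the figure from the single-Quad test of the app above, gives 20 containers for 1000 users instead of 5.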

Sorry about the length. Thanks for reading. I thought this could help a lot of people.

elie · August 17, 2017, 3:17am

Great write-up, and great to hear about your experience. This is probably the biggest problem with Meteor and why it has never taken off as it should have. I think it would be great if you could run these tests using Redis Oplog and see if that performs any better. It is a problem that each instance can only handle 50-100 connections max. I’m a little shocked that even the todos app has this issue.

I haven’t studied your graphs enough, but it seems like running 30 1 GB instances might be smarter than running 12 4 GB ones. You can ask Galaxy to increase your limits.

You should definitely self-host. Look into nginx load balancing for Meteor. I have a blog post on it if you Google “scaling meteor”. It will be 100 times cheaper than Galaxy, which, having used it for a month now at similar costs to what you’re describing, just doesn’t make sense. Not to mention its extremely buggy UI.

Does moving off Meteor make sense to solve your scaling issues? It could. If your app is only used at events and will never hit more than a few thousand users at a time, then maybe not. If you need to hit really big numbers then yes, you may want to move away from Meteor. I wonder if moving Meteor Methods to REST or GraphQL would help much. I didn’t think that was ever the bottleneck for Meteor apps. I always felt it was pub/sub, but I may be wrong.

If you do move away from Meteor pub/sub you lose a beautifully simple system. This is how web development should be. Unfortunately MDG hasn’t worked out a way to scale it.

waldgeist · August 17, 2017, 5:55am

Thanks for the great post! Very interesting.

Did you experiment with different settings for DDP rate limiter?

marklynch · August 17, 2017, 10:19am

Wow. Thanks for the really interesting and detailed post. It would be interesting to try adding in the pieces (db connection, pub/sub, methods) one by one and see which part degrades the performance. It would also be interesting to hear what MDG makes of it.

tomsp · August 17, 2017, 11:28am

Yes great post! Super interesting. One question for my understanding: When does the performance problem occur exactly, when A) 50 users hit your URL and they first browse your website, i.e. the initial page is requested for the first time or B) after they have already browsed your website and only when they hit a login-button to log into their accounts all at the same time?

hwillson · August 17, 2017, 12:43pm

This is awesome @evolross - thanks for digging into this so thoroughly! I have a feeling this issue isn’t Galaxy specific. Would you mind opening a meteor/meteor issue about this? An issue will help kick-start the investigation process; let’s see what we can do on the Meteor side of things to help improve this.

Quick side note - Meteor 1.6 (Node 8) introduces several performance improvements that should help with this. It won’t fix everything, but it should help reduce your load stats. If you get a chance, it would be great to see how things behave when using the current Meteor 1.6 beta (1.6-beta.22).

SkyRooms · August 17, 2017, 3:35pm

Excellent write up.

Galaxy is fine for small apps that aren’t CPU hungry. But in my case of building an MMO, I was lighting their CPUs on fire. I ended up having to self-host, mostly due to cost.

Their cost is 2x anyone else’s.

HOWEVER, it is important to know I did not have Oplog enabled, so there was no caching at all. I shit on Galaxy pretty hard in the past, but I may have just been an idiot. I’d love to re-deploy my app to their recommended setup (Galaxy + mLab) and see how I’m doing now. I bet it would run just fine.

I do plan to move back to Galaxy at some point once the $$$ starts flowing from my app.

raphaelarias · August 18, 2017, 12:04pm

Have you tried caching static files? For example, the big JS file that Meteor has to load at the beginning. Using Cloudflare with Page Rules, you can cache it automatically, and when a new deploy is made it automatically sends the up-to-date version. Make sure to test it by checking the response headers to see if the file is marked as “HIT”.
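A minimal sketch of that header check follows. Cloudflare reports cache results in the `cf-cache-status` response header; the `servedFromCloudflareCache` helper and the sample header objects are hypothetical, and fetching the real headers (browser dev tools, `curl -I`, etc.) is left out.

```javascript
// Returns true when the response headers say Cloudflare served the file from its cache.
function servedFromCloudflareCache(headers) {
  const normalized = {};
  for (const [name, value] of Object.entries(headers)) {
    // Header names are case-insensitive, so compare in lower case.
    normalized[name.toLowerCase()] = String(value);
  }
  return (normalized["cf-cache-status"] || "").toUpperCase() === "HIT";
}

// Hypothetical header sets for the app's JS bundle.
console.log(servedFromCloudflareCache({ "CF-Cache-Status": "HIT" }));  // cached at the edge
console.log(servedFromCloudflareCache({ "CF-Cache-Status": "MISS" })); // fetched from the origin
```

A "MISS" on the first request after a deploy is expected; what matters for a burst is that repeat requests for the same bundle come back as "HIT".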

That way the Galaxy instance would handle “only” the methods and pub/sub calls. Otherwise, for every new user the Galaxy instance has to send a massive JS file and all the package files.

I’m not sure if the gains in performance would be good enough or if that’s the bottleneck.

But one thing I’m sure of: the Compact version is just for “prototyping”. It has a very low ECU.

Does your login have custom validation, or does it require a lot of computing to allow/deny access?

Cloudflare page rules example:

Concurrent users just clicking around your app aren’t intensive on the server. Users downloading your app payload for the first time, all at the same time is highly intensive on your server.

[EDIT]: Reading your other topic, I think the Cloudflare option may be a good solution. You can also use Argo (from Cloudflare too) to make it even faster.

XTA · August 18, 2017, 1:11pm

I’m also wondering why people have so many of these issues. We are running http://www.yabeat.com on a single VPS for about 60 Euro per month (plus 50 Euro for the Mongo VPS), handling about 1,500 concurrent connections (using 4 vCPUs). We are also using the pub/sub system for every single user to provide the playlist feature. The only issue we had was bad MongoDB indexes, which caused crashes and slow performance. After fixing the issues (e.g. we forgot that the order of the fields is very important), everything started working fine. Oplog is also enabled to provide reactivity across all instances.

We explicitly decided against Galaxy. We want to know what we get (e.g. the speed of the CPU). In the case of yB, we are hosting on cloud VPS systems at OVH. We get high availability for a very fair price. Compared to the Galaxy pricing, this seemed more economical to us.
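To illustrate the point about field order in indexes: MongoDB can use a compound index for queries on a leading prefix of its fields. The sketch below is a deliberate simplification of that rule (it ignores sorts and range conditions), and the `userId`/`createdAt` field names and the `coveredIndexPrefix` helper are hypothetical, not taken from the yabeat codebase.

```javascript
// Counts how many leading fields of a compound index a query filters on.
// 0 means the query cannot use the index prefix at all.
function coveredIndexPrefix(indexFields, queryFields) {
  const queried = new Set(queryFields);
  let covered = 0;
  for (const field of indexFields) {
    if (!queried.has(field)) break; // the prefix ends at the first missing field
    covered++;
  }
  return covered;
}

const index = ["userId", "createdAt"]; // hypothetical index { userId: 1, createdAt: 1 }

console.log(coveredIndexPrefix(index, ["userId"]));              // leading field is queried
console.log(coveredIndexPrefix(index, ["createdAt"]));           // leading field is missing
console.log(coveredIndexPrefix(index, ["createdAt", "userId"])); // order in the query does not matter
```

Swapping the index to `["createdAt", "userId"]` flips which of the first two queries is covered, which is why the field order in the index definition matters.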

elie · August 18, 2017, 1:22pm

The rate limiter is there to protect against DDoS attacks. It stops individual users from using the app too aggressively. It’s not something you ever want to affect regular user actions in the app.
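As a rough illustration of why a per-user rate limit does little for a burst of many different users, here is a minimal fixed-window limiter. It is a sketch of the general technique, not the code of Meteor’s DDPRateLimiter; the class name, the limit of 5 requests per second, and the user keys are all assumptions.

```javascript
// Minimal fixed-window rate limiter keyed by user/connection id.
class FixedWindowLimiter {
  constructor(maxRequests, windowMs) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
    this.windows = new Map(); // key -> { start, count }
  }

  // Returns true if the request at time nowMs is allowed for this key.
  allow(key, nowMs) {
    const current = this.windows.get(key);
    if (!current || nowMs - current.start >= this.windowMs) {
      // First request for this key, or the previous window has expired.
      this.windows.set(key, { start: nowMs, count: 1 });
      return true;
    }
    if (current.count < this.maxRequests) {
      current.count++;
      return true;
    }
    return false;
  }
}

const limiter = new FixedWindowLimiter(5, 1000); // assumed: 5 requests per second per key

// One aggressive user: requests beyond the fifth in the window are rejected.
const oneUser = [];
for (let i = 0; i < 7; i++) oneUser.push(limiter.allow("user-a", 0));
console.log(oneUser);

// A burst of 50 different users, one request each: nothing is rejected.
let allowed = 0;
for (let i = 0; i < 50; i++) if (limiter.allow("burst-" + i, 0)) allowed++;
console.log(allowed);
```

Each user in a login burst makes only a handful of calls, so a per-key limit like this never triggers while the server still receives the full load.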

elie · August 18, 2017, 1:27pm

How are you making use of the 4 cores? The site is impressively fast for the amount of traffic you get. Would love to learn more about the architecture

iDoMeteor · August 18, 2017, 1:31pm

Galaxy is not very performant. One only has to look at the line in the docs where they recommend using the smallest instance size that can run one instance of your app. Seriously?

If one instance of a web/mobile app needs 1 full core and 2 GB of RAM to run…something is wrong with your app.

My suggestion is learn how to properly serve Node applications yourself on AWS, which is what MDG is doing. Just not super well.

Also, every app is different. And every version of every app is different. So much depends on your coders and their understanding of how computers compute.

I have my own secrets but I’m saving them for the Meteor hosting service that doesn’t suck that I hope to operate one day…

elie · August 18, 2017, 1:32pm

The responses these forums get sometimes… Wow.

Good luck with your secret hosting

XTA · August 18, 2017, 1:43pm

We are using PM2 (not pm2-meteor) and starting 4 fork instances. Then we have NGINX which provides sticky sessions and distributes the traffic to the instances. To deploy our app, we’ve written a small bash script:

#!/usr/bin/env bash
# Run from the .deploy folder inside the Meteor project.

server="user@www.server.com"
projectFolder="/var/app"
deployUser="user"
meteorServer="http://www.myapp.com"

# Build a fresh server bundle from the project root into .deploy/built.
rm -rf built
cd ..
meteor build .deploy/built --server "$meteorServer"
cd .deploy
mv built/*.tar.gz built/package.tar.gz

# Create the upload and current folders on the server and hand them to the deploy user.
ssh "$server" -p 22 "sudo mkdir -p $projectFolder && sudo mkdir -p $projectFolder/upload && sudo mkdir -p $projectFolder/current && sudo chown -R $deployUser $projectFolder && exit"

# Copy the bundle and the PM2 config to the server.
scp -P 22 built/package.tar.gz pm2.config.js "$server":"$projectFolder/upload"

# Unpack the bundle, install the server dependencies, and restart the PM2 apps.
ssh "$server" -p 22 "cd $projectFolder/upload && rm -rf $projectFolder/current/* && tar xzf package.tar.gz -C $projectFolder/current && cp pm2.config.js $projectFolder/current && rm $projectFolder/upload/* && cd $projectFolder/current && cd bundle/programs/server && npm install --production && cd $projectFolder/current && sudo pm2 restart pm2.config.js && exit"

And the pm2.config file:

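var appName = "app";
var appPath = "/var/app/current";
// Meteor expects ROOT_URL to be a full URL including the scheme.
var rootURL = "http://www.myapp.com";

// METEOR_SETTINGS is passed as a JSON string.
var settings = '{}';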

module.exports = {
    apps: [

        {
            "name": appName + "-0",
            "cwd": appPath + "/bundle",
            "script": "main.js",
            "env": {
                "HTTP_FORWARDED_COUNT": 1,
                "ROOT_URL": rootURL,
                "PORT": 3030,
                "METEOR_SETTINGS": settings
            }
        },


        {
            "name": appName + "-1",
            "cwd": appPath + "/bundle",
            "script": "main.js",
            "env": {
                "HTTP_FORWARDED_COUNT": 1,
                "ROOT_URL": rootURL,
                "PORT": 3031,
                "METEOR_SETTINGS": settings
            }
        },


        {
            "name": appName + "-2",
            "cwd": appPath + "/bundle",
            "script": "main.js",
            "env": {
                "HTTP_FORWARDED_COUNT": 1,
                "ROOT_URL": rootURL,
                "PORT": 3032,
                "METEOR_SETTINGS": settings
            }
        },


        {
            "name": appName + "-3",
            "cwd": appPath + "/bundle",
            "script": "main.js",
            "env": {
                "HTTP_FORWARDED_COUNT": 1,
                "ROOT_URL": rootURL,
                "PORT": 3033,
                "METEOR_SETTINGS": settings
            }
        },


    ]
};

The files are stored within our Meteor project in the .deploy folder.

iDoMeteor · August 18, 2017, 1:57pm

Try Passenger.

My guess is the article poster is maxing his DB.

iDoMeteor · August 18, 2017, 1:59pm

Your post was super valuable? You don’t even know how to scale your pub/sub problems. Pfft.

elie · August 18, 2017, 2:17pm

It’s not. If he uses quad instances all works fine without touching his db size.

elie · August 18, 2017, 2:19pm

A little shocked that only 4 instances handle this amount of traffic. You never hit limits with this setup?

iDoMeteor · August 18, 2017, 2:48pm

I’m sort of thinking this might be something going on in the request/response layer or sessions initialization. Could be a poorly performing encryption algorithm, a bottleneck in the FS temp area, etc… since the problems occur during burst logins it could also be something about the way the accounts system is sending/retrieving/storing authentication keys.

Honestly, this would be a difficult problem to debug without direct access to all of the server metrics and logs.

XTA · August 18, 2017, 4:04pm

Nope, we have been using this configuration since December 2016 and everything is working fine. But we can’t run MongoDB and the Meteor app on the same instance; that would hit the limits. Paying 100 Euros for this number of connections is pretty fair. We used PHP + Apache before and had to run 2 dedicated servers.