[SOLVED] Poor Galaxy Meteor Performance Serving Small Bursts of Users Load Test


#21

A couple more thoughts and clarifications:

@elie @tomsp When I say “simultaneous logins” I’m talking about “first time loads” of my app, at the exact same time (+/- a few seconds), with users who have never loaded the app before. And I’m not actually talking about a Meteor.login call because my app doesn’t require a Meteor.login. Similar to todos, a user hits my app and there’s some default data that loads based on the URL they entered. Definitely not the same as “concurrent users”. As far as a lot of concurrent users, on one Compact container my app can actually load 300 users, slowly, and it works fairly well. Here’s a graph demonstrating…

Though, I found about 300 users was the max before the Compact container crashed when all my test users did something at the same time (e.g. created a document) - and this simultaneous behavior is typical of my app’s usage. This isn’t bad for one Compact container. In other tests I was able to load 650 users on one Compact container with plenty of memory, CPU, etc. to spare. If I hadn’t had these test users all do something at the same time in the test, the container would probably have handled that many users fine (that’s my theory at least - I should probably test this).

I was thinking that all this may not actually be such poor performance. Handling a burst of 40 first-time hits to your app at the same time isn’t bad for a Compact (0.5 ECU) container. The typical app (that doesn’t have my crazy use-case OR some publicity event that triggers a bunch of instant usage) would likely never see 40 simultaneous first-time downloads until it had a much, much higher amount of traffic. Your simultaneous first-time download count would be much lower at any given time - i.e. it would take thousands of organic users to get to 40 simultaneous first loads within a few seconds. So if you ratchet up to a Standard or Double container, you’re going to be fine for a long time. Well, as long as you don’t end up on Product Hunt, Slashdot, etc. :grinning: (which has happened to users on this forum, whose apps were consequently DoS’d - hence this thread).

@hwillson There is still perhaps an issue here. I blew up a graph of the Redline13 Average Response Times for the todos app with all the various assets that get downloaded (HTML, JS, CSS, fonts, images, etc.). They all load really quickly except for the one JS file. That’s the outlier. I was mistakenly thinking there was more than one Meteor JS file that gets downloaded, but as @raphaelarias said, there’s a single big one. The issue is that when I have this “burst of users” all those assets load very quickly, except the one JS file. You can see them in the below graph (and in my above graphs). All those colored lines at the bottom are those various assets. And you can see they’re still loading fast in the graph compared to the one main JS file, the outlier growing linearly in delay. With that said, the JS file is much larger than the other assets at 541KB (most of the others are less than 20KB and some are only a few bytes). This JS file matches exactly the delay you can witness when loading the app. That delay is the population of data into your app templates. I should say, it’s the completion of loading data into your templates, because you can see in your app (both in my app and todos) that the data slowly loads, template by template, over the thirty to sixty seconds. How Redline13/PhantomJS knows that the data hasn’t loaded and how that relates to the JS file - I don’t know. That’s for the MDG devs. I don’t know if that JS file simply doesn’t finish loading until the data is ready, and/or it’s waiting for some return value from that JS file? This delayed behavior is the issue that I would report. Why does everything else load fairly fast but the data? Even under major load? It’s not the database @iDoMeteor because when you crank up CPU, the data loads super fast as expected.
Perhaps it’s something to do with the “data querying and delivering” process getting blocked by the Meteor server trying to serve that 541KB JS file to all the simultaneous users. If it weren’t for this outlier, the load times of all the other assets (colored lines at the bottom of the graph) would actually be fairly reasonable given the simultaneous hits. These smaller asset response times are more in line with what I would expect.

@raphaelarias The Cloudflare caching is promising, especially given the above. :+1: I will definitely try and test it. So the idea is that you identify all assets (HTML, JS, images, etc.) and have Cloudflare cache and deliver them before your Meteor server gets the chance? It seems the only downside is that when you update your app, you’ll have to wait for Cloudflare to re-cache your JS asset, right? Have you tried this, and does it work for you? AWS CloudFront can do something similar, right? It would be cool if there were some way to tell your Meteor app about your own CDN to deliver the big JS file, rather than relying on a middleware service to intercept the traffic, but it’s cool if it works.

@XTA Your hosting setup looks really cool. If I find I need to scale way up, I’ll probably have to move off of Galaxy for cost reasons. I do like Galaxy, though, for its metrics feedback (and now with Meteor APM especially), and the ability to scale up container counts with one click is amazing. It’s helpful in my use-case because I can ask clients when their large-audience event is happening and then just scale up containers temporarily to serve the crowd. The problem is this is annoying, doesn’t scale to a lot of events, and is generally prone to forgetting and/or leaving a ton of Quad containers turned on. But it works for now, until I figure out how to get better performance with less CPU power. What country is OVH in? Do they use their own cloud or are they built on top of AWS, etc.? Is it easy for you to add more “containers” if needed?

@waldgeist Yeah, I don’t think DDP rate limiting would help here, because it’s for per-connection rate limiting, which isn’t my issue.


#22

Yes. The good thing about Cloudflare vs. CloudFront, for example, is that Cloudflare does it automatically for you. If they don’t have a file cached, they fetch it from your server, and next time Cloudflare will serve the file, not your server. Your server in this case would only have to handle the HTML (I think you can force Cloudflare to cache HTML) and the DDP connections.

For optimal caching, set the Page Rules (required), use Argo (faster caching due to server proximity), and the Pro plan, which will decrease missed requests (when Cloudflare does not serve the file). In general it should be below 30 dollars per month if you use both Argo and the Pro plan; otherwise it is free.
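For illustration only (the values below are an example I’m sketching, not the exact settings described in this thread), a Page Rule that tells Cloudflare to cache everything it can might look like:

```
# Hypothetical Cloudflare Page Rule (dashboard → Page Rules)
URL pattern:     yourapp.com/*
Cache Level:     Cache Everything
Edge Cache TTL:  2 hours
```

Note that “Cache Everything” includes the HTML, so you’d want to be sure your caching level respects the query strings Meteor appends to its bundle URLs, or clients can be served stale versions after a deploy.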

Yes. You can, after the deploy, access your app to force Cloudflare to re-cache it (if you have no hot code reload running and no users connected). And after it’s cached (it takes only one user to cache it), you are good to go.

We use it in production and it made our app faster, as it’s now Cloudflare’s fast CDN serving static files, not Galaxy (the way it should be).

CloudFront is not as transparent as Cloudflare, though, but it gives you much more control.

I think there is something you can set on Meteor to deliver static files from a CDN, but I’m not sure about the JS file.
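There is indeed a hook for this in Meteor’s `webapp` package: `WebAppInternals.setBundledJsCssPrefix` rewrites the URLs of the bundled JS/CSS to point at a CDN. A minimal server-side sketch (the CloudFront domain below is a placeholder, and this only runs inside a Meteor app, so treat it as an illustration rather than a tested config):

```javascript
// Server-side only. Assumes a CDN distribution configured to pull from this
// app's origin; the domain below is a placeholder, not a real distribution.
import { Meteor } from 'meteor/meteor';
import { WebAppInternals } from 'meteor/webapp';

if (Meteor.isProduction) {
  WebAppInternals.setBundledJsCssPrefix('https://dxxxxxxxxxxxx.cloudfront.net');
}
```

This only rewrites the bundled JS/CSS URLs; other assets (e.g. files in /public) would still need a pull-through CDN in front of the app or absolute CDN URLs.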

Dynamic imports can help too, if your JS file is too big.

Galaxy does not have an API or auto-scaling, but since it’s a Meteor app itself, you could try to reverse-engineer it and automate this yourself.

EDIT: I just tested it: our JS file is 1.3MB in size (after some optimisations). Without Cloudflare (or with the cache flag marked as EXPIRED) it takes 3.5s to download (on a 150Mbps connection); with Cloudflare, it takes 969ms (HIT) (in a second test it was 465ms).


#23

Read this about CDNs:

Can’t believe this wasn’t being used. You’re way overthinking things if you’re not even using a CDN. It was definitely mentioned in your threads already.


#24

OVH is a French company and has its own datacenters in Canada, France, and Germany. We’re using the public cloud instances:

https://www.ovh.com/us/public-cloud/instances/prices/

You can easily add new instances within a minute.


#25

This is pretty normal, and can be improved greatly if some production items are implemented.

  • CDN for the main JavaScript bundle, CSS, and any static files (images/favicon/anything).
  • Only use the smaller servers, and scale the instance count up before scaling the size.
  • Set up at least a main backend cluster that you can connect to and offload method calls/data processing to. Ideally use microservices/FaaS or a few small clusters to handle different APIs.
  • Make sure all needed indexes are added to the database.
  • Filter out any unneeded fields from documents/methods to reduce the transfer size.
  • Make sure any methods/functions are non-blocking on the server.
  • Set up a cache on the backend for recurring data. You can track client connection IDs on the server to build a cache.
  • Server-side rendering is usually blocking, so make sure you are using a cache object for SSR, which will greatly increase performance. Next.js has this same issue at 40-50 clients, because the initial SSR load is blocking on the server, but once you enable a cache it scales much larger.
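The “cache on the backend for recurring data” item above can be sketched minimally as an in-memory TTL cache. This assumes a single Node/Meteor process (with multiple containers you’d want something shared like Redis), and all names here are illustrative:

```javascript
// Minimal in-memory TTL cache: entries expire after ttlMs milliseconds.
class TtlCache {
  constructor(ttlMs) {
    this.ttlMs = ttlMs;
    this.store = new Map();
  }
  get(key) {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() - entry.at > this.ttlMs) {
      this.store.delete(key); // expired: evict and report a miss
      return undefined;
    }
    return entry.value;
  }
  set(key, value) {
    this.store.set(key, { value, at: Date.now() });
  }
}

// Wrap an expensive lookup so repeat requests for the same shared data
// hit the cache instead of the database.
const cache = new TtlCache(60 * 1000);
function getSharedData(key, fetchFn) {
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const value = fetchFn();
  cache.set(key, value);
  return value;
}

let calls = 0;
const result1 = getSharedData('popular-doc', () => { calls += 1; return { n: 42 }; });
const result2 = getSharedData('popular-doc', () => { calls += 1; return { n: 42 }; });
console.log(calls, result1.n, result2.n); // second lookup is served from cache
```

The same pattern works inside a Meteor method: key the cache on the query parameters and let many concurrent users share one database fetch.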

If you can learn to use Google Cloud Platform directly, you can scale your app up much larger using Container Engine. You can leverage containers/pods/instances to scale to 100s of pod instances for a fraction of the price of Galaxy.


#26

You can test this initial loading issue with this tool, in addition to what you mentioned:

## Install artillery globally on your system
npm install -g artillery

## Example command (one line): run for 200 seconds, creating 5 virtual users
## every second, each sending 30 GET requests
artillery quick --duration 200 --rate 5 -n 30 https://yourapp.com/

## When done it will dump a .json report; you can generate a visual with:
artillery report artillery_report_[your info].json

You can adjust the users/requests, but you will quickly see a backlog of concurrent users because of the initial bundle/files download. Offload that and any SSR to a CDN/cache and it will cease to be an issue. Rerun the artillery test and you will see the number of concurrent users is much lower, since they are processed more quickly and don’t exponentially overload the server. That is the main problem.


#27

Interesting. I’ve mostly heard of people not getting Cloudflare caching set up with Meteor. Do you have the HTTP proxy & CDN enabled in the DNS section of Cloudflare for your domain, or is it just passthrough for your Meteor-hosted domain? Has this been working nicely for you? :slight_smile:


#28

Every connection goes through Cloudflare (orange cloud), yes. Only for third-party services have we deactivated Cloudflare, using it purely as a nameserver.

Yes. It works perfectly, even better than a Galaxy-only setup. Because Cloudflare caches these files, the app is faster to load, even after we deploy a new version.

We do use Argo (from Cloudflare) too. In our Meteor app, we also use appcache; even though appcache is not perfect, it really helps make an “impact” when we show our customers how fast the app loads.

Just be advised that if you have a high-traffic Meteor app, you may have to use the Pro plan (or higher) of Cloudflare (but they will let you know) due to the “limit” on WebSocket connections.

And regarding latency over WebSockets: perfect. We never had any errors.

It goes without saying that dynamic imports just work too.

PS: We were looking to provide our customers with their own URL (to allow them to use their own domain to access the app), and Cloudflare has a service for that too, but only on Enterprise plans (2k+ per month).

PS 2: I’m not from Cloudflare’s team hehe.


#29

Interesting. Could you provide the complete Cloudflare settings you got this working with?

I just tried this today on my test server and ran into two issues. After updating the app (deployed a new version via mup), my application started refreshing constantly; it seemed like it tried to hot-push to the latest version but always thought it was on the wrong version. This happened on two different computers. It was solved after I disabled the Cloudflare passthrough. Did you do some special configuration to bypass this problem?

Another issue I got was that it disabled socket connections; I got an error about this in the Chrome console. I also had to disable .mydomain/sockjs/ to let socket connections bypass Cloudflare. Did you also do this?

FYI: I’ve personally used Amazon CloudFront in the past, which was perfect after I configured it correctly. However, I’ve moved away from it after moving to Microsoft-sponsored super servers :slight_smile: It would be cool to easily get CloudFront going again, though, as it seemed to give a little bit of a boost to my load times when it was working (around 100-200ms when a user connected the second time).

I’ve personally configured my Meteor server to cache most files with personalized TTL times; these usually get cached on the user’s computer, which is the fastest route for returning users. For initial connections, however, Cloudflare would be great to serve the files faster for new users.

ping @raphaelarias :slight_smile:


#30

Thanks for writing this up!

One question on this:

How do you achieve this, now that the old cluster package is abandoned? I mean, how do you achieve this with sticky sessions and Meteor’s reactivity still in place, if you need them? I cannot refactor my whole app into microservices (yet).


#31

Not sure I understand your question; it depends on the environment. If you are using Galaxy, sticky sessions are taken care of with the cookie they give to each client to associate the user with a particular instance. If you are using Google Cloud, you can do the same thing with their load balancer - a cookie/sticky session per client - and also give it a max time if needed. Also, using Google Container Engine (GCE) (Galaxy is also using containers), there is no need to manage multiple processes.

In GCE, you set up a compute instance (a normal VM, say 2x CPU, 16GB mem), then you create pod instances. You can customize the min/max resources each pod can have. One instance of Meteor runs in a container in a pod.

You can over-scale the pods, say 6 pods per instance.

1 x Compute Instance - 2x CPU, 7.5GB mem | 6 Pods
1 x Compute Instance - 2x CPU, 7.5GB mem | 6 Pods
1 x Compute Instance - 2x CPU, 7.5GB mem | 6 Pods

= 18 Pods or instances of Meteor running
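As a rough sketch of the layout above (every name, image, and resource number below is a placeholder I’m assuming for illustration, not a configuration anyone in this thread posted), a Kubernetes Deployment for the 18-pod setup could look like:

```yaml
# Hypothetical Deployment: 18 replicas of a Meteor app; with ~250m CPU
# requested per pod, roughly 6 pods schedule onto each 2-CPU node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: meteor-app
spec:
  replicas: 18
  selector:
    matchLabels:
      app: meteor-app
  template:
    metadata:
      labels:
        app: meteor-app
    spec:
      containers:
        - name: meteor
          image: yourregistry/meteor-app:latest  # placeholder image
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: "250m"
              memory: "1Gi"
            limits:
              cpu: "500m"
              memory: "1.2Gi"
```

The scheduler then spreads the replicas across the three nodes, and scaling is just a matter of changing `replicas` (or attaching a HorizontalPodAutoscaler).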

GCE instance cost: $145.64/month total = $8.09/month per Meteor instance.
This would be $600-700+/month on Galaxy. Here is how this is calculated:

From AWS

1 ECU is the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor

In Galaxy a Micro Pro instance is 0.5 ECU & 512MB and costs $40/month.
The above Google Cloud config gives you ~28 ECU (2.4GHz processors (2.4 ECU) x 2 per instance) and 22GB of memory.

Each of the 18 Meteor instances could use up to 1.5 ECU in that configuration. Even if you scaled down to 1.0 ECU per Meteor instance, they would get 2x the resources of the Micro Pro instance on Galaxy. So you can see that if you scaled the instance sizes down dramatically to match the resources on Galaxy, it would be even cheaper.

Google’s load balancer allows you to configure a max number of connections before failing over, keep the CPU under a certain threshold, etc., in addition to what nginx could offer as well.

You can also run nginx within this configuration, in addition to the Google Load Balancer, for more advanced setups. Add 1-2 pods of nginx per instance easily.

You probably don’t even need this number of instances running. But having this many pods helps with the exponential contention issue. If you run fewer pods and make them more powerful, they will just go under-utilized or end up locked in over-utilization during spikes.

I wrote about this here for Next.js, but it can be thought of the same for Meteor, or any Node.js app.


#32

In addition to the above note,

For the microservices bit, one easier way to do this is to run a separate server-side-only backend that only your Meteor server-side methods call. This helps offload expensive CPU tasks without dealing with the client side. It’s really just a matter of moving those methods to another server cluster and having an exported connection you import and call on the normal Meteor servers.

  • For the cluster plugin, leaving everything as smaller instances and scaling those is much better. You are able to leave the scaling to the Kubernetes scheduling algorithms, set up autoscaling, and easily over-scale the servers to avoid contention/lock-up.

#33

This may be happening due to the type of caching you selected in Cloudflare: if it’s not taking query strings into consideration, it will serve an outdated version over and over again, and when Meteor initialises after the reload, it will reload again trying to fetch the latest version.

We use WebSockets with no changes (it just works for us), via Cloudflare too. There is a setting in Cloudflare to allow WebSockets, though.

My configuration:

Page rules:

Speed:
Auto minify: all on
Polish: Lossy
Mirage: on
Rocket Loader: Automatic

Caching:
Caching level: Standard
Browser cache expiration: 2 hours
Always online: on

Crypto:
SSL: Flexible (for our website, but we override it in the Page Rule)
Opportunistic encrypt: on
TLS 1.3: on
Automatic HTTPS Rewrites: on

Traffic:
Argo: on

Network:
HTTP/2 + SPDY: on
Websockets: on

We use force HTTPS on Galaxy, Cloudflare’s WAF, and that’s basically it.

PS: We use Cloudflare Pro, but for months we used Free and it just worked too.


#34

TL;DR: A CDN for the static assets was the answer.

I successfully set up AWS CloudFront to serve my app assets (JS bundle, CSS, images, and everything else in /public) per the Galaxy docs above. It’s working great and I’m getting vastly better performance. Very reasonable, especially if I switch off Galaxy at some point in the future. I also think moving to redis-oplog and doing additional optimization in my app code will help with CPU pressure (I will definitely continue testing).

FWIW adding a CDN was on my roadmap, just hadn’t done it yet. I thought because the page was loading and I was seeing my “loading animation” that the problem must be with Mongo, Compose, or the Oplog in general. I didn’t consider that all the other simultaneous users downloading the JS bundle at the same time would block retrieving data from Mongo for the users who had downloaded the bundle already.

Here’s an updated performance graph for my app. 50 simultaneous users on 1 Galaxy Compact now run with a 4s response time for all users, versus up to 60s without a CDN. That’s a 15x improvement in performance. I’m now getting the same performance from one Galaxy Compact container (0.5 ECU) that I previously got from one Galaxy Double container (2.1 ECU). That’s a 4x improvement in cost. Notice the slowest asset is now the HTML file for my app. Something else I noticed: doubling the containers or doubling a single container’s size increased performance by 4x-7x (way more than 2x). The CPU still pegs a lot, but my app never crashes; it just slows down a little. Again, the CPU pegging will probably be reduced by additional code optimizations:

Thanks for all the help and responses. Marking this as SOLVED. I do have a couple questions though:

  • I can see the slowest asset served is now the original HTML document that the URL delivers, which loads everything else. Why does this not also get served from CloudFront? I noticed its URL is still my app’s domain (unlike almost everything else, which is on my CloudFront URL.)

  • My bundled JS file is now served by CloudFront - which is great. But I’ve read in other posts that there’s a way to take really common JS packages (e.g. jquery, moment, etc.) and link to them from their respective third-party CDNs, thus reducing the size of your bundled JS package. This also allows the browser to use a package immediately if it’s already cached. How is this achieved? Does it have to do with Meteor 1.5 dynamic importing?

  • @sbr464 Great list of optimizations for production, and thanks for the GCE write-up. In general, why do you suggest using smaller servers and scaling instance counts first versus size? On Galaxy, the ECUs and price all seem to equal out, and the performance is almost the same looking at my above graph. It also seems a higher-ECU instance would handle random simultaneous spikes better than several lower-ECU instances. Just curious about this. Also, I’d like to learn more about setting up a backend app to handle Meteor Methods and data processing - are there any good posts about this? And about setting up a cache for recurring data - know any good posts?
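On the dynamic-import part of the question above: Meteor 1.5’s `import()` reduces the initial bundle by loading modules lazily (which is separate from linking packages from third-party CDNs - in Meteor, dynamically imported modules are fetched over the app’s own connection on first use). A minimal sketch, using `node:util` purely as a stand-in for a heavy npm dependency such as moment:

```javascript
// Modules loaded with import() are excluded from the initial client bundle
// and fetched lazily on first use. 'node:util' stands in here for a heavy
// dependency; in a real app you might write: await import('moment').
async function lazyInspect(value) {
  const { inspect } = await import('node:util');
  return inspect(value);
}

lazyInspect({ hello: 'world' }).then((s) => console.log(s));
```

The trade-off is a small delay the first time the module is needed, in exchange for a smaller initial download for every user.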

artillery doesn’t seem to work well for Meteor apps. It seems to test the same way as JMeter: it fires HTTP GETs but doesn’t actually run any JS, so at least in my app and todos it’s not exercising the app (no routing, pub/sub, method calls, database requests, etc.). The server consequently performs wonderfully, as it’s just serving the small HTML file. Galaxy also doesn’t register any connections (same with JMeter).


#35

:open_mouth: This explains a lot.


#36

Yep. So a Micro instance is running at 500MHz or possibly more; sometimes the underlying provider will allow brief spikes of 2x usage, etc. But that is the baseline. That’s why having more instances helps.


#37

Yes, you are correct. I’ll post back what we use for headless/browser-environment testing.


#38

Thank you, fantastic! The query-string value was probably it; I’ll toy around with this today.


#39

#40

Yeah, I had read that article long before posting this thread. It’s kind of a Meteor Scaling 101 beginner’s guide (e.g. add indexes, enable oplog tailing, etc.). The issue I discussed above is different. Notice none of these “Scaling Meteor” articles mention enabling a CDN to serve your bundle, which you should absolutely do.

Once I finish my scaling saga - and I’m almost there - I plan to write a comprehensive article on all the tests, approaches, and strategies I ultimately found successful. I’ve gone from pub/sub, to all Meteor Methods, back to pub/sub, to redis-oplog… it’s been a long journey. Here’s one thing that almost no one talks about that’s been a HUGE gain for me (mostly because of my app’s use-case): database caching on the server. Cache shared data that a lot of users request.