[SOLVED] Poor Galaxy Meteor Performance Serving Small Bursts of Users Load Test

You can test this initial loading issue with this tool, in addition to what you mentioned:

## Install artillery globally on your system
npm install -g artillery

## Example command (one line): runs for 200 seconds, creating 5 virtual users every second, each sending 30 GET requests
artillery quick --duration 200 --rate 5 -n 30 https://yourapp.com/

## When it's done it will dump a .json report; you can generate a visual report with:
artillery report artillery_report_[your info].json

You can adjust the users/requests, but you will quickly see a backlog of concurrent users because of the initial bundle/files download. Offload that and any SSR to a CDN/cache and it will cease to be an issue. Rerun the artillery test and you will see that the number of concurrent users is much lower, since requests are processed more quickly and don’t exponentially overload the server. That is the main problem.


Interesting. I’ve mostly heard of people not getting the Cloudflare cache set up with Meteor. Do you have the HTTP proxy & CDN enabled in the DNS section of Cloudflare for your domain, or is it just passthrough for your Meteor-hosted domain? Has this been working nicely for you? :slight_smile:

Every connection goes through Cloudflare (orange cloud), yes. Only for third-party services have we deactivated Cloudflare and use it only as a nameserver.

Yes. It works perfectly, even better than a Galaxy-only setup. Because Cloudflare caches these files, the app is faster to load, even after we deploy a new version.

We use Argo (from Cloudflare) too. In our Meteor app we also use appcache; even though appcache is not perfect, it really helps make an “impact” when we show our customers how fast the app loads.

Just be advised that if you have a high-traffic Meteor app, you may have to use the Pro plan (or higher) of Cloudflare (but they will let you know) due to the “limit” on WebSocket connections.

And regarding latency and WebSockets: perfect. We never had any errors.

It goes without saying that dynamic imports just work too.

PS: We were looking to provide our customers with their own URL (to allow them to use their own domain to access the app), and Cloudflare has a service for that too, but only on Enterprise plans ($2k+ per month).

PS 2: I’m not from Cloudflare’s team hehe.


Interesting. Could you provide the complete Cloudflare settings that you got this working with?

I just tried this today on my test server and ran into two issues. After updating the app (deployed a new version with mup), my application started refreshing constantly; it seemed like it tried to hot code push to the latest version but always thought it was on the wrong version. This happened on two different computers. It was solved after I disabled the Cloudflare passthrough. Did you do any special configuration to get around this problem?

Another issue I got was that it broke socket connections; I got an error about this in the Chrome console. I also had to exclude .mydomain/sockjs/ to let socket connections bypass Cloudflare. Did you do this too?

FYI, I personally used Amazon CloudFront in the past, which was perfect after I configured it correctly. However, I moved away from it after switching to Microsoft-sponsored super servers :slight_smile: It would still be cool to easily get CloudFront going, as it seemed to give a bit of a boost to my load times when it was working (around 100-200 ms when a user connected the second time).

I’ve personally configured my Meteor server to cache most files with customized TTLs, so they end up cached on the user’s computer, which is the fastest option for returning users. However, for initial connections Cloudflare would be great to serve the files faster to new users.
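In case it helps anyone, here is a minimal sketch of that kind of per-path caching on the server, assuming the standard webapp package; the path pattern and max-age are just illustrative, not the exact configuration described above.

```js
// Hedged sketch: set Cache-Control headers per path on the Meteor server.
// The regex and max-age values are examples, not the poster's actual config.
import { WebApp } from 'meteor/webapp';

const THIRTY_DAYS = 60 * 60 * 24 * 30; // seconds

WebApp.rawConnectHandlers.use((req, res, next) => {
  // Static assets (Meteor's hashed bundle files, images, fonts) can be cached
  // aggressively by the browser and by any CDN sitting in front of the app.
  if (/\.(js|css|png|jpe?g|svg|woff2?)(\?|$)/.test(req.url)) {
    res.setHeader('Cache-Control', `public, max-age=${THIRTY_DAYS}`);
  }
  next();
});
```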

ping @raphaelarias :slight_smile:

Thanks for writing this up!

One question on this:

How do you achieve this, now that the old cluster package is abandoned? I mean, how do you achieve this with sticky sessions and Meteor’s reactivity still in place, if you need them? I cannot refactor my whole app into microservices (yet).

Not sure I understand your question; it depends on the environment. If you are using Galaxy, sticky sessions are taken care of with the cookie they give to each client, which associates the user with a particular instance. If you are using Google Cloud, you can do the same thing with their load balancer (a cookie/sticky session per client) and also give it a max time if needed. Also, using Google Container Engine (GCE) (Galaxy also uses containers) there is no need to manage multiple processes.

In GCE, you set up a compute instance (a normal VM, say 2x CPU, 7.5 GB mem), then you create pod instances. You can customize the min/max resources each pod can have. One instance of Meteor runs in a container in a pod.

You can over-scale the pods, say 6 pods per instance.

1 x Compute Instance - 2x cpu, 7.5gb mem | 6 Pods
1 x Compute Instance - 2x cpu, 7.5gb mem | 6 Pods
1 x Compute Instance - 2x cpu, 7.5gb mem | 6 Pods

= 18 Pods or instances of Meteor running

GCE instance cost: $145.64/month total, which works out to $145.64 / 18 ≈ $8.09/month per Meteor instance.
The equivalent would be $600-700+/month on Galaxy (e.g. 18 containers at the $40/month Micro Pro price ≈ $720), which is calculated like this:

From AWS

1 ECU is the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor

In Galaxy a Micro Pro instance is 0.5 ECU & 512MB and costs $40/month.
The above Google Cloud config gives you ~28 ECU (2.4 GHz processors (2.4 ECU) × 2 per instance) and 22 GB of memory.

18 instances of Meteor could use up to 1.5 ECU each in that configuration. Even if you scaled down to 1.0 ECU per Meteor instance, each would still get 2x the resources of the Micro Pro instance on Galaxy. So you can see that if you scaled the instance sizes down dramatically to match the resources on Galaxy, it would be even cheaper.

Google’s load balancer allows you to configure a max number of connections before failing over, keep the CPU under a certain threshold, etc., in addition to what nginx could offer as well.

You can also run nginx within this configuration, in addition to the Google Load Balancer, for more advanced setups. You can easily add 1-2 nginx pods per instance.

You probably don’t even need this number of instances running, but having this many pods helps with the exponential contention issue. If you run fewer pods and make them more powerful, they will just go under-utilized, or end up locked in over-utilization during spikes.

I wrote about this here for Next.js, but the same thinking applies to Meteor, or any Node.js app.


In addition to the above note,

For the microservices bit, one way to make this easier is to run a separate, server-side-only backend that only your Meteor server-side methods call. This helps offload expensive CPU tasks without dealing with the client side. It’s really just a matter of moving those methods to another server cluster and having an exported connection that you import and call from the normal Meteor servers.
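A minimal sketch of that pattern; the backend URL and method names below are made-up placeholders, not something from this thread:

```js
// Hedged sketch: the "normal" Meteor servers open one DDP connection to a
// server-only backend cluster and forward expensive work to it.
import { Meteor } from 'meteor/meteor';
import { DDP } from 'meteor/ddp-client';

// Single shared connection, created at server startup. The URL is a placeholder.
export const backend = DDP.connect('https://internal-backend.example.com');

Meteor.methods({
  'reports.generate'(params) {
    // Forward the heavy lifting to the backend; 'reports.generateHeavy' is a
    // made-up method name that would be defined on the backend cluster.
    // (On newer Meteor versions you would use backend.callAsync instead.)
    return backend.call('reports.generateHeavy', params);
  },
});
```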

  • For the cluster plugin: leaving everything as smaller instances and scaling those is much better. You are able to leave the scaling to the Kubernetes scheduling algorithms, set up autoscaling, and easily over-scale the servers to avoid contention / lock-up.

This may be happening due to the type of caching you selected in Cloudflare. If it’s not taking query strings into consideration, it will serve an outdated version over and over again, and when Meteor initialises after the reload it will reload again trying to fetch the latest version.

We use WebSockets with no changes (it just works for us), via Cloudflare too. There is a setting in Cloudflare to allow WebSockets, though.

My configuration:

Page rules:

Speed:
Auto minify: all on
Polish: Lossy
Mirage: on
Rocket Loader: Automatic

Caching:
Caching level: Standard
Browser cache expiration: 2 hours
Always online: on

Crypto:
SSL: Flexible (for our website, but we override it in the Page rule)
Opportunistic encrypt: on
TLS 1.3: on
Automatic HTTPS Rewrites: on

Traffic:
Argo: on

Network:
HTTP/2 + SPDY: on
Websockets: on

We use force HTTPS on Galaxy, Cloudflare’s WAF, and that’s basically it.

PS: We use Cloudflare Pro, but for months we used Free and it just worked too.


TL;DR: A CDN for the static assets was the answer.

I successfully set up AWS CloudFront to serve my app assets (JS bundle, CSS, images, and everything else in /public) per the Galaxy docs above. It’s working great and I’m getting vastly better performance. Very reasonable, especially if I switch off Galaxy at some point in the future. I also think moving to redis-oplog and doing additional optimization in my app code will help with CPU pressure (I will definitely continue testing).
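For anyone following along, here is a minimal sketch of what the CDN setup amounts to on the Meteor side; the CloudFront domain is a placeholder, and the Galaxy/Meteor CDN docs describe the exact recommended configuration:

```js
// Hedged sketch: rewrite Meteor's bundled JS/CSS URLs to point at a CDN whose
// origin is the app itself. The distribution domain below is a placeholder.
import { Meteor } from 'meteor/meteor';
import { WebAppInternals } from 'meteor/webapp';

const CDN_URL = 'https://d1234example.cloudfront.net';

if (Meteor.isProduction) {
  WebAppInternals.setBundledJsCssUrlRewriteHook(
    (url) => `${CDN_URL}${url}`
  );
}
```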

FWIW adding a CDN was on my roadmap, just hadn’t done it yet. I thought because the page was loading and I was seeing my “loading animation” that the problem must be with Mongo, Compose, or the Oplog in general. I didn’t consider that all the other simultaneous users downloading the JS bundle at the same time would block retrieving data from Mongo for the users who had downloaded the bundle already.

Here’s an updated performance graph for my app. 50 simultaneous users on 1 Galaxy Compact now run with a 4s response time for all users, versus up to 60s without a CDN. That’s a 15x improvement in performance. I’m now getting the same performance from one Galaxy Compact container (0.5 ECU) that I previously got from one Galaxy Double container (2.1 ECU). That’s a 4x improvement in cost. Notice the slowest asset is now the HTML file for my app. Something else I noticed: doubling the containers, or doubling a single container’s size, increased performance by 4x-7x (way more than 2x). The CPU still pegs a lot, but my app never crashes, it just slows down a little. Again, the CPU pegging will probably be reduced by additional code optimizations:

Thanks for all the help and responses. Marking this as SOLVED. I do have a couple questions though:

  • I can see the slowest asset served is now the original HTML document that the URL delivers, which loads everything else. Why does this not also get served from CloudFront? I noticed its URL is still my app’s domain (unlike almost everything else, which is on my CloudFront URL).

  • My bundled JS file is now served by CloudFront - which is great. But I’ve read in other posts that there’s a way to take the really common JS packages (e.g. jquery, moment, etc.) and link to them from their respective third-party CDNs, thus reducing the size of your bundled JS package. This also allows the browser to use the package immediately if it’s already cached. How is this achieved? Does it have to do with Meteor 1.5 dynamic importing?

  • @sbr464 Great list of optimizations for production and the GCE write-up. In general, why do you suggest using smaller servers and scaling the number of instances first versus their size? On Galaxy, the ECUs and price all seem to equal out, and the performance is almost the same looking at my graph above. It also seems a higher-ECU instance would handle random simultaneous spikes better than several lower-ECU instances. Just curious about this. Also, I’d like to learn more about setting up a backend app to handle Meteor Methods and data processing - are there any good posts about this? And also about setting up a cache for recurring data. Know any good posts?

artillery doesn’t seem to work well for Meteor apps. It seems to test the same way as JMeter: it fires HTTP GETs but doesn’t actually run any JS, so at least in my app and in todos it’s not really exercising the app (no routing, pub/sub, method calls, database requests, etc.). Consequently the server works wonderfully, as it’s just serving the small HTML file. Galaxy also doesn’t register any connections (same with JMeter).
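(For what it’s worth, one way to generate load that actually runs the client JS and opens DDP connections is to drive headless browsers. The sketch below uses puppeteer purely for illustration; it is not what anyone in this thread reported using, and the URL and user count are placeholders.)

```js
// Hedged sketch: simulate real users by opening headless browser pages, so the
// bundle is downloaded, the Meteor client boots, and subscriptions start.
const puppeteer = require('puppeteer');

const APP_URL = 'https://yourapp.com/'; // placeholder
const USERS = 20;                       // number of simulated users

(async () => {
  const browser = await puppeteer.launch({ headless: true });

  // One page per simulated user.
  const pages = await Promise.all(
    Array.from({ length: USERS }, async () => {
      const page = await browser.newPage();
      await page.goto(APP_URL, { waitUntil: 'networkidle0' });
      return page;
    })
  );

  // Keep the sessions open for a while so the server sees sustained load.
  await new Promise((resolve) => setTimeout(resolve, 60 * 1000));

  await Promise.all(pages.map((p) => p.close()));
  await browser.close();
})();
```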


:open_mouth: This explains a lot.

Yep. So a micro instance is running at ~500 MHz, or possibly more; sometimes the underlying provider will allow brief spikes of 2x usage, etc. But that is the baseline. That’s why having more instances helps.

Yes, you are correct. I’ll post back what we use for headless/browser-environment testing.

Thank you, fantastic! The query string value was probably it, I’ll toy around with this today.


Yeah, I had read that article long before posting this thread. That’s kind of a Meteor Scaling 101 beginner’s guide (e.g. add indexes, enable oplog-tailing, etc.). The issue I discussed above was different. Notice none of these “Scaling Meteor” articles mention enabling a CDN to serve your bundle, which you should absolutely do.

Once I finish my scaling saga (I’m almost there), I plan to write a comprehensive article on all the tests, approaches, and strategies I ultimately found successful. I’ve gone from pub-sub to all Meteor Methods, back to pub-sub, and then to redis-oplog… it’s been a long journey. Here’s one that almost no one talks about that’s been a HUGE gain for me (mostly because of my app’s use-case): database caching on the server. Cache shared data that a lot of users request.


We’d be really interested to read that @evolross, this thread has already been very enlightening. Many thanks for posting everything here.

I’m particularly curious about the database caching. We’ve talked about doing something along those lines a few times but never went anywhere with it. Are there any purpose-built packages that can cache a collection?

If Meteor really has problems with 40 simultaneous logins, what is the official statement from the Meteor Galaxy team? I mean, if that were me, I would be a pain in the a…

I think if there were an official statement on this, it would be something to the effect that every application is different and will thus need engineering applied in areas specific to its particular architecture and usage patterns to maximize performance.

There are, however, general areas of concern that can be tackled in all applications, and I think the resources touched on in this thread cover those pretty well.


I don’t know of any packages that can cache a collection, but the grapher package offers cached queries (kind of like a Meteor Method that can cache return values). I read the grapher docs and the code behind its caching, and it’s very simple.

It’s basically just an object dictionary in server-only code that you store values in. You can use a key/value pair or any kind of id/hash to keep various blocks of data separated. If your server restarts, it gets reset, which is fine as it’s just a cache and will get repopulated as soon as the first client needs data.

So if you had a Meteor Method like getChatroomHistory(chatroomId) (oversimplified example), you would create a cache object like var cachedChatroomHistories = {}. Then in that method you would first check whether the cache object contains that chatroom’s id (and thus its history dataset). If so, get the data from there and return it; if not, query for it, put it in the cache, and return it. After querying you always add a setTimeout to delete and clear out that dataset - a TTL of sorts to keep the data fresh. This works really well and is very simple to implement.
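A minimal sketch of that pattern, using the hypothetical getChatroomHistory example; the Messages collection, field names, and TTL are assumptions for illustration:

```js
// Hedged sketch of a server-only method cache with a setTimeout-based TTL.
// Put this in server-only code.
import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';
import { check } from 'meteor/check';

const Messages = new Mongo.Collection('messages'); // example collection
const HISTORY_TTL_MS = 60 * 1000;                  // keep each cached dataset for 1 minute
const cachedChatroomHistories = {};                // reset whenever the server restarts

Meteor.methods({
  getChatroomHistory(chatroomId) {
    check(chatroomId, String);

    // Serve from the in-memory cache if this chatroom's history is already there.
    if (cachedChatroomHistories[chatroomId]) {
      return cachedChatroomHistories[chatroomId];
    }

    // Otherwise hit Mongo once, cache the result, and schedule its expiry.
    const history = Messages.find(
      { chatroomId },
      { sort: { createdAt: 1 }, fields: { text: 1, userId: 1, createdAt: 1 } }
    ).fetch();

    cachedChatroomHistories[chatroomId] = history;
    Meteor.setTimeout(() => {
      delete cachedChatroomHistories[chatroomId];
    }, HISTORY_TTL_MS);

    return history;
  },
});
```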

What I’ve found is that straight-up Meteor Methods that query the database will always hit Mongo (unlike pub-sub). So if you have 1000 users in a chat and they need the static, non-reactive history, and you use a Meteor Method (which is logical), you’re querying the database 1000 times for the same data. If you go pub-sub, which does cache the data for the same query and thus saves hits to Mongo, you get all the overhead of pub-sub, which you don’t need because the data is static. And I found out one of those bits of overhead is that the Meteor server duplicates the subscription data for every client, because it keeps a copy of the data each client is subscribed to. This may not be a lot of data (1 KB x 1000 users is only ~1 MB of RAM), but it’s just annoying, especially if that data is truly static and doesn’t need the benefits of reactive pub-sub. So I’ve found the way to go is Meteor Methods using caches. Again, all this is useful when you need to deliver the same data set to many clients. Even if it updates frequently, you can save a TON of processor time and database calls by caching and polling/re-calling the Meteor Method that gets the data to the client.

@xvendo Did you read this thread? I solved the issue. So no need for a statement by MDG.


I am having the same problem with CPU spiking on initial load. The thing is – I am already using Cloudflare and a service worker so nearly none of my clients hit my servers for their assets. I think this problem remains and that adding a CDN probably just kicked that can down the performance road a bit.

My app is highly complex. Each user has ~23 subscriptions, 4 of which are “reactive” in that they depend on data from multiple collections to publish their documents. I publish a few hundred KB to 1 MB of data per client. I have optimized just about everything I can think of:

  • All assets served from CDN
  • Serviceworker caches static assets on client
  • All collections have indexes as appropriate
  • I’ve tried both regular oplog and Redis oplog
  • I’ve got an Nginx load balancer with sticky sessions
  • Each instance has 1 vCPU (Google Kubernetes Engine) and 1.5 GB of RAM available
  • I only publish the necessary fields
  • I try not to use any non-standard meteor packages other than Kadira and synced-cron. I rewrote all of my reactive publications by hand just to remove reactive-publish.
  • I’m using the latest meteor version
  • My background jobs (using synced-cron) run on a separate instance that doesn’t serve client traffic
  • My database has ample overhead

Despite all of this, I see the exact same symptoms as @evolross. If I scale down my number of instances so that more than ~55 clients reconnect and start their publications at the same time on one instance, then the instance’s CPU spikes to 100% and hangs there as response time goes through the roof, and the instance is eventually killed by my health checker. If an instance survives the initial spike, it chugs along happily at 10-20% CPU usage with 50 sessions on it. If it weren’t for the startup spike, I could probably handle more than 200 users per instance.

I’m running short on ideas. As far as I can tell I’m doing nothing wrong, and it is simply the fact that Meteor publications are too expensive for a single process to survive more than ~1000 observers (50 users x 20 pubs) starting at the same time.

I’m considering refactoring parts of my app to use Apollo in an effort to avoid the Meteor publication death load, but I would love to avoid undertaking that huge project if I don’t have to. Any ideas?