[SOLVED] Poor Galaxy Meteor Performance Serving Small Bursts of Users Load Test


#41

We’d be really interested to read that @evolross, this thread has already been very enlightening. Many thanks for posting everything here.

I’m particularly curious about the database caching. We’ve talked about how to do something along those lines a few times but never went anywhere with it. Are there any purpose-built packages that can cache a collection?


#42

If Meteor really has problems with 40 simultaneous logins, what is the official statement from the Meteor Galaxy team? I mean, if that were me, then I would be a pain in the a…


#43

I think if there were an official statement on this, it would be something to the effect that every application is different and will thus need engineering applied in areas specific to its particular architecture and usage patterns to maximize performance.

There are, however, general areas of concern that can be tackled in all applications, and I think the resources touched on in this thread cover those pretty well.


#44

I don’t know of any packages that can cache a collection, but the grapher package offers cached queries (kind of like a Meteor Method that can cache return values). I read the grapher docs and the code behind its caching, and it’s very simple.

It’s basically just an object dictionary in server-only code that you store values in. You can use a key/value pair or any kind of id/hash to keep various blocks of data separated. If your server restarts, the cache gets reset, which is fine: it’s just a cache and will be repopulated as soon as the first client needs data.

So if you had a Meteor Method like getChatroomHistory(chatroomId) (an oversimplified example), you would create a cache object like var cachedChatroomHistories = {}. Then in that method you would first check whether the cache object contains that chatroom’s id (and thus its history dataset). If so, return the data straight from the cache; if not, query for it, put it in the cache, and return it. After querying, you always add a setTimeout to delete that dataset, a TTL of sorts that keeps the data fresh. This works really well and is very simple to implement.
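
Here’s a minimal sketch of that pattern, assuming a Messages collection with chatroomId/createdAt fields and a 60-second TTL (all of which are illustrative, not from the actual app):

```js
import { Meteor } from 'meteor/meteor';
import { check } from 'meteor/check';
import { Messages } from '/imports/api/messages'; // hypothetical collection

const cachedChatroomHistories = {}; // server-only, in-memory cache
const CACHE_TTL_MS = 60 * 1000;     // clear entries after one minute

Meteor.methods({
  getChatroomHistory(chatroomId) {
    check(chatroomId, String);

    // Cache hit: every client after the first gets the data without touching Mongo.
    if (cachedChatroomHistories[chatroomId]) {
      return cachedChatroomHistories[chatroomId];
    }

    // Cache miss: query Mongo once and store the result.
    const history = Messages.find(
      { chatroomId },
      { sort: { createdAt: -1 }, limit: 100 }
    ).fetch();
    cachedChatroomHistories[chatroomId] = history;

    // TTL: delete the entry so the next caller repopulates it with fresh data.
    Meteor.setTimeout(() => {
      delete cachedChatroomHistories[chatroomId];
    }, CACHE_TTL_MS);

    return history;
  },
});
```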

What I’ve found is that straight-up Meteor Methods that query the database will always hit Mongo (unlike pub-sub). So if you have 1000 users in a chat and they need the static, non-reactive history, and you use a Meteor Method (which is the logical choice), you’re querying the database 1000 times for the same data.

If you go pub-sub instead, which does cache the data for the same query and thus saves hits to Mongo, you get all the overhead of pub-sub, which you don’t need because the data is static. And I found out that one of those bits of overhead is that the Meteor server duplicates the subscription data for every client, because it keeps a copy of the data each client is subscribed to. This may not be a lot of data (1 KB x 1000 users is only ~1 MB of RAM), but it’s annoying, especially if that data is truly static and doesn’t need the benefits of reactive pub-sub.

So I’ve found the way to go is Meteor Methods backed by caches. Again, all of this is useful when you need to deliver the same dataset to many clients. Even if the data updates frequently, you can save a ton of processor time and database calls by caching and polling/re-calling the Meteor Method that gets the data to the client.

@xvendo Did you read this thread? I solved the issue. So no need for a statement by MDG.


#45

I am having the same problem with CPU spiking on initial load. The thing is, I am already using Cloudflare and a service worker, so nearly none of my clients hit my servers for their assets. I think this problem remains, and adding a CDN probably just kicked the can down the performance road a bit.

My app is highly complex. Each user has ~23 subscriptions, 4 of which are “reactive” in that they depend on data from multiple collections to publish their documents. I publish a few hundred KB to 1 MB of data per client. I have optimized just about everything I can think of:

  • All assets are served from a CDN
  • A service worker caches static assets on the client
  • All collections have indexes as appropriate (see the sketch after this list)
  • I’ve tried both regular oplog tailing and redis-oplog
  • I’ve got an Nginx load balancer with sticky sessions
  • Each instance has 1 vCPU (Google Kubernetes Engine) and 1.5 GB of RAM available
  • I only publish the necessary fields
  • I try not to use any non-standard Meteor packages other than Kadira and synced-cron. I rewrote all of my reactive publications by hand just to remove reactive-publish.
  • I’m using the latest Meteor version
  • My background jobs (using synced-cron) run on a separate instance that doesn’t serve client traffic
  • My database has ample headroom
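
On the index point, a minimal sketch of ensuring an index at startup; the Messages collection and its chatroomId/createdAt fields are hypothetical, carried over from the caching example above:

```js
import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';

const Messages = new Mongo.Collection('messages'); // hypothetical collection

Meteor.startup(async () => {
  // A compound index matching the publication's selector and sort order,
  // so the initial query never falls back to a collection scan.
  await Messages.rawCollection().createIndex({ chatroomId: 1, createdAt: -1 });
});
```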

Despite all of this, I see the exact same symptoms as @evolross. If I scale down my number of instances so that more than ~55 clients reconnect and start their publications at the same time on one instance, then that instance’s CPU spikes to 100% and hangs there while response time goes through the roof, until the instance is eventually killed by my health checker. If an instance survives the initial spike, it chugs along happily at 10-20% CPU usage with 50 sessions on it. If it weren’t for the startup spike, I could probably handle more than 200 users per instance.

I’m running short on ideas. As far as I can tell, I’m doing nothing wrong, and it is simply the fact that Meteor publications are too expensive for a single process to survive ~1000 observers (50 users x 20 pubs) starting at the same time.

I’m considering refactoring parts of my app to use Apollo in an effort to avoid the Meteor publication death load, but I would love to avoid undertaking that huge project if I don’t have to. Any ideas?


#46

Just as a sanity check: you say you’re using a CDN for “all assets”. Did you follow this article about also serving your Meteor JS bundle via CDN? That’s the root of the problem in this thread.
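
For reference, the usual approach looks something like the sketch below; the CDN_URL environment variable is an assumption for illustration, not something from the article:

```js
import { Meteor } from 'meteor/meteor';
import { WebAppInternals } from 'meteor/webapp';

// Rewrite the URLs of the bundled JS/CSS so clients fetch them from the CDN
// instead of hammering the app servers on every cold load.
Meteor.startup(() => {
  if (process.env.CDN_URL) {
    WebAppInternals.setBundledJsCssPrefix(process.env.CDN_URL);
  }
});
```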


#47

Yep, I don’t use the Meteor build tool. I build my bundles with webpack, which gives me a lot more flexibility. I definitely have everything possible cached by the service worker and CDN.


#48

I highly recommend taking a CPU profile while the performance issues are happening.
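
One way to do that in-process is a rough sketch like the following, using Node’s built-in inspector module (the 30-second window and the output path are arbitrary choices for illustration):

```js
import inspector from 'inspector';
import fs from 'fs';

// Capture a CPU profile for `durationMs` and write it to disk; the resulting
// .cpuprofile file can be loaded into Chrome DevTools for analysis.
export function captureCpuProfile(durationMs = 30 * 1000) {
  const session = new inspector.Session();
  session.connect();
  session.post('Profiler.enable', () => {
    session.post('Profiler.start', () => {
      setTimeout(() => {
        session.post('Profiler.stop', (err, result) => {
          if (!err) {
            fs.writeFileSync(
              '/tmp/meteor-spike.cpuprofile',
              JSON.stringify(result.profile)
            );
          }
          session.disconnect();
        });
      }, durationMs);
    });
  });
}
```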