As most of you know, Meteor scaling has been a constant source of FUD inside and outside the community, and many of you here have managed to build and scale Meteor applications in production.
In the same spirit as the "What do you expect from your Meteor PWA" thread, I think it is worth documenting the best practices and insights this community has around scaling.
Thus, if you've managed to scale a Meteor app in production, I would really appreciate it if you could take a minute to answer these:
Any deep insights you think would be worth sharing with the community?
What are the common pitfalls one should avoid?
Any advice, tools, packages, best practices you would like to share with the community?
Any articles, references or general performance/scaling tips you would like to share or learn about?
Your answers would help me to start writing a manual/tutorial on this subject. And please keep the discussion constructive and positive.
I'll start myself with tips I've learnt along the way:
If you don't need real-time, then just use Meteor methods or Apollo. This will make Meteor as scalable as any other socket-based Node app (see the sketch after this list).
Do use out-of-the-box pub/sub to validate ideas quickly, but don't expect it to scale without any tweaks.
If you have a lot of per-user pub/sub observers and you don't want to worry about the technicalities of scaling, then I'd recommend Galaxy.
Use as many re-usable publications as you can; the cost of scaling those is minimal compared to user-specific publications (from my tests, user-specific publications are on average around 6 times more expensive in memory than generic re-usable publications).
If you have many user-specific publications, then I'd either switch to Redis or really bump up the memory per VM.
If you expect large traffic spikes, then you need to overprovision the VMs or cluster Meteor instances across the CPUs. Remember Node is single-threaded, so by default it doesn't leverage multiple CPUs; you either fork multiple Meteor processes or scale the app horizontally with many small nodes and put a load balancer in front of them. This is true for all Node applications.
If you really want full control over your reactivity, try the awesome redis-oplog package by cult-of-coders.
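To make the first tip concrete, here's a minimal sketch of loading a list through a Meteor method instead of a publication (the Tasks collection, its fields and the method name are made up for illustration):

```js
import { Meteor } from 'meteor/meteor';
import { check } from 'meteor/check';
import { Tasks } from '/imports/api/tasks'; // hypothetical collection module

// Server: a plain method instead of a publication. The client gets a one-off
// snapshot over DDP and nothing stays observed on the server.
Meteor.methods({
  'tasks.forProject'(projectId) {
    check(projectId, String);
    return Tasks.find(
      { projectId },
      { fields: { title: 1, status: 1 }, limit: 200 }
    ).fetch();
  },
});

// Client: Meteor.call('tasks.forProject', projectId, (err, tasks) => { ... });
```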
This may drastically increase the number of reused observers and, in the end, greatly reduce both DB and server pressure. I once split such a publication into 4 (mixed ~> global, organisation, month, user) and reduced costs by 30% (we switched to a smaller DB instance and reduced the number of containers).
Thank you for such valuable information. I've been looking for a solution to this problem for a long time and I finally found one. Really appreciate your replies!
It often makes a big difference to increase the memory limit on the Meteor server. You can do this by setting the NODE_OPTIONS environment variable to --max-old-space-size=whatever in production for the server process.
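A minimal sketch of that (the 4096 is just a placeholder; pick a value that fits your container/VM, and the verification snippet is optional):

```js
// Start the server with something like:
//   NODE_OPTIONS="--max-old-space-size=4096" node main.js
// Then sanity-check the new limit from inside the app at startup:
import v8 from 'v8';

const limitMb = v8.getHeapStatistics().heap_size_limit / (1024 * 1024);
console.log(`V8 old-space heap limit: ~${Math.round(limitMb)} MB`);
```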
Well, the best advice that I can give is to have separate apps for Frontend and Backend. That allows us to scale independently and thus reduce cost.
Right now we're able to run 6 concurrent users on the Backend (which has very CPU-intensive tasks and lots of MongoDB ops) at 75% CPU and 70% memory utilization on the smallest AWS Fargate box.
At the same time, our Frontend was at less than 1% CPU and 30% memory. There's still room to optimize the Frontend further by taking a really critical look at where we can use remote calls/methods instead of reactivity.
Also, ALWAYS use fields in your query. I know it's a PITA to then get errors when you miss just one, but the benefit is that it minimizes the amount of data transferred.
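For example (a quick sketch; the Orders collection and its fields are made up):

```js
// Only ship the fields the client actually renders instead of whole documents.
Meteor.publish('orders.open', function () {
  return Orders.find(
    { userId: this.userId, status: 'open' },
    { fields: { status: 1, total: 1, createdAt: 1 } }
  );
});
```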
BTW, at the above load peak (so far) we had 10% Atlas CPU on the M10 resource. IOPS were at 150, so there is still some room before the 1000 IOPS limit they set.
I don't use pub/sub to load lists of documents. I use methods to load them.
If I need the list updated automatically, then I use another collection to store changes and subscribe to just one document from it. If there is any change, I use a method to fetch the list again.
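Roughly like this sketch (the ListVersions collection, the 'tasks' id and the tasks.fetchAll method are made up for illustration):

```js
import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';
import { check } from 'meteor/check';
import { Tracker } from 'meteor/tracker';

// Shared: one tiny document per list, bumped whenever that list changes.
const ListVersions = new Mongo.Collection('listVersions');

if (Meteor.isServer) {
  // Call this from the code paths that mutate the real list:
  // ListVersions.upsert({ _id: 'tasks' }, { $set: { updatedAt: new Date() } });

  Meteor.publish('listVersion', function (listId) {
    check(listId, String);
    return ListVersions.find({ _id: listId }, { limit: 1 });
  });
}

if (Meteor.isClient) {
  // Subscribe to the single version document; when it changes,
  // re-fetch the whole list with a method instead of subscribing to it.
  Meteor.subscribe('listVersion', 'tasks');
  Tracker.autorun(() => {
    const version = ListVersions.findOne('tasks');
    if (version) {
      Meteor.call('tasks.fetchAll', (err, tasks) => {
        // store `tasks` in whatever local state you use
      });
    }
  });
}
```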
Meteor works fine with load balancing. I'm currently running a set of many Meteor instances and MongoDB instances (replica set) on Google Cloud. I can turn off some servers to do upgrades, then turn them back on without downtime. The only problem is redis-oplog: it doesn't support Redis Cluster, so I will have a problem if that Redis server goes down.
I personally think that pub/sub remains the core critical component in scaling Meteor. If you can avoid it, use other techniques like Meteor methods instead. It's very useful but often the hardest to scale.
If you have data which changes unusually frequently but is not needed in a pub/sub, it might pay off to put it into a separate MongoDB cluster or even a different database. Meteor reads the oplog collection of your MongoDB cluster, so the more changes you have in your cluster, the more data your system will have to process. See: https://stackoverflow.com/questions/20535755/using-multiple-mongodb-databases-with-meteor-js
The package https://github.com/cult-of-coders/redis-oplog only communicates the changes happening from within Meteor - but if you also need to follow along with changes other systems make to the database, without having to send additional messages to Redis yourself (as described in https://github.com/cult-of-coders/redis-oplog/blob/master/docs/outside_mutations.md), you can combine redis-oplog with the Go application https://github.com/tulip/oplogtoredis. By default, every Meteor instance tails the oplog collection of your cluster; with redis-oplog alongside oplogtoredis, only the Go application reads the oplog collection. This should ease the load on the individual Meteor instances and on the database cluster.
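If it helps anyone, the redis-oplog side of that setup is configured through Meteor settings; a rough sketch (host/port are placeholders, and externalRedisPublisher is the option I believe the redis-oplog docs describe for the oplogtoredis case - double-check the current README before relying on it):

```json
{
  "redisOplog": {
    "redis": { "host": "127.0.0.1", "port": 6379 },
    "externalRedisPublisher": true
  }
}
```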
I recently ran into a situation where my processing-intensive instance was doing nothing 98% of the time, and when it needed to do something it took a long time (not good for the users, since they are waiting for the result). This is running on Galaxy on a small instance to keep costs down.
I did some analysis and CPU was the bottleneck during execution, so I ported this to AWS Lambda. Now this functionality is 10-20 times faster, and for the 2% of the time it's actually doing something, I get charged a tiny amount just for that. The rest of the time, when it's idle, there is no cost. It also scales infinitely and automatically.
I was able to massage my current Meteor app code to use a subset of the JS and the off-the-shelf MongoDB driver. I also added the batch methods of the MongoDB driver (insert multiple, bulk write multiple). This sped up the app by a lot. I also batched the "find" calls to fetch a set of records instead of one at a time.
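For reference, the batch operations in question look roughly like this with the plain MongoDB Node driver (a sketch; the tasks collection and documents are made up):

```js
// Assumes `db` is a connected Db instance from the `mongodb` npm package.
const tasks = db.collection('tasks');

// Insert many documents in one round trip instead of one insert per document.
await tasks.insertMany([{ title: 'a' }, { title: 'b' }, { title: 'c' }]);

// Mix updates/inserts/deletes in a single batch.
await tasks.bulkWrite([
  { updateOne: { filter: { title: 'a' }, update: { $set: { done: true } } } },
  { deleteOne: { filter: { title: 'c' } } },
]);

// Fetch a whole set of records at once instead of one findOne per id.
const ids = ['id1', 'id2', 'id3'];
const docs = await tasks.find({ _id: { $in: ids } }).toArray();
```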
I wish the Meteor code would run out of the box on Lambda (without the pub/sub) and I could link Lambda functions to it.
I would love to see out-of-the-box support in Meteor for AWS Lambda.
Imagine this:
Lambda - hosts anything processing-intensive that you want to parallelize and scale infinitely on demand.
Regular Docker Meteor (e.g. Galaxy) - provides reactive pub/sub and quick API calls.
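To sketch how the two halves could talk to each other (assumptions: AWS SDK v3, a hypothetical schedule-tasks Lambda function and payload shape):

```js
import { Meteor } from 'meteor/meteor';
import { check } from 'meteor/check';
import { LambdaClient, InvokeCommand } from '@aws-sdk/client-lambda';

const lambda = new LambdaClient({ region: 'us-east-1' });

Meteor.methods({
  async 'schedule.recompute'(userId) {
    check(userId, String);
    // Offload the CPU-heavy scheduling to Lambda and return its result.
    const response = await lambda.send(new InvokeCommand({
      FunctionName: 'schedule-tasks',
      Payload: Buffer.from(JSON.stringify({ userId })),
    }));
    return JSON.parse(Buffer.from(response.Payload).toString());
  },
});
```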
The app is in beta (with a small group of users) and schedules tasks into a person's calendar. It only needs to do the scheduling if something changes (tasks change or the calendar changes), but when it does, there is a lot of processing for a short period of time.
Yes, got it. I was curious about the CPU-intensive use case; I will definitely check out the app.
Have you thought about using worker threads for the CPU-intensive task?
I did something similar with an image processing function. Images were uploaded to storage directly from the client (to avoid eating the server's RAM), and the processing (compression and manipulation) was done using Google Cloud Functions to protect the server's CPU.
I usually try to offload any CPU- or RAM-intensive tasks off the server machine (especially Node.js servers) and restrict the server to serving results.
Thanks for sharing again. I think it's common; I've had my share of those.
No, I did not think of using worker threads. I'm assuming you mentioned that for parallel execution? (The CPU was already pegged on Galaxy, so it would not have helped.)
I was using a separate instance from the user-facing instance to prevent impact (which I think is what you mentioned), but it was still too slow, and scaling it would have been a challenge plus additional cost.
Lambda was a win for this use case: much better on all fronts (much faster, lower cost, automatic scaling/parallel execution).
If you are doing lots of file I/O or if you are making heavy use of the database, make sure your dedicated server has an SSD, preferably an NVMe SSD. Hetzner and OVH offer dedicated servers with NVMe SSDs. There are a few other providers as well.
Use Nginx as a reverse-proxy in front of your Meteor app. Nginx will do what it is most efficient at - terminating your HTTPS connection and serving static assets. You want to avoid the node process having its time needlessly wasted.
Our Meteor deployment script also compresses static assets on disk using the brotli compressor. Nginx's brotli module has an option brotli_static that enables Nginx to automatically serve the compressed version of assets from disk by looking for files with the .br extension, instead of having to waste its CPU time compressing the asset on-the-fly.
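The precompression step can be as simple as a small Node script run at deploy time (a sketch; the bundle path is a placeholder, and a real deployment may just shell out to the brotli CLI instead):

```js
const fs = require('fs');
const path = require('path');
const zlib = require('zlib');

// Precompress every .js and .css asset so Nginx can serve the .br file directly.
const assetDir = '/opt/myapp/bundle/programs/web.browser';
for (const file of fs.readdirSync(assetDir)) {
  if (file.endsWith('.js') || file.endsWith('.css')) {
    const source = fs.readFileSync(path.join(assetDir, file));
    fs.writeFileSync(path.join(assetDir, file + '.br'), zlib.brotliCompressSync(source));
  }
}
```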
If you can, make all subsystems running on the same server communicate with each other using UNIX sockets instead of TCP sockets. Communication over UNIX sockets incurs significantly less overhead.
In our deployments, Nginx passes requests on to Meteor via Meteor's UNIX socket file (e.g. /var/run/meteor/meteor.sock), and Meteor talks to MySQL via MySQL's UNIX socket (e.g. /var/lib/mysql/mysql.sock).
To configure Meteor to listen on a UNIX socket, specify the UNIX_SOCKET_PATH environment variable.
Other common services like MongoDB, PostgreSQL, Redis and Memcached can be configured to listen on UNIX sockets as well.
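For example, the MongoDB Node driver (which Meteor uses under the hood) accepts a percent-encoded UNIX socket path in place of host:port; a minimal sketch, assuming mongod listens on /tmp/mongodb-27017.sock:

```js
import { MongoClient } from 'mongodb';

// The socket path is percent-encoded where host:port would normally go.
const client = new MongoClient('mongodb://%2Ftmp%2Fmongodb-27017.sock/myapp');

async function main() {
  await client.connect();
  console.log('Connected to MongoDB over a UNIX socket');
  await client.close();
}
main();
```

For a Meteor app, the same connection string would go into the MONGO_URL environment variable.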
Cloudflare supports proxying WebSockets so it works well with Meteor apps. It is useful for:
Preventing the IP address of your origin server from being exposed,
Providing some protection against DDoS attacks,
Serving static assets from Cloudflare's edge locations closest to the user, reducing your origin server's data usage,
Quickly diverting traffic to a hot standby Meteor app server via Cloudflare's API if you have a server outage, without any DNS propagation delay.
Cloudflare also offers a load balancer if you have a cluster of Meteor app servers and need to distribute traffic between them.
If you are using Cloudflare with Meteor, you should modify your Meteor startup script to set the environment variable HTTP_FORWARDED_COUNT=1
If you are using Cloudflare with Nginx in front of Meteor, you should set HTTP_FORWARDED_COUNT=2 instead (one hop for Cloudflare, one for Nginx).
Scaling/performance tips that DO require changes to your code
Do everything you can to avoid running CPU intensive code in the Node.js event loop. Instead, such code should run asynchronously in the thread pool.
Up until recently, this often had to be done by writing a Node Addon in C++ using Native Abstractions for Node (NaN) or N-API. This approach is commonly used by number-crunching code that has to run at top speed, e.g. high performance crypto packages like bcrypt and shacrypt.
However, thanks to the introduction of the node.js worker_threads module combined with the SharedArrayBuffer data type, it has become more practicable to write thread pool code in JavaScript and get acceptable performance.
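A minimal worker_threads sketch, assuming a CPU-heavy function you want off the event loop (the fib function here is just a stand-in for your real workload):

```js
// worker.js - runs in a separate thread, off the main event loop.
const { parentPort, workerData } = require('worker_threads');

function fib(n) {
  return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

parentPort.postMessage(fib(workerData.n));
```

```js
// main.js (or inside a Meteor method) - spawn the worker and await its result.
const { Worker } = require('worker_threads');

function fibInWorker(n) {
  return new Promise((resolve, reject) => {
    const worker = new Worker('./worker.js', { workerData: { n } });
    worker.once('message', resolve);
    worker.once('error', reject);
  });
}

fibInWorker(40).then((result) => console.log('fib(40) =', result));
```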
There is also node's inbuilt cluster module that allows you to spawn multiple node.js (Meteor) worker processes. It has its uses and I have commented on it in the past. Today, I would advise people to first try and solve their problems using the thread pool and only use the cluster module as a last resort.
A few days ago I discovered the NPM package threads.js, which claims to "make web workers & worker threads as simple as a [JavaScript] function call".
I haven't had to use it yet, but I will definitely give it a try with Meteor at some point.
Great tips. I also want to remind all of us (including myself, because even after 26+ years of doing this I still forget): profile your code before deciding what to optimize (i.e. find the root cause of why things are slow).
For me, the root cause of many slow things was the code itself (fetching or updating one record at a time in the DB). CPU was also a limit, and no amount of threading would help with that, so I went to AWS Lambda to run some of the code for the specific situation I had.