First Steps on Scaling Meteor JS

How We Scaled Meteor JS to Handle 30,000 Concurrent Users at Propiedata

Scaling Meteor JS is both an art and a science. At Propiedata, a property management platform with features like virtual assemblies, dashboards, and real-time voting, chat, reactions and participation queues, we successfully scaled our app to handle peaks of 30,000 concurrent users. Here’s how we did it and the lessons learned along the way.

1. Move Heavy Jobs Out of the Main App

Offloading resource-intensive tasks from the main (user facing) application reduces server load and improves the responsiveness of methods and subscriptions. Using external job queues or microservices ensures more stable and dependable performance, especially during peak times.

So what we moved off from the main app?

  • Bulk imports
  • Analytics aggregations
  • Real time data aggregations
  • PDF/HTML rendering
  • Batch data cleansing
  • Batch email sending
  • Puppeteer page crawling
  • Large data reads and documents creation

2. Favor Methods Over Publications

Meteor’s publications can be expensive in terms of server and database resources. While they are powerful, they aren’t always necessary for all types of data. Switching to methods:

  • Reduces load on your servers.
  • Improves response times.
  • Performance is more stable and dependable.
  • Optimizes performance for complex queries.
  • Methods can be easily cached.

What did we fetch with methods?

Everything, we just subscribed to data that required real time data like poll results, chats, assembly state and participation queues.

3. Optimize MongoDB Queries

Efficient database queries are the backbone of scaling any app. Here’s what worked for us:

  • Indexes: Use compound indexes tailored to how your data is queried.
  • Selective Fields: Only retrieve the fields you need.
  • Avoid Regex: Regex queries can be a performance killer.
  • Secondary Reads: Offload read operations to secondary replicas when possible.
  • Monitor Performance: Regularly check for long-running queries and eliminate n+1 issues.
  • Too Many Indexes: Having too many indexes can hurt your write performance.
  • ESR Rule: When creating an index the Equality fields go first, that Sort and at last Range, we will go deeper later.
  • MF3 rule: Most filtering field first, that means that in any query filter a field that filters more should go first.

4. Implement Redis Oplog

Switching to Redis Oplog was a game-changer. It significantly reduced server load by:

  • Listening to specific changes through channels.
  • Publishing only the necessary changes. This approach minimized the overhead caused by Meteor’s default oplog tailing.
  • Debounce requerying when processing bulk payloads.

5. Cache Frequently Used Data

Caching common queries or computationally expensive results dramatically reduces database calls and response times. This is particularly useful for read-heavy applications with repetitive queries.

We used Grapher so that made it easy to cache data in redis or memory.

Don’t make the same error we did at first caching also the firewall or security section of the method calls :man_facepalming:. (We did this before using Grapher)

6. General MongoDB Principles

To get the most out of MongoDB:

  • Always use compound indexes.
  • Ensure every query has an index and every index is used by a query.
  • Filter and limit queries as much as possible.
  • Follow the Equality, Sort, Range (ESR) rule when creating indexes.
  • Prioritize the field that filters the most for the first index position.
  • Always secure access to your clusters.
  • Use TTL indexes to expire your old data.

What is the ESR rule?

The ESR Rule is a guideline for designing efficient indexes to optimize query performance. It stands for:

  1. Equality: Fields used for exact matches (e.g., { x: 1 }) should come first in the index. These are the most selective filters and significantly narrow down the dataset early in the query process.
  2. Sort: Fields used for sorting the results (e.g., { createdAt: -1 }) should be next in the index. This helps MongoDB avoid sorting the data in memory, which can be resource-intensive.
  3. Range: Fields used for range queries (e.g., { $gte: 1 }) should come last in the index, as they scan broader parts of the dataset.

What is the MF3 rule?

Well I just named it that way at the moment of writing, but this rule prioritizes fields that filter the dataset the most at the beginning of the index. Think of it as a pipeline: the more each field filters the dataset in each step, the fewer resources the query uses in the less performant parts, like range filters. By placing the most selective fields first, you optimize the query process and reduce the workload for MongoDB, especially in more resource-intensive operations like range queries.

7. Other Key Improvements

  • Rate Limiting: Prevent abuse of your methods by implementing rate limits.
  • Collection Hooks: Be cautious with queries triggered by collection hooks or other packages.
  • Package Evaluation: Not every package will perfectly fit your needs—adjust or create your own solutions when necessary.
  • Aggregate Data Once: Pre-compute and save aggregated data to avoid repetitive calculations.

8. The Result: Performance and Cost Efficiency

These optimizations led to tangible results:

  • Cost Reduction: Monthly savings of $2,000.
  • Peak Capacity: Serving 30,000 concurrent users for just $1,000/month.

Quick Recap

If you’re looking to scale your Meteor JS application, here are the key takeaways:

  • Offload heavy jobs to external processes.
  • Use methods instead of publications where possible.
  • Optimize MongoDB queries with compound indexes and smart schema design.
  • Leverage Redis Oplog to minimize oplog tailing overhead.
  • Cache data to speed up responses.
  • Think “MongoDB,” not “Relational.”

Almost forgot

We use AWS EBS to deploy our servers, with 4Gb memory and 2vCPUs. Its configured to autoscale, having in mind that nodeJS uses only one vCPU, memory is almost always at 1.5gb. And for MongoDB we use atlas, this also autoscales but it has an issue, autoscaling takes about an hour to scale up when it has a heavy load, so we created a system that predicts usage given the amount of assemblies we have and scales mongo servers accordingly for that period.

I found the presentation we did at Meteor Impact when we were at 15,000 peak concurrent users.

First Steps on Scaling Meteor JS

I hope this help someone, and gives some peace of mind to others that dont think Meteor can scale easily. What I have seen is that most devs just think they can get away with bad design and Meteor because of the real time first approach just make it easier for this issue to be noticed.

29 Likes

Thanks for posting this great info.

2 Likes

Thanks for sharing. Can we pin this somewhere?

@pmogollon, do you remember how many i stances you needed to serve 30k concurrent users?

1 Like

@pmogollon I really liked the way you structured it. Inspired by the “mobile first” concept, I started thinking of code and architecture design in terms of “global first”. The most difficult part of going truly global is DB sharding and staying compliant with personal data geography (e.g. keep al personal data of a European in Europe, but Italians in Italy, US in US etc … based on the country law).
I am curious what you are running in Atlas on a “normal” day and when you are scaled and the parity between number of concurrent users and the number of DB connections you see in the respective Atlas graph. Also :slight_smile: what is your maxPoolSize in the DB connection setting? In fact, if you can, would be great to see your entire connection settings. I use this, but I am ATM nowhere near your traffic:

const options = '?retryWrites=true' +
  '&maxIdleTimeMS=5000' +
  '&maxPoolSize=30' + // default 100
  '&readConcernLevel=majority' +
  '&readPreference=secondaryPreferred' +
  '&w=majority' +
  '&heartbeatFrequencyMS=15000'
3 Likes

Really useful post. Much appreciated!

1 Like

Thank you for sharing. :clap:

1 Like

Thank you for sharing that with us, Paulo!

I just sent you a message, let’s turn this awesome content into an article :slight_smile:

2 Likes

We have not gotten to a point where we need sharding, and fortunately we dont need to have personal data of users on different locations as we only serve clients on Latin America, mainly Colombia.

So for a normal day we run an M20 server, although we could run an M10, but for our case it was on the limit and we were not able to scale it fast enough when traffic started flooding the app servers.

When we do expect a large load we scale to an M30 and then is automatically scaled by atlas, unless we expect and extremely high amount of users so we scale to a M40. We do this automatically with a cron job as we can know what is going to be the expected number of assistant to each of the assemblies.

Knowing the number of expected assistants is something that makes it “easy” to scale but also a bit hard. Easy because we can prepare ahead of time manually as we did at first, or automate it as we do know. But is also hard because our users for the assemblies might connect all at the same time, which makes the scaling groups to not be able to react fast and fail fast.

Automatic cluster scaling config

Minimium cluster size: M20
Maximum cluster size: M50
Storage scaling: true

We used to have a large minPoolSize around 50 at somepoint, but we learnt that that was not helping and probably hurting the performance of the db. So now we have the default maxPoolSize and minPoolSize. Our connection string is just the default ?retryWrites=true&w=majority. We do read from secondaries but only when it makes sense, for data that doesnt change that much or we get a value that is a bit outdated it doesnt create an issue. That is the case of the users and roles queries.

I dont have that number, but we know that our servers can handle a load of around 1,500 users. But this depends on a lot of factors, specially how much users interact in the assembly. It could be around 25 to 30 on those peaks, which are not the norm.

One thing that was very important for us was setting VPC peering, we used to have our db servers exposed to the internet thinking that having a hard to guess user and password was enough. But we found out that large part of our usage was automated scripts trying to connect to our db. Each time someone or something tries to login it uses some resources hashing or doing some crypto stuff, if you multiply that resource usage for thousands of times that a script can be trying to crack your password it ends up doing a lot of harm.

So use VPC peering or at least set the IP Access List in your atlas config, or anywhere you are hosting your DB.