Some scaling lessons I've learned growing to 120k+ users

Hey there! So I’ve been running my website w/ Meteor+React for the last two years, and I thought I’d share some useful tips that I’ve learned along the way with regards to scaling. Without further ado:

  • Offload background tasks (i.e., cron jobs) to a separate “slave” server that clients don’t connect to. My main website is hosted on Galaxy, but I set up a $15 DigitalOcean droplet and did a mup deploy with an environment variable that tells it to handle only cron jobs.

  • Avoid “Meteor.subscribe” wherever possible; use “Meteor.call” instead. It’s much faster, as well as being easier on the server’s CPU and memory. If using React, you can fetch the data in componentWillMount and set the state as follows:

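    componentWillMount() {
        // The arrow function keeps `this` bound to the component
        Meteor.call('analyticsSummary', (error, response) => {
            if (error) return console.error(error) // don't set state from a failed call
            this.setState({analytics: response})
        })
    }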
  • When fetching data, limit it to the fields you’ll use – things will load much faster. Try to do this as often as you can – even for background server methods that run as cron jobs.
    Chats.find({user:Meteor.userId()},{
        fields: {
            messages:{$slice:-40}
        }
    }).fetch()
  • Avoid saving large amounts of data to Meteor.users. Offload it to a separate collection, where possible [link]

  • Index all of your MongoDB queries – huge speedups here

    Meteor.startup(function() {
        Analytics._ensureIndex({'date':1})
    })
  • Use Kadira to find out where your slow methods and subscriptions are. Keep an eye out for high response times. To really home in on bottlenecks, I add console.log calls that measure how much time has passed since the method started.
    archivedNewToString() {
        var startDate = new Date()
        // One hour ago (plain Date arithmetic; Date has no built-in addHours)
        var date = new Date(startDate.getTime() - 1000 * 60 * 60)
        // Find any chats that are saved to the database as arrays
        var chats = Chats.find({
            'archived.0':{$exists:true},
            'date':{$gte:date},
        }).fetch()
        console.log('step 1',new Date() - startDate) // logs elapsed milliseconds
        // forEach rather than map: we only want the side effect of each update
        chats.forEach((chat) => {
            var archived = chat.archived.toString()
            Chats.update({_id:chat._id},{$set:{archived:archived}})
        })
        console.log('step 2',new Date() - startDate)
    }
  • When sifting through large amounts of data, aggregations are your friend. I use this package to enable them in Meteor.
    // The campaigns collection has nearly a million documents
    campaignsCompletedTrailingMonth() {
        this.unblock()
        var total = 0
        var time  = new Date().getTime()
        var date  = new Date(time - (1000 * 60 * 60 * 24 * 30))
        var campaigns = Campaigns.aggregate([
            { $match: {
                date:{$gte:date},
                'summary.expired':true,
                'summary.type':'premium',
            } },
            { $group: {
                _id: '$summary.type',
                completed: { $sum: { $size: { $ifNull: [ '$blogs', [] ] } } },
            } }
        ])
        // Guard against an empty result when no campaign matches
        total = campaigns.length ? campaigns[0].completed : 0
        Stats.upsert({'name':'campaigns-completed'},{$set:{'total':total}})
    }
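
The first tip (a dedicated cron server) has no code above, so here is a minimal sketch of how a single environment variable could switch cron jobs on. The variable name `CRON_WORKER` and both helper names are made up for illustration; the post doesn't say what the real deployment uses.

```javascript
// Hypothetical variable name: set CRON_WORKER=1 only on the background server
// (for example in the env section of its mup settings).
function isCronWorker(env) {
    return env.CRON_WORKER === '1' || env.CRON_WORKER === 'true'
}

// Calls `schedule` only when this process is the designated cron worker.
// `schedule` is whatever registers your jobs (setInterval calls, a cron package, ...).
function startCronJobs(env, schedule) {
    if (!isCronWorker(env)) return false
    schedule()
    return true
}

// Server usage:
// Meteor.startup(() => startCronJobs(process.env, () => {
//     Meteor.setInterval(() => Meteor.call('archivedNewToString'), 1000 * 60 * 60)
// }))
```

The Galaxy containers never set the variable, so they skip the jobs and only serve clients.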

Those are some of the more recent ones off the top of my head. Would love to hear any tips from y’all! And I’ll be sure to update this as I remember additional steps I took.


Thank you for sharing this information.

I would suggest an unordered bulk operation instead of single update calls in your archiving function.
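
A rough sketch of what that could look like, assuming the Node MongoDB driver's `bulkWrite` with `ordered: false`, which in Meteor is reachable through `Chats.rawCollection()`. The helper names `buildArchiveOps` and `archiveInBulk` are made up for illustration.

```javascript
// Build one updateOne operation per chat instead of issuing separate update calls.
function buildArchiveOps(chats) {
    return chats.map((chat) => ({
        updateOne: {
            filter: { _id: chat._id },
            update: { $set: { archived: chat.archived.toString() } },
        },
    }))
}

// `rawCollection` is expected to be the driver-level collection,
// e.g. Chats.rawCollection() on the Meteor server.
async function archiveInBulk(rawCollection, chats) {
    const ops = buildArchiveOps(chats)
    if (ops.length === 0) return null // the driver rejects an empty batch
    // ordered: false lets the server apply the updates in any order
    return rawCollection.bulkWrite(ops, { ordered: false })
}
```

All the updates then travel to the database in one round trip rather than one per chat.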


It looks like your overall database schema design might be suffering from a problem similar to the one with the profile field on users.

It looks like the chats collection holds messages as arrays whereas it might be more flexible and performant if messages were a separate collection.

As a rule of thumb, if you are going to keep adding data to a nested property or array and that nested property or array is likely to grow in time, you should make that a (set of) separate collection(s).

Of course there might be counterarguments to this based on certain query or app-to-db round-trip optimizations, but from the point of view of (a) db index/query performance and storage and (b) publications and reactivity, you should get better mileage with the separation.
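
To make the rule of thumb concrete, here is a small sketch of splitting an embedded messages array into documents for a separate collection. The field names `chatId`, `position` and `message`, and the `Messages` collection mentioned afterwards, are assumptions for illustration and not the original poster's schema.

```javascript
// Turn one chat document with an embedded `messages` array into
// a slim chat document plus one document per message.
function splitChat(chat) {
    const { messages = [], ...chatDoc } = chat
    const messageDocs = messages.map((message, index) => ({
        chatId: chat._id,  // reference back to the parent chat
        position: index,   // preserves the original ordering
        message: message,
    }))
    return { chatDoc, messageDocs }
}
```

With that shape, the last 40 messages become something like `Messages.find({chatId: id}, {sort: {position: -1}, limit: 40})`, and the chat document itself stays small no matter how long the conversation gets.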


You’re spot on – my chatroom documents are stored individually because chatrooms build up really quickly; I keep each 1-on-1 chat in a single document, as they tend not to get very long.

Why not treat 1 on 1 chats as special chat rooms with 2 people? This would simplify and generalize your database and codebase, allowing for better maintainability and scalability.

Unblocking publications helps a lot too.

For CRON jobs and webhooks we are migrating to AWS Lambda.

Redis oplog helps too.

Avoid resubscribing unnecessarily (cache your subscriptions instead).


Great overview, it exactly matches my experience.
Two more things:

  1. Avoid observe and observeChanges at all times. They are very unreliable (they crash) when making lots of changes. Create your own polling system with setInterval and Meteor.call.

  2. Implement paging with Meteor.call instead of limiting publications.
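
For the second point, a minimal sketch of paging through a method. The method name `campaignsPage`, the default page size and the sort field are hypothetical; the skip/limit arithmetic is the only real point here.

```javascript
// Translate a 1-based page number into Mongo-style find options.
function pageOptions(page, pageSize = 20) {
    // Fall back to page 1 for zero, negative or non-numeric input
    const safePage = Math.max(1, Math.floor(page) || 1)
    return {
        sort: { date: -1 },
        skip: (safePage - 1) * pageSize,
        limit: pageSize,
    }
}

// On a Meteor server this could back a method such as:
// Meteor.methods({
//     campaignsPage(page) {
//         return Campaigns.find({}, pageOptions(page)).fetch()
//     }
// })
```

The client then asks for one page at a time with `Meteor.call('campaignsPage', 2, callback)` and nothing stays subscribed in the background.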


What was the threshold for the observeChanges to stop working in your case?

I have to strongly disagree with this statement. All out-of-the-box reactivity features in Meteor rely on those, and they are also heavily tunable (including batching, polling, and more).

MDG and community contributors have had years to tune these to many common and edge cases that most of us are not even aware of. A (naive) implementation with setInterval is highly likely to be much less well thought out.

Granted, reactivity does constitute a natural bottleneck to high scaling but should beat homegrown polling any day.

PS: These are not fanboy remarks as I am well aware of meteor’s limits and in fact that’s why I suggested that the author should create this thread in the first place.


My tip: Enable the use of a CDN to deliver your app payload. Massive performance improvement if you have a lot of simultaneous first-time users.
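
One way to wire this up in Meteor is `WebAppInternals.setBundledJsCssPrefix`, which prefixes the URLs of the bundled JS and CSS. A minimal sketch, assuming a hypothetical `CDN_URL` environment variable; `enableCdn` is a made-up helper name.

```javascript
// `webAppInternals` is Meteor's WebAppInternals (from 'meteor/webapp'),
// passed in so the helper does not rely on a global.
function enableCdn(webAppInternals, cdnUrl) {
    if (!cdnUrl) return false // no CDN configured: keep serving from the app itself
    // Strip trailing slashes so asset paths don't end up with a double slash
    webAppInternals.setBundledJsCssPrefix(cdnUrl.replace(/\/+$/, ''))
    return true
}

// Server usage (CDN_URL is an assumed variable name):
// import { WebAppInternals } from 'meteor/webapp'
// Meteor.startup(() => enableCdn(WebAppInternals, process.env.CDN_URL))
```

As far as I know the prefix only covers the bundled JS and CSS, so images and other static files would still need their own URLs pointed at the CDN.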

Just curious, how many containers on Galaxy (and of what type) do you use to handle 120K+? Did you have to ask Galaxy Support to increase your container limits?


Isn’t it possible for the data not to be available on the server yet when you call Meteor.call in componentWillMount?

Absolutely! Which is why you wait until you get a response to set the state – and once you have that state, then you can render what you want.

    constructor(props) {
        super(props)
        this.state = { analytics: false }
    }

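    componentWillMount() {
        // The arrow function keeps `this` bound to the component
        Meteor.call('analyticsSummary', (error, response) => {
            if (error) return console.error(error) // leave the placeholder in place
            this.setState({analytics: response})
        })
    }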

    renderSummary = () => {
        var analytics = this.state.analytics
        return (
            <div>The analytics have loaded!</div>
        )
    }

    render() {
        return (
            <div>{this.state.analytics ? this.renderSummary() : '...'}</div>
        )
    }

Yep, that’s another great tip! I’ve set up CloudFront as a CDN for all my static assets (JS, CSS, and images).

So… I’ve got 2 “Double” containers running (2.0 ECU and 2 GB RAM each). I think the important distinction here is that I have 120k users – but they’re never all online at the same time. I think I peak around 250 active connections during prime hours. The containers seem to handle that fine because of the steps I’ve taken above (getting rid of unnecessary “publish” calls and offloading heavy tasks to a separate DigitalOcean droplet).


Indexes can backfire quickly if you overuse them. They can eat up your RAM and push the database into swap, and that’s not something you’ll love. So add them only when they’re really needed, not as a way to get away with writing poorly performing queries.


We would LOVE to get a “Performance & Scaling” section into the Guide. There was some brainstorming work started a while back around this (see https://github.com/meteor/guide/issues/95), but that work has stalled. If anyone is interested in helping kick start that work back up, please post your ideas, comments, suggestions, etc. on that issue thread. The initial goal is to get together a rough outline that represents what a “Performance and Scaling” section would look like. Once we have a rough outline in place, we can then start working on the specific sections (and hopefully even flag volunteers to work on those specific sections). There is a lot of work to do here for sure, but forum posts like this definitely show how invaluable it would be to have this information all in one place.


What was the threshold for the observeChanges to stop working in your case?

observeChanges initially works fine. But at a certain point in time it just stopped working. We handle triggers every couple of seconds, but a lot of clients are connected, so the servers handle multiple triggers a second.

We’ve done a lot of debugging, logging, Kadira, and fine-tuning, but it did not help. Looking at the Kadira graphs, the CPU suddenly just spiked from about 3% to 100%, and then reactivity died. Sometimes this happened after 3 days, sometimes after 10 days. The interesting fact is that the server did not really die, as it was still serving the front-end just fine. I think the oplog monitoring just died.

The only solution was to restart the server constantly. We didn’t manage to fix it, as it seems to be an issue in core Meteor functionality, so we decided to move to setInterval and Meteor.call, and since then it has all been running super smoothly for months.
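
The replacement code isn't shown in this thread, so the following is only a guess at its general shape: a small client-side poller built on `setInterval` and a method call. The `call` argument stands in for `Meteor.call`; `createPoller` and the method name in the usage comment are hypothetical.

```javascript
// Polls a method on a fixed interval and hands each successful response to onData.
function createPoller(call, methodName, intervalMs, onData) {
    let timer = null
    let inFlight = false

    function tick() {
        if (inFlight) return // skip this round if the previous call has not returned
        inFlight = true
        call(methodName, (error, response) => {
            inFlight = false
            if (!error) onData(response) // failed calls are simply retried next round
        })
    }

    return {
        tick,
        start() {
            if (timer === null) timer = setInterval(tick, intervalMs)
        },
        stop() {
            if (timer !== null) clearInterval(timer)
            timer = null
        },
    }
}

// Client usage:
// const poller = createPoller(
//     (name, callback) => Meteor.call(name, callback),
//     'analyticsSummary', 5000, (data) => console.log(data))
// poller.start()
```

The `inFlight` flag is the one design choice worth noting: without it a slow server would receive overlapping calls from the same client.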

And it was not a single case. We had multiple Galaxy servers dying because of observeChanges.


You might be hitting:

We’ve been looking at this issue a bit recently and have confirmed cases of it happening with Meteor 1.6.0.1. The issue is now pull-requests-encouraged if anyone is interested … 🙂


Hi @jasongrishkoff,
How do you manage sessions if you deploy Meteor to different servers?

Hi @haojia321 – whenever I deploy new client code it’s always to Galaxy, which seamlessly manages the deployment of new containers and ensures that any active users don’t have their current session disturbed. I also disable the “hot reload” that often comes with new code:

Meteor._reload.onMigrate(function() {
    return [false]
})

Let me know if I misunderstood the question 🙂
