Some scaling lessons I've learned growing to 120k+ users

serkandurusoy · December 18, 2017, 7:46am

It looks like your overall database schema design might be suffering from a problem similar to the one with the profile field on users.

It looks like the chats collection holds messages as arrays whereas it might be more flexible and performant if messages were a separate collection.

As a rule of thumb, if you are going to keep adding data to a nested property or array and that nested property or array is likely to grow in time, you should make that a (set of) separate collection(s).

Of course, there may be counterarguments based on certain query or app-to-database round-trip optimizations. But from the point of view of (a) database index/query performance and storage and (b) publications and reactivity, you should get better mileage from the separation.
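A minimal sketch of the separation described above, using plain JavaScript objects rather than real Meteor collections. The field names (`messages`, `chatId`, `createdAt`) and the helper `splitChat` are hypothetical and only illustrate the shape of the data before and after the split.

```javascript
// Hypothetical "before" shape: messages embedded in the chat document.
const embeddedChat = {
  _id: 'chat1',
  title: 'General',
  messages: [
    { userId: 'u1', text: 'hi', createdAt: 1 },
    { userId: 'u2', text: 'hello', createdAt: 2 },
  ],
}

// Split one embedded chat into a small chat document plus one document
// per message, each pointing back to its chat via chatId.
function splitChat(chat) {
  const { messages = [], ...chatDoc } = chat
  const messageDocs = messages.map((message, index) => ({
    _id: `${chat._id}-${index}`,
    chatId: chat._id,
    ...message,
  }))
  return { chatDoc, messageDocs }
}

const { chatDoc, messageDocs } = splitChat(embeddedChat)
console.log(chatDoc)     // the chat document, without the messages array
console.log(messageDocs) // two documents, each with chatId: 'chat1'
```

With this shape, the chat document stops growing, and an index on `chatId` (plus `createdAt` for ordering) would let a publication or method fetch only the most recent messages of one chat.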

jasongrishkoff · December 18, 2017, 8:35am

You’re spot on: messages in my chatrooms are stored as individual documents because chatrooms build up really quickly. I keep each 1-on-1 chat in a single document, as those tend not to get very long.

serkandurusoy · December 18, 2017, 9:09am

Why not treat 1 on 1 chats as special chat rooms with 2 people? This would simplify and generalize your database and codebase, allowing for better maintainability and scalability.
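One way to picture this, as a sketch only: a 1-on-1 chat becomes a room whose member list has exactly two entries. The `memberIds` and `isDirect` fields and the `directRoomKey` and `makeDirectRoom` helpers are assumptions for illustration, not part of the schema discussed in this thread.

```javascript
// Build a key that is the same no matter which of the two users starts
// the conversation, so each pair of users ends up with a single room.
function directRoomKey(userIdA, userIdB) {
  return [userIdA, userIdB].sort().join(':')
}

// A direct chat is just a room document with two members.
function makeDirectRoom(userIdA, userIdB) {
  return {
    _id: directRoomKey(userIdA, userIdB),
    isDirect: true,
    memberIds: [userIdA, userIdB].sort(),
  }
}

const room = makeDirectRoom('bob', 'alice')
console.log(room._id) // alice:bob
```

Messages for direct chats and group rooms can then live in the same messages collection and go through the same publications and methods.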

raphaelarias · December 18, 2017, 12:10pm

Unblocking publications helps a lot too.

For CRON jobs and webhooks we are migrating to AWS Lambda.

Redis oplog helps too.

Avoid resubscribing to the same data over and over unnecessarily (cache your subscriptions).
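A rough sketch of what caching subscriptions can look like: keep one handle per distinct subscription and stop it only after it has gone unused for a while. The `subscribe` argument stands in for `Meteor.subscribe`; `createSubscriptionCache`, `acquire`, `release`, and `expireAfterMs` are hypothetical names, not the API of any particular package.

```javascript
function createSubscriptionCache(subscribe, expireAfterMs) {
  const entries = new Map()
  const keyFor = (name, args) => JSON.stringify([name, args])

  return {
    // Reuse the existing handle when the same subscription is requested again.
    acquire(name, ...args) {
      const key = keyFor(name, args)
      let entry = entries.get(key)
      if (!entry) {
        entry = { handle: subscribe(name, ...args), users: 0, timer: null }
        entries.set(key, entry)
      }
      clearTimeout(entry.timer) // cancel a pending stop, if any
      entry.timer = null
      entry.users += 1
      return entry.handle
    },

    // Stop the subscription only after it has been unused for expireAfterMs.
    release(name, ...args) {
      const key = keyFor(name, args)
      const entry = entries.get(key)
      if (!entry || entry.users === 0) return
      entry.users -= 1
      if (entry.users === 0) {
        entry.timer = setTimeout(() => {
          entry.handle.stop()
          entries.delete(key)
        }, expireAfterMs)
      }
    },
  }
}

// Usage with a fake subscribe function that counts how often it is called.
let subscribeCalls = 0
const fakeSubscribe = () => {
  subscribeCalls += 1
  return { stop() {} }
}

const cache = createSubscriptionCache(fakeSubscribe, 60 * 1000)
cache.acquire('chatMessages', 'room1')
cache.release('chatMessages', 'room1') // e.g. the user navigates away
cache.acquire('chatMessages', 'room1') // and comes back within a minute
console.log(subscribeCalls) // 1
```

The trade-off is that the server keeps publishing data for up to `expireAfterMs` after the last consumer has gone, so the expiry should be kept short for heavy publications.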

satya · December 18, 2017, 1:19pm

Great overview, it exactly matches my experience.
Two more things:

Avoid observe and observeChanges at all times. They are very unreliable (they crash) when making lots of changes. Create your own polling system with setInterval and Meteor.call instead.
Implement paging with Meteor.call instead of limiting publications.
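To illustrate the paging tip, here is a sketch of the query options a paging method could build. `pageOptions`, `maxPageSize`, and the `createdAt` sort field are hypothetical. In a Meteor method, the result would typically be passed as the options argument of `collection.find(selector, options)`, which accepts `sort`, `skip`, and `limit`.

```javascript
// Translate a 1-based page number into Mongo-style find options.
// Arguments coming from the client are untrusted, so both are clamped.
function pageOptions(page, pageSize, maxPageSize = 100) {
  const safePage = Math.max(1, Math.floor(Number(page)) || 1)
  const safeSize = Math.min(
    maxPageSize,
    Math.max(1, Math.floor(Number(pageSize)) || 1)
  )
  return {
    sort: { createdAt: -1 }, // newest first
    skip: (safePage - 1) * safeSize,
    limit: safeSize,
  }
}

console.log(pageOptions(3, 20)) // { sort: { createdAt: -1 }, skip: 40, limit: 20 }
```

The client then asks for one page at a time through a method call and keeps the result in local state, so no observer has to be maintained on the server for that list.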

raphaelarias · December 18, 2017, 1:41pm

What was the threshold for the observeChanges to stop working in your case?

serkandurusoy · December 18, 2017, 2:01pm

I have to strongly disagree with this statement. All out-of-the-box reactivity features in Meteor rely on those, and they are also heavily tunable (including batching, polling, and more).

MDG and community contributors have had years to tune these to many common and edge cases that most of us are not even aware of. A (naive) implementation with setInterval is highly likely to be much less well thought out.

Granted, reactivity does constitute a natural bottleneck to high scaling, but it should beat homegrown polling any day.

PS: These are not fanboy remarks, as I am well aware of Meteor’s limits, and in fact that’s why I suggested that the author create this thread in the first place.

evolross · December 19, 2017, 9:15pm

My tip: Enable the use of a CDN to deliver your app payload. Massive performance improvement if you have a lot of simultaneous first-time users.

Just curious, how many containers on Galaxy (and of what type) do you use to handle 120K+? Did you have to ask Galaxy Support to increase your container limits?

helloncanella · December 19, 2017, 10:01pm

Isn’t it possible that the data won’t be available yet when you call Meteor.call in componentWillMount?

jasongrishkoff · December 20, 2017, 7:23am

Absolutely! Which is why you wait until you get a response to set the state – and once you have that state, then you can render what you want.

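    import React from 'react'
    import { Meteor } from 'meteor/meteor'

    // The component name is illustrative.
    class AnalyticsSummary extends React.Component {
        constructor(props) {
            super(props)
            // false means "not loaded yet"
            this.state = { analytics: false }
        }

        // Newer React versions deprecate componentWillMount;
        // componentDidMount works the same way here.
        componentWillMount() {
            Meteor.call('analyticsSummary', (error, response) => {
                if (error) {
                    console.error(error)
                    return
                }
                this.setState({ analytics: response })
            })
        }

        renderSummary = () => {
            // Use this value to render the actual summary
            const analytics = this.state.analytics
            return (
                <div>The analytics have loaded!</div>
            )
        }

        render() {
            return (
                <div>{this.state.analytics ? this.renderSummary() : '...'}</div>
            )
        }
    }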

jasongrishkoff · December 20, 2017, 7:29am

Yep, that’s another great tip! I’ve set up CloudFront as a CDN for all my static assets (JS, CSS, and images).

So… I’ve got 2 “Double” containers running (2.0 ECU and 2 GB RAM each). I think the important distinction here is that I have 120k users, but they’re never all online at the same time. I think I peak around 250 active connections during prime hours. The containers seem to handle that fine because of the steps I’ve taken above (getting rid of unnecessary “publish” calls and offloading heavy tasks to a separate DigitalOcean container).

marxo · December 20, 2017, 1:30pm

Adding indexes can backfire quickly if you overuse them. They can eat up your RAM and push the database into swap, and that’s not something you’ll love. So add an index only when it is really needed, not as a way to get away with writing poorly performing queries.

hwillson · December 20, 2017, 5:04pm

We would LOVE to get a “Performance & Scaling” section into the Guide. There was some brainstorming work started a while back around this (see https://github.com/meteor/guide/issues/95), but that work has stalled. If anyone is interested in helping kick start that work back up, please post your ideas, comments, suggestions, etc. on that issue thread. The initial goal is to get together a rough outline that represents what a “Performance and Scaling” section would look like. Once we have a rough outline in place, we can then start working on the specific sections (and hopefully even flag volunteers to work on those specific sections). There is a lot of work to do here for sure, but forum posts like this definitely show how invaluable it would be to have this information all in one place.

satya · December 21, 2017, 8:20am

> What was the threshold for the observeChanges to stop working in your case?

observeChanges initially worked fine, but at some point it just stopped working. We handle triggers every couple of seconds, but a lot of clients are connected, so the servers handle multiple triggers a second.

We’ve done a lot of debugging, logging, Kadira monitoring, and fine-tuning, but it did not help. Looking at the Kadira graphs, the CPU suddenly spiked from about 3% to 100%, and then reactivity died. Sometimes this happened after 3 days, sometimes after 10 days. Interestingly, the server did not really die, as it was still serving the front end just fine. I think the oplog monitoring just died.

The only solution was to restart the server constantly. We didn’t manage to fix it, as it seems to be an issue in core Meteor functionality, so we decided to move to setInterval and Meteor.call, and since then everything has been running super smoothly for months.

And it was not a single case. We had multiple Galaxy servers dying because of observeChanges.
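For readers wondering what such a replacement might look like, here is a minimal sketch of polling with setInterval. It is not the code described in this post. The `call` argument stands in for `Meteor.call`, and `startPolling` and `'latestTriggers'` are hypothetical names. The in-flight guard is an extra assumption: it skips a tick while the previous call is still pending, so slow responses do not pile up.

```javascript
function startPolling(call, methodName, intervalMs, onResult) {
  let inFlight = false

  const tick = () => {
    if (inFlight) return // the previous call has not answered yet
    inFlight = true
    call(methodName, (error, result) => {
      inFlight = false
      if (error) {
        console.error('poll failed:', error)
        return
      }
      onResult(result)
    })
  }

  tick() // fetch once right away
  const timer = setInterval(tick, intervalMs)
  return () => clearInterval(timer) // call this to stop polling
}

// Usage with a fake call function that answers synchronously.
const received = []
const fakeCall = (name, callback) => callback(null, { method: name })
const stopPolling = startPolling(fakeCall, 'latestTriggers', 5000, (result) => {
  received.push(result)
})
stopPolling()
console.log(received.length) // 1
```

The returned function should be called when the view goes away (or the user logs out), otherwise the interval keeps hitting the server.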

hwillson · December 21, 2017, 10:39am

You might be hitting:

We’ve been looking at this issue a bit recently and have confirmed cases of it happening with Meteor 1.6.0.1. The issue is now pull-requests-encouraged if anyone is interested …

haojia321 · December 24, 2017, 4:39pm

Hi @jasongrishkoff,
How do you manage sessions if you deploy Meteor to different servers?

jasongrishkoff · December 24, 2017, 4:53pm

Hi @haojia321 – whenever I deploy new client code it’s always to Galaxy, which seamlessly manages the deployment of new containers and ensures that any active users don’t have their current session disturbed. I also disable the “hot reload” that often comes with new code:

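    // Note the underscore: this is a private Meteor API.
    // Returning [false] reports that the client is not ready to migrate,
    // which blocks the automatic reload.
    Meteor._reload.onMigrate(function() {
        return [false]
    })

Let me know if I misunderstood the question.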

hayk · December 24, 2017, 5:00pm

I have never seen this piece of code. Not even in the docs, though maybe I overlooked it. Could you explain it in more detail or provide some links?

jasongrishkoff · December 24, 2017, 5:13pm

Sure, I can’t quite remember where I found it, but I believe I Googled something along the lines of “disable automatic reload in Meteor”.

Slind · January 10, 2018, 12:30pm

Interesting, I’m already doing all of these things (besides Kadira) and using the same packages.