Performance and stability problems at scale

Huh?

It cannot get more realtime than firing off a method call just when you need fresh data (e.g. when you switch to a new route).

On the contrary, I do believe that pub/sub is slower every freaking time!

That was our experience when we made the switch, and we never looked back at pub/sub ever since. Just seeing how much faster our app is, plus the positive comments from our users, is reason enough.

I fully agree with that view, excellent quote

4 Likes

I would avoid Meteor for real time at scale.

My experiences from 5 years ago:

Since then I’ve moved away from Meteor.

It cannot get more realtime than firing off a method call just when you need fresh data (e.g. when you switch to a new route).

That is not what is meant by “realtime”. In the context of Meteor and web applications, we understand “realtime” as getting changes to the domain state onto the client as soon as possible. Loading data on route change is trivial and you don’t need DDP for that; it has been done since the dawn of the web.

But I would also argue that you often don’t need the real-time nature of pub/sub, as it comes with a significant cost.

1 Like

Yeah, we do that too. But standard Meteor is still very memory/CPU intensive and benefits from better hardware. Horizontal scaling with Meteor needs a lot of work; e.g. when using redis-oplog, you need to be careful when you regularly do a lot of updates, as it will inform ALL pods/containers about every update unless you use channels. But having channels just adds another layer of complexity. So yes, of course, you can make it work, but at what cost? The very existence of redis-oplog is an alarming sign in my opinion; it should have been integrated into Meteor itself or made obsolete by a better approach in Meteor core.
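For readers who haven’t used them: channels in cultofcoders:redis-oplog scope the Redis messages so that only containers interested in a given channel are notified. A rough sketch, with made-up collection and channel names (check the package README for the exact option names):

```js
import { Meteor } from 'meteor/meteor';
import { Tasks } from '/imports/api/tasks'; // hypothetical collection

// Readers: listen only on this board's channel instead of the whole collection.
Meteor.publish('tasksByBoard', function (boardId) {
  return Tasks.find({ boardId }, { channel: `board::${boardId}` });
});

// Writers: push the change to the same channel so only interested containers are notified.
Meteor.methods({
  'tasks.markDone'(taskId, boardId) {
    Tasks.update(taskId, { $set: { done: true } }, { channel: `board::${boardId}` });
  },
});
```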

Also, having more instances adds load to MongoDB, which will be your next bottleneck, as @elie’s blog post also states.

I just wanted to give a warning, because I went through a lot of pain and I am a bit worried that reports of scaling problems with Meteor still pop up regularly. Tiny should tackle these problems in my opinion and at least make clear statements about what to do and what not to do.

2 Likes

Just a note: the disableOplog: true option is a flag to be passed to collection.find. @radekmie, have you already tried this flag in said publications? I think the described use case is similar to yours:

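For reference, a minimal sketch of how the flag is passed as a cursor option (the collection and publication names are made up; pollingIntervalMs is the related knob for tuning the poll-and-diff fallback):

```js
import { Meteor } from 'meteor/meteor';
import { Events } from '/imports/api/events'; // hypothetical collection

Meteor.publish('recentEvents', function () {
  // disableOplog: true makes this cursor use poll-and-diff instead of oplog tailing,
  // which can be cheaper when the underlying collection is written to very often.
  return Events.find(
    { createdAt: { $gte: new Date(Date.now() - 60 * 60 * 1000) } },
    { disableOplog: true, pollingIntervalMs: 10000 }
  );
});
```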
2 Likes

Wow, that’s a lot of feedback! Thank you all! I’ll allow myself to reply to all of them in one comment, so as not to spam too much. I’ll also skip some parts of your comments if the answer is already there.

@minhna
You can load the documents via methods, then insert/update them in minimongo. Then the other parts will be the same.

That’d require a lot of client-side changes to make sure that all of the code is using the local, method-populated collection. Again, an interesting idea, though a hard one to implement in a huge app.
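For reference, a minimal sketch of the method-plus-local-collection approach @minhna describes, with made-up method and collection names:

```js
import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';

// Client-only collection (null name) acting as the local cache instead of a subscription.
export const LocalTasks = new Mongo.Collection(null);

export function loadBoard(boardId) {
  Meteor.call('tasks.fetchForBoard', boardId, (error, tasks) => {
    if (error) return console.error(error);
    tasks.forEach(({ _id, ...fields }) => {
      // upsert so repeated fetches refresh existing documents instead of duplicating them
      LocalTasks.upsert(_id, { $set: fields });
    });
  });
}
```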

@vooteles
Interesting topic. You might want to give this package a try: adtribute/pub-sub-lite

I saw that some time ago. It was also proposed by @jkuester. The problem is that it does not support any real-time updates. I may, however, experiment with polling using this method.

@renanccastro
I did this package some time ago aiming to solve this exact same issue.

I’ll definitely try it if we ever end up using redis-oplog!

@znewsham

  1. if you send a single very large message, the server can hang while it processes it […]
  2. garbage collection […]
  3. BSON parsing […]
  1. We have the exact opposite problem - we do send a lot of small messages. There are some large ones, but they do not correlate with the CPU spikes at all.
  2. We are at ~12% (~1GB) of RAM almost constantly. That’s far under the default limit (~1.3GB) and we do use --max-old-space-size=3072 anyway.
  3. It is the case with these larger messages from 1., originating from our extensive usage of aggregations. Again, it does not correlate with the CPU spikes.

@don1231
Here’s a bit of a crazy idea. Not sure if this will work since I’ve never tested this.

I’m not sure how to implement it at all. Meteor has to populate the merge box on the server anyway, and it’s probably not possible to do that from the client side (from the already downloaded data). If you have an idea on how to do it, let me know! (We’re hosted on Galaxy, so a separate cluster may be a bit of a problem.)

@efrancis
We really would need to see your Meteor APM or Monti APM info to help debug this type of thing. […]
Does one user logging in slow the app for all users? […]
What is your data model? […] If you could share your full data model, […].
As a side note, Meteor likes big containers because CPU usage can be high, so I’ve found that a few large servers performs better than a lot of small servers.

  1. Of course, I’ll attach them at the end.
  2. It doesn’t make the app slow; it makes the app completely unresponsive for a couple of seconds. And no, not for all users, only for those on the same container.
  3. It’s highly normalized, with occasional denormalization of rarely changing data. I can’t share the full data model because of business agreements, and it’s also far too large to analyze here (>60 collections).
  4. We are currently using Galaxy’s Octa servers (8.3 ECU) - these are already quite strong. However, I do agree with the comments from Galaxy performance vs AWS - it feels like these Galaxy units are somehow limited.

@macrozone
[…] Is it because a lot of users log in at the same time or because a lot of users subscribe at the same time to a lot of documents? […]
[…] Careful when updating a lot of documents at the same time regularly. […] Solution is to use disableOplog: true on subscriptions that need this data. […]

  1. Neither. The problem occurs when certain users (with a relatively high number of documents) log in.
  2. We’ve never had problems with that, as the number of inserts/updates/deletes in our case is relatively low.
  3. I’ve tried enabling disableOplog: true for all publications. It did not affect the loading time in any way, on either the client or the server.

Here’s a couple of APM screens (highly censored, sorry).

Screenshots

We have quite a few such weird method and publication traces. My guess is that the CPU is so busy that it actually affects the APM metrics.

A standard one looks fine.

Data from today.

Data from yesterday.

Broader scale.


2 Likes

Lots of interesting ideas but I want to suggest you focus on finding the root cause before trying solutions.

This will be difficult, but try to put together a test case that reproduces this reliably in a test environment. Use Selenium WebDriver to simulate multiple clients, perhaps running on AWS instances. Once you have the issue reproducing in the test environment, focus on profiling the server and client and finding what part is so slow on login. You may need to use a process of elimination to see whether removing a part makes things start working as expected. When you have the root cause, solving it will then be possible.
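If it helps, here’s a rough sketch of that kind of load simulation with the selenium-webdriver npm package (the URL, selectors and credentials are placeholders):

```js
const { Builder, By, until } = require('selenium-webdriver');

async function simulateUser(i) {
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    await driver.get('https://staging.example.com');
    await driver.findElement(By.css('#login-email')).sendKeys(`loadtest+${i}@example.com`);
    await driver.findElement(By.css('#login-password')).sendKeys('test-password');
    await driver.findElement(By.css('#login-button')).click();
    // Wait for something that only renders once the post-login subscriptions are ready.
    await driver.wait(until.elementLocated(By.css('.dashboard')), 30000);
  } finally {
    await driver.quit();
  }
}

// Fire a batch of concurrent "users" to try to reproduce the login spike.
Promise.all(Array.from({ length: 50 }, (_, i) => simulateUser(i)))
  .then(() => console.log('load test finished'))
  .catch(console.error);
```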

2 Likes

I’ve found it already - logging in. If I remove publications entirely, it works like a charm, even with hundreds of users logging in at once.

3 Likes

What about the other way around? Did you try removing the logins and keeping just the publications?

What do you mean by “remove logins”? If no users are logging in, the system is stable and performs well.

1 Like

I’ve often seen ‘login’ as the biggest spike when viewing our methods in Monti but never investigated too much. Could it be that login is just blocking everything else while it’s running? I’m not familiar with how it’s implemented, but in Monti it shows as a method, so it seems likely that by default it would block. Might there be a way to make this method unblock and allow the rest of the system to keep functioning while processing a login?

Failing that, could login be handled in two stages? In the first stage you load only the absolutely necessary stuff like permissions/roles, which shows a ‘logging in’ page, and once those subscriptions have completed it moves you to the main view, which then triggers the rest of the subs.
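A minimal sketch of that two-stage idea on the client, assuming a lightweight ‘userRolesAndPermissions’ publication exists (all names here are made up):

```js
import { Meteor } from 'meteor/meteor';
import { Tracker } from 'meteor/tracker';

// Stage 1: only the bare minimum needed to render the "logging in" screen.
const essentials = Meteor.subscribe('userRolesAndPermissions');

Tracker.autorun((computation) => {
  if (!essentials.ready()) return; // stay on the "logging in" page until stage 1 is done
  computation.stop();

  // Stage 2: everything else, triggered only once the essentials have arrived.
  Meteor.subscribe('dashboardData');
  Meteor.subscribe('notifications');
});
```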

1 Like

Do you have a lot more than 300k lines of code? Because that’s how much we had when we went through it. That was just two people BTW, not a large team!

1 Like

You could build a test version of your app that requires no login at all. Then you could use Selenium or a similar product to simulate the load created by simultaneous users that subscribe to data just like real users would.

2 Likes

Yes, it could be. Login with username + password leverages bcrypt, which is by design heavy on the CPU. Given that Node.js is single-threaded, such a long operation blocks the main thread. Theoretically this operation could take place in worker threads, or even in a remote service that leverages multiple CPUs, but no such thing is built into Meteor.
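Purely to illustrate the idea (this is not how accounts-password is implemented, and bcryptjs here is just a stand-in), offloading the comparison to a worker thread could look roughly like this:

```js
const { Worker, isMainThread, parentPort } = require('worker_threads');
const bcrypt = require('bcryptjs'); // pure-JS bcrypt; blocks whichever thread runs it

if (isMainThread) {
  const worker = new Worker(__filename);

  // The main event loop stays free while the worker grinds through the comparison.
  function checkPassword(password, hash) {
    return new Promise((resolve) => {
      worker.once('message', resolve);
      worker.postMessage({ password, hash });
    });
  }

  module.exports = { checkPassword };
} else {
  parentPort.on('message', ({ password, hash }) => {
    parentPort.postMessage(bcrypt.compareSync(password, hash));
  });
}
```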

1 Like

From this discussion and those APM screenshots, it sounds like you need to reduce the number of documents being subscribed to, and ideally the overall number of subscriptions. This really seems like a data model and/or over-subscribing problem that there isn’t some magical fix for.

It’s hard to give a real solution because we can’t see your full data model and we can’t know your business requirements for why you may need to load 1000+ documents on page load, but I will say that there is probably a better way to handle this. I would step back and try to re-assess what data you really need to load, and how the UX would be impacted if some things were not loaded immediately. There are very few legitimate reasons to need that many documents subscribed to when a user logs in.

I would try to think about what data loading can be delayed until after the user does some interaction, what data can be paginated or loaded in segments, what can be lazy loaded on demand and not upfront, what can be moved to polling instead of live subscriptions, etc. You may even need to change your UX a little bit to accommodate what you’re trying to do, but I just don’t see any way you’re going to magically get thousands of documents subscribed by each user without a serious change in your data model.

The only thing that is easy to try and may have a big impact is the cultofcoders:redis-oplog package. If you haven’t yet, I would definitely get a preprod environment up, populate your preprod db with a prod db dump so that you can recreate the exact same issue your users are seeing, and then try enabling redis-oplog instead of the Mongo oplog. It’s simple to try and may yield good results; if not, only a few hours are wasted.

4 Likes

In the meantime there’s a new fork of the original cultofcoders:redis-oplog: ramezrafla/redis-oplog, which introduces major new performance optimizations.

Results

  • We reduced the number of meteor instances by 3x
  • No more out-of-memory errors and CPU spikes in Meteor – more stable load that slowly goes up with the number of users
  • Faster updates (including to client) given fewer DB hits and less data sent to redis (and hence, the other meteor instances’ load is reduced)
  • We substantially reduced the load on our DB instances – from 80% to 7% on primary (secondaries went up a bit, which is fine as they were idle anyway)

I would definitely give this a shot if I had a major scalability / performance problem with Meteor pub/sub.

7 Likes

Do you store any custom data on the user objects? If you store lots of data on the user object, then Meteor core pulls it all from the db when a user logs in: Meteor Guide - Preventing unnecessary data retrieval.

I provided a work-around for this in Meteor 1.10: https://github.com/meteor/meteor/pull/10818.
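For anyone landing here later, the option from that PR is set via Accounts.config; a sketch with made-up field names:

```js
import { Accounts } from 'meteor/accounts-base';

Accounts.config({
  // Exclude heavy custom fields so the accounts system doesn't fetch them
  // from the users collection on every login.
  defaultFieldSelector: {
    'profile.largeCachedStats': 0,
    activityLog: 0,
  },
});
```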

3 Likes

Thanks @peterfkruger

@radekmie our version of redis-oplog was specifically designed to handle large loads. Give it a shot (Note: the only thing missing is geospatial updates, which we will look into shortly).

Most production apps put a DB cache layer between their application and the DB; our redis-oplog does that natively, caching the data subscribed to. It should be an easy swap with the existing redis-oplog.

@wildhart,
You are right, that many data pulls do occur. Our redis-oplog caches user data, so those pulls are no longer necessary.

4 Likes

Does it work with Redis Cluster? I’m using cult-of-coders/redis-oplog; it comes with redis 2.8 and doesn’t support Redis Cluster.

I’ll do another round of replies and a bit of a summary.

@marklynch
I’ve often seen ‘login’ as the biggest spike when viewing our methods in Monti but never investigated too much. […]
So in the first stage you load only the absolutely necessary stuff like permissions/roles which loads a ‘logging in’ page and once those subscriptions have completed it moves you to the main view which then triggers the rest of the subs ?

  1. We can see that as well, in many of our apps. But I think it’s not really true - it’s just how the APM works; it kind of “merges” the following subscriptions and method calls into the login. Or at least that’s how it looks in our case.
  2. That’s something we went with and it really helped. I’ll write a little bit more at the end of my post.

@a4xrbj1
Do you have a lot more than 300k lines of code? […]

I was curious myself. And we are almost there: 335k lines in total, 295k without blank lines and comments.

@peterfkruger
You could build a test version of your app that requires no login at all. […]
Login with username+password leverages bcrypt, which is by design heavy on the CPU. […]

  1. It won’t work, as it’d either require rewriting a big chunk of the app or calling the login method automatically, which doesn’t really make sense, as that’s exactly what Meteor does with login tokens.
  2. I’ve been working with Meteor for almost 6 years now and I’ve never seen bcrypt taking a significant amount of CPU, even in lightweight apps with thousands of users logging in at once. But maybe it’s just me.

@efrancis
From this discussion and those APM screenshots it sounds like you need to reduce the amount of documents being subscribed to, and ideally the overall number of subscriptions. This really seems like a data model and/or over-subscribing problem that there isn’t some magical fix for.

We know that, but as I said, we’d rather look for anything that could help us in the meantime. And no, we weren’t looking for a “magical fix”, but rather a temporary workaround. In the end, loading times are not a problem – unresponsive servers are.

@wildhart
Do you store any custom data on the user objects? […]

No, not much. Only the “usual Meteor stuff” and some information about the tenancy.

@ramez
@radekmie our version of redis-oplog was specifically designed to handle large loads. […]

As I said, we’re already planning to use such a package and yes, we’ll try yours as well. Thanks for sharing!


I think we’re fine, at least for the time being. What helped was some scheduling and throttling of the publications. To be exact, we have quite a few places where we do several Meteor.subscribe calls at once (up to 15!). Previously, we waited for all of them to be .ready(). Now, we do not call the next subscribe until all previous subscriptions are ready. It made loading times longer, but spreading the work like that keeps our servers responsive at all times.
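For anyone wanting to do something similar, here’s a minimal sketch of that kind of sequential subscribing (the publication names and the promise wrapper are illustrative, not our actual code):

```js
import { Meteor } from 'meteor/meteor';
import { Tracker } from 'meteor/tracker';

// Resolve once a single subscription reports ready.
function subscribeAndWait(name, ...args) {
  return new Promise((resolve) => {
    const handle = Meteor.subscribe(name, ...args);
    Tracker.autorun((computation) => {
      if (handle.ready()) {
        computation.stop();
        resolve(handle);
      }
    });
  });
}

// Instead of firing all subscriptions at once, start the next one
// only after the previous one is ready.
async function subscribeSequentially(names) {
  const handles = [];
  for (const name of names) {
    handles.push(await subscribeAndWait(name));
  }
  return handles;
}

subscribeSequentially(['profile', 'projects', 'tasks', 'comments']);
```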

Thank you all for your time! I hope the entire community will benefit from the ideas (and packages) shared in this thread!

3 Likes