Dropped DDP Calls Due To CPU Spikes? How to tell if Galaxy container hits 100% CPU momentarily?

Does anyone know how I would tell if a Galaxy container hits 100% for just a second or so?

I have a real-time game app with thousands of users and I'm seeing what appear to be dropped DDP messages: players say their app doesn't get updated and goes out of sync. It only happens for a few players, very intermittently, and I think I've traced it to moments of heavy usage when thousands of players hit the servers at the same time, such as when joining a game. The CPU gets spiky, but Meteor APM and Galaxy never show it hitting 100%. It usually hovers around 15% CPU usage with quick spikes to 40% - 60%, never even close to 100%.

Galaxy and Meteor APM show a general CPU usage chart, but I don't think it has the fidelity to show whether your container peaks just momentarily. For one, the server may get so busy it doesn't register itself at 100%, and two, the chart doesn't have enough detail to show a spike that lasts only a few seconds even if it could.

The same goes for Galaxy's charts. Even on the 5m setting it looks like it samples every five seconds, which could miss the CPU being momentarily at 100%. And it's usually well after the fact that I hear about issues, so I miss the five-minute window to view any CPU peaks in Galaxy.

It feels like we need better tools here. Are there any tools that give a true real-time CPU usage reading? I don't think you can remote into a Galaxy container to use command-line tools, right?
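One rough workaround I've been considering (just a sketch, not something Galaxy provides) is sampling CPU from inside the app itself with Node's process.cpuUsage(), since the dashboards seem to average away one-second spikes:

```js
import { Meteor } from 'meteor/meteor';

// Sample this process's own CPU time once per second and log any interval that
// gets close to a full core. Runs in the server bundle, independent of APM/Galaxy.
const SAMPLE_MS = 1000;
let lastUsage = process.cpuUsage();
let lastTime = process.hrtime.bigint();

Meteor.setInterval(() => {
  const usage = process.cpuUsage(lastUsage);        // delta since last sample (microseconds)
  const now = process.hrtime.bigint();
  const elapsedUs = Number(now - lastTime) / 1000;  // nanoseconds -> microseconds

  const percentOfOneCore = ((usage.user + usage.system) / elapsedUs) * 100;
  if (percentOfOneCore > 90) {
    console.warn(`CPU spike: ~${percentOfOneCore.toFixed(0)}% of one core in the last second`);
  }

  lastUsage = process.cpuUsage();
  lastTime = now;
}, SAMPLE_MS);
```

If the event loop is blocked, the interval itself will fire late, which is another sign that the container briefly maxed out.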

Has anyone had any experience with Meteor seeming to drop DDP calls? I'm not sure if the problem is the server not sending them, the clients not receiving them, or both. I use redis-oplog, and I'm pretty sure the problem isn't Redis, as its CPU and activity are very low even at peak times.

I was wondering: if it turns out that my temporary CPU spikes are causing the servers to drop DDP messages, thus breaking my app for a few users momentarily, would switching to raw AWS and using their burstable CPU containers be an approach to solving this? Does Galaxy perhaps already use these container types?

Perhaps this might be of interest to you:

I haven't used it myself, so I might be off base here. And if you're trying to catch an infrequent spike, it might be difficult to target the right moment with this.


It's extremely unusual for Meteor to drop DDP methods in its default configuration. The only way I can see that "Meteor" would drop them is if you had configured your methods to run without retries and the client briefly disconnects, which could happen if the server is at high CPU for the duration of a ping interval, which I believe is 1 minute.

In the default configuration the client will retry the method when the connection comes back up, assuming you didn't disable retries.

It feels more likely that something in your application code is causing this. If you were to call unblock in a method, the client would immediately register that it had run. Do you log your DDP traffic at all? Is it possible your clients are having connectivity issues?
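If you're not already logging it, something as simple as watching Meteor.status() on the client would show whether brief disconnects line up with the reports (a minimal sketch using the public reactive API):

```js
import { Meteor } from 'meteor/meteor';
import { Tracker } from 'meteor/tracker';

// Log every transition away from "connected" so it can be correlated with the
// moments players report going out of sync. In production you'd ship this to a
// method call or analytics endpoint rather than the console.
Tracker.autorun(() => {
  const { status, retryCount, reason } = Meteor.status();
  if (status !== 'connected') {
    console.log(
      `[ddp] ${new Date().toISOString()} status=${status} retries=${retryCount} reason=${reason || ''}`
    );
  }
});
```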


Are we sure about this? I use this.unblock() in several methods that have a return value, which in each case is actually returned to the client upon successful completion. Meaning that this.unblock() doesn't cause the effect you hinted at.

@see https://docs.meteor.com/api/methods.html#DDPCommon-MethodInvocation-unblock


For sure the value will be returned; I just think it might impact the retry behaviour. Though I could be wrong.

Hi @evolross, did you verify that messages are being dropped, or is it just a suspicion you have about the root cause of your issue?

I'm bringing this up because I have never seen a message being dropped, but I know there are some cases where Redis Oplog causes this. AFAIK the root cause of this issue with Redis is not known yet, but it usually happens in high-usage moments.

About your question on Galaxy metrics: we extract metrics from Docker, and they would show 100% or any other high number. The metrics remain available even when the container is really busy, so you can trust the metrics that you see on Galaxy.

You can always ask questions or bring issues like this to Galaxy support as well (support@meteor.com).


@znewsham I didn't even know you could disable method retries. How/where does one do that, just out of curiosity?

It's purely a suspicion. The app works perfectly fine literally 99% of the time; it's just these few intermittent cases, and I've witnessed it happen myself. So it shouldn't be a code issue if the app works for the large majority. Perhaps something like a this.unblock() goes out of whack under heavy server load, perhaps taking too long to respond and causing some sync issue on the client.

This tracks pretty closely with what I'm experiencing. Where are you reading about this? I'd love to research it more.

You can pass { noRetry: true } in the options argument of Meteor.apply.
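For example (joinGame is a placeholder method name here, and handling the error becomes your responsibility):

```js
import { Meteor } from 'meteor/meteor';

// Hypothetical client-side helper with retries disabled for this one call.
// By default, a method in flight when the connection drops is retried on
// reconnect; with noRetry it fails instead, so the error must be handled.
function joinGameNoRetry(gameId) {
  Meteor.apply('joinGame', [gameId], { noRetry: true }, (err, result) => {
    if (err) {
      // e.g. re-fetch game state or ask the player to retry manually
      console.error('joinGame failed:', err);
      return;
    }
    console.log('joined game:', result);
  });
}
```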


I would be remiss not to share our findings with the community. I was waiting for multiple users to confirm that our solution was production-ready for them.

That's right: when we faced this issue with redis-oplog, we decided to create our own "scalable" redis-oplog, which you can find here. We no longer face those CPU spikes or disconnects from Redis.

The root causes & solutions (we think):

  1. Each Redis signal received requires a DB data pull by the listening instance, so the more users you have, the more data is re-pulled from the DB. Our redis-oplog always sends the changed fields AND their values in the Redis signal.
  2. We diff before sending a signal to Redis and only send the fields that changed. If no fields have changed, we send nothing (see the sketch below).
  3. A number of Meteor.defer calls were added in the signal listeners to delay triggering observers.
  4. Meteor.defer was already present in the dispatching of Redis signals, but that was not enough.
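To illustrate the idea behind points 1 and 2 (just a sketch of the approach, not the package's actual internals; notifyObservers is a hypothetical placeholder):

```js
import { Meteor } from 'meteor/meteor';

// Build the Redis payload from a before/after pair of the document: only the
// changed fields are sent, with their new values, so listening instances don't
// have to re-read the document from MongoDB. (JSON.stringify is a crude field
// comparison, used only to keep the sketch short.)
function buildRedisPayload(before, after) {
  const changedFields = {};
  Object.keys(after).forEach((key) => {
    if (JSON.stringify(before[key]) !== JSON.stringify(after[key])) {
      changedFields[key] = after[key];
    }
  });
  if (Object.keys(changedFields).length === 0) {
    return null; // nothing changed, so nothing gets published at all
  }
  return { _id: after._id, fields: changedFields };
}

// On a listening instance, apply the shipped values directly instead of pulling
// the document again, and defer the observer work off the hot path.
function onRedisMessage(payload) {
  if (!payload) return;
  Meteor.defer(() => {
    notifyObservers(payload._id, payload.fields); // hypothetical helper
  });
}
```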

Hope that helps others solve this issue as they scale up.

PS: Please read the README carefully if you decide to use this new redis-oplog


A big kudos to you and to everyone else involved – this is a major improvement over the original version, and fixes lots of severe problems.

One question for now though:

We want to read from MongoDB secondaries to scale faster.

I am a bit worried about this feature. Secondaries in a MongoDB replica set are strictly eventually consistent, with no guarantees whatsoever as to when the consistent state is ultimately reached. Couldn't reading from a secondary lead to a scenario where an update (or insert, etc.) sometimes isn't picked up, because the member used for reading has not yet had the chance to receive and apply the latest changes from the primary?


Hi @peterfkruger,

Thanks for the question. You are right, race conditions could be a problem. There is a very, very small chance that you subscribe to a data stream right as it gets changed by another instance; in other words, you are too late to get the Redis change signal and too early to get the updated data from the secondary DB instances.

There are a number of ways around this:

  1. You don't have to read from secondaries; the fact that we optimised our app for secondaries doesn't mean you have to. The scalable redis-oplog works just fine regardless.
  2. You can force a read of important data from the primary (see the sketch after this list).
  3. Mongo has settings to make secondaries more consistent with the primary. Not sure if that works here (see #5 below).
  4. We were thinking about adding a Lua script in Redis that holds the data for a short time and sends it to whoever subscribes, to make sure they see the latest changes (in case they have not yet made it to the secondary DB instances).
  5. An 'unclean' approach is to delay all Redis signals to give secondaries the chance to update.
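For point 2, something along these lines should work (a minimal sketch assuming a hypothetical Games collection; rawCollection() exposes the Node MongoDB driver collection, whose findOne accepts a readPreference option):

```js
import { Mongo } from 'meteor/mongo';

const Games = new Mongo.Collection('games'); // hypothetical collection

// Read one mission-critical document straight from the primary, bypassing any
// secondary read preference, at the cost of putting that load back on the primary.
async function getGameFromPrimary(gameId) {
  return Games.rawCollection().findOne(
    { _id: gameId },
    { readPreference: 'primary' }
  );
}
```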

However, the cost of worrying about race conditions is huge: you end up overloading everything (your Meteor instances keep re-pulling data, and your DB serves frequent pulls of the same data). If you can design your app to be resilient, you will benefit greatly and scale faster.


Excellent, thank you. As long as there are viable strategies to mitigate the problem, I'm happy. In particular, being able to force a read from the primary in mission-critical cases, where an inconsistency due to a possible race condition needs to be ruled out completely, seems very important.

It would be great if the documentation could at some point outline how to do this properly, including a warning that falling back to reading from the primary could potentially impede scalability.
