Meteor IO usage insane

I ran a fairly intensive database update today (a few million records inserted/updated over about 15 minutes) and 4 out of the 8 servers that observe this database "hung" for about 30 minutes. Two of these servers do literally no other work - no connections, they are just Meteor servers set up to watch the DB (we're in the process of moving our workload to them).

They didn't crash (or they would have restarted) and there were no memory issues. CPU usage was "highish" (50-60% initially, dropping to 15% after about 10 minutes), but IO usage was insane. Running iotop suggested that the two servers that should have been idle were reading/writing to disk at 20MB/s, using 96% of the available bandwidth. A quick check in AWS confirmed that it was indeed disk, and NOT network, that was the bottleneck.

I have 3 questions:
Why would Meteor be using a crazy amount of disk IO? My guess is that something to do with the oplog was being piped to disk by Meteor for reading later.

Why would 2 identical servers (with nothing and no one using them) have different recovery times - around 15 minutes for one, 30 for the other? It seems like Docker may have restarted the container (which is what caused the recovery).

How can I defend against this? Is there a way to limit the amount of IO that Meteor uses for oplog tailing, even if it means we sometimes serve stale data? The issue here is that the servers in question didn't respond to any requests while this was ongoing.
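For example, if I understand the docs correctly, find() takes per-cursor server-side options like disableOplog and pollingIntervalMs, and something along these lines would be an acceptable trade-off for us (collection and field names made up for illustration):

```js
import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';

// Made-up collection, purely for illustration.
const Records = new Mongo.Collection('records');

Meteor.publish('recentRecords', function () {
  return Records.find(
    { updatedAt: { $gte: new Date(Date.now() - 60 * 60 * 1000) } },
    {
      disableOplog: true,       // server-only option: skip oplog tailing for this cursor
      pollingIntervalMs: 10000, // fall back to poll-and-diff every 10s - stale data is fine here
    }
  );
});
```

That's per cursor though, so I'm not sure it would cap the overall oplog tailing work on the server.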

Are you using redis-oplog? Regular oplog will be painful indeed.

No, we're not using redis-oplog, but could you elaborate? It's one of the options we're exploring to mitigate these problems.

Regular Mongo oplog tailing is very expensive CPU-wise and does not scale well. At some point you hit diminishing returns (each additional server adds so much load to the remaining servers that you actually need bigger servers to compensate). Redis-oplog takes care of that by using a Redis pub/sub bus instead of watching the Mongo oplog.
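Setup is roughly this, if I remember the cultofcoders:redis-oplog README correctly (double-check it for current options): add the package with `meteor add cultofcoders:redis-oplog`, then point it at your Redis instance in settings.json:

```json
{
  "redisOplog": {
    "redis": {
      "port": 6379,
      "host": "127.0.0.1"
    }
  }
}
```

Publications and observers then get their change notifications from Redis instead of each server tailing the oplog itself.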

I misread your message - I thought you meant "redis-oplog will be painful indeed". Yes, it is the regular oplog, and I'm aware of the CPU problems; those are OK for now (the sync runs very late at night when not much is going on and only lasts 30 minutes to an hour). The problem is that memory usage and IO usage increase to the point that the server hangs (it doesn't crash, it just hangs). My guess is that Meteor is caching the oplog changes for later use.

{reactive:false} in your DB calls?

Won't help - it's not the queries. The servers in question have no sessions; it's literally just observing the changes.
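For context, these idle servers run roughly this (names are made up) - there's no client-side query to pass an option like reactive: false to:

```js
import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';

// Hypothetical sketch of the idle servers' workload: no publications, no
// client sessions, just a server-side observer on the collection that
// receives the bulk update.
const Records = new Mongo.Collection('records');

Meteor.startup(() => {
  Records.find({ processed: false }).observeChanges({
    added(id, fields) {
      // queue the new record for processing
    },
    changed(id, fields) {
      // re-queue when a record changes
    },
  });
});
```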