I executed a fairly intensive database update today (a few million records inserted/updated over about 15 minutes) and 4 of the 8 servers that observe this database “hung” for about 30 minutes. Two of these servers do literally no other work - no connections, they are just Meteor servers set up to watch the DB (we’re in the process of moving our workload to them).
They didn’t crash (or they would have restarted), there were no memory issues, and CPU usage was “high-ish” (50-60% initially, dropping to 15% after about 10 minutes), but IO usage was insane. Running iotop showed that the two servers that should have been idle were reading/writing to disk at 20MB/s, using 96% of the available bandwidth. A quick check in AWS confirmed that it was indeed disk, and NOT network, that was the bottleneck.
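For context, the two otherwise-idle servers basically just run a handful of server-side observers against this database, along these lines (collection and field names below are placeholders, not our real schema; MONGO_OPLOG_URL is set on both boxes):

```js
// Simplified sketch of what the "idle" servers do. Collection name and
// selector are placeholders. With MONGO_OPLOG_URL pointing at the replica
// set's local database, Meteor drives observers like this one from its
// oplog tailer rather than by re-polling the query.
import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';

const Records = new Mongo.Collection('records');

Meteor.startup(() => {
  Records.find({ status: 'active' }).observeChanges({
    added(id, fields) {
      // React to changes pushed from the oplog-driven observer.
      console.log('added', id);
    },
    changed(id, fields) {
      console.log('changed', id);
    },
  });
});
```

If I understand the oplog driver correctly, every one of those few million writes becomes an oplog entry that each of these servers has to read and match against its live queries.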
I have 3 questions:
Why would Meteor be using such a crazy amount of disk IO? My guess here is that something oplog-related was being piped to disk by Meteor for later reading.
Why would 2 identical servers (with nothing and no one using them) have different recovery times - around 15 minutes for one and 30 for the other? It looks like Docker may have restarted the container, which is what caused the recovery.
How can I defend against this? Is there a way to limit the amount of disk IO that Meteor uses for oplog tailing, even if it means we sometimes serve stale data? The real issue is that the servers in question didn’t respond to any requests while this was going on.
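The only knobs I’ve found so far are the per-cursor options in the docs - `disableOplog` to opt a query out of oplog tailing, and `pollingIntervalMs` to make the fallback poll-and-diff run infrequently. Something like the sketch below (names and the 60-second interval are placeholders), though I don’t know whether this actually caps disk IO during a bulk write or just trades it for repeated full re-polls:

```js
// Opt one observer out of oplog tailing and fall back to infrequent
// poll-and-diff. Collection name, selector and interval are placeholders.
import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';

const Records = new Mongo.Collection('records');

Meteor.startup(() => {
  Records.find(
    { status: 'active' },
    {
      disableOplog: true,       // don't drive this observer from the oplog
      pollingIntervalMs: 60000, // poll roughly once a minute (stale data is acceptable for us)
    }
  ).observeChanges({
    added(id, fields) {
      console.log('added', id);
    },
  });
});
```

I could also just not set MONGO_OPLOG_URL on these boxes at all, which (as far as I know) turns off oplog tailing entirely, but I’d rather not go that far if there’s a gentler option.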