Tonight we’re seeing peak traffic on our site. We’re running 40 DO droplets, but the issue seems to be that the Mongo database is taking far too long to respond, according to Kadira and the fact that operations take a long time even on a server with no users connected to it.
What are common strategies to deal with such issues?
Finally, MongoDB is really not a reactive DB. Oplog tailing is a design mistake because it isn’t scalable: if you share your DB across all your DO instances, each additional instance offers only incremental improvement, since every instance has to watch the activity of ALL users to detect its own.
We are migrating Meteor to RethinkDB which has built in reactivity.
799 active connections is a problem with MongoDB because it creates a thread per socket. All of these connections tend to be active, since the drivers send ping and isMaster commands so regularly, so the number of context switches is just insane.
Also, does Compose actually give its users the machine specs, i.e. how many cores are you actually running on? This is very important, because if your instance is pinned to only one core (worst case, a single vCPU) and you have 799 active threads, this will not go well for you.
Take down about 20 instances and watch how many connections are dropped. I don’t develop with Meteor myself, so I’m not sure whether the framework provides a way to pass the Mongo options down to the driver. It should; otherwise it wouldn’t make sense.
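If Meteor does forward driver options, the usual channel is the MONGO_URL connection string. A hedged sketch: `maxPoolSize` is a standard MongoDB connection-string option for capping the driver’s pool, though whether your Meteor version forwards every option to its bundled driver is worth verifying; the hostname below is made up.

```javascript
// Sketch: capping the driver's connection pool via the connection string.
// maxPoolSize is a standard MongoDB connection-string option; the exact
// effect depends on the driver version Meteor bundles, so verify locally.
const mongoUrl = "mongodb://db.example.com:27017/app?maxPoolSize=5";

// The option is just a query parameter, so it is easy to inspect:
const parsed = new URL(mongoUrl);
console.log(parsed.searchParams.get("maxPoolSize")); // "5"
```

You would set this as the MONGO_URL environment variable for each instance, so a smaller pool per instance multiplies into far fewer total connections at the database.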
Requests per second at the load balancer: I assume you have some sort of proxy routing requests to the 40 instances, and I’m curious what the RPS is at the moment. Typically an idle Meteor instance opens about 14 connections to MongoDB. In your case that is 560 connections already, with zero traffic, so the 799 adds up. Unfortunately it is too much.
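The baseline math above can be sketched in a few lines; the ~14 connections per idle instance figure comes from the thread itself, everything else is arithmetic:

```javascript
// Back-of-the-envelope estimate of baseline Mongo connections.
// ~14 connections per idle Meteor instance, per the discussion above.
const instances = 40;
const idleConnectionsPerInstance = 14;

const baseline = instances * idleConnectionsPerInstance;
console.log(`Baseline connections with zero traffic: ${baseline}`); // 560

// Halving the instance count roughly halves the baseline:
console.log(`With 20 instances: ${20 * idleConnectionsPerInstance}`); // 280
```

This is why dropping ~20 instances is a cheap first experiment: it should visibly cut the connection count before any tuning.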
I’m using Nginx as the load balancer, but I’m not sure how to find the RPS. We had about 1,000 connected users at the time, and it’s far fewer now, but the issues have persisted.
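One rough way to get RPS out of Nginx is to count access-log entries per second. A minimal sketch, assuming the default “combined” log format; the sample lines below are made up, and in practice you would read the real access log (commonly /var/log/nginx/access.log):

```javascript
// Rough requests-per-second from Nginx access-log lines in the default
// "combined" format. Sample lines stand in for the real log file.
const lines = [
  '10.0.0.1 - - [09/Oct/2015:20:15:01 +0000] "GET / HTTP/1.1" 200 612 "-" "-"',
  '10.0.0.2 - - [09/Oct/2015:20:15:01 +0000] "GET /app HTTP/1.1" 200 1024 "-" "-"',
  '10.0.0.3 - - [09/Oct/2015:20:15:02 +0000] "POST /sockjs HTTP/1.1" 200 87 "-" "-"',
];

// Group by the bracketed timestamp, which has one-second resolution.
const perSecond = {};
for (const line of lines) {
  const match = line.match(/\[([^\]]+)\]/);
  if (!match) continue;
  const second = match[1]; // e.g. "09/Oct/2015:20:15:01 +0000"
  perSecond[second] = (perSecond[second] || 0) + 1;
}

const counts = Object.values(perSecond);
const peak = Math.max(...counts);
console.log(`Peak RPS in sample: ${peak}`); // 2
```

Run over a real log during the traffic spike, the peak and average of those per-second counts give you the RPS number asked about above.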
There is one essential worker instance that is doing a lot of writes to the database.
Things have settled a little right now, but it’s been two hours of hundreds of complaints and things are still slow.
Thanks for the help. I’m pretty certain the issue was, as you said, that I had far too many connections to the database. I’ve reduced that number now and also increased the RAM for the deployment.