I wanted to write this to share my experience, get a better understanding of what happened, and perhaps help somebody make the right decision.
I have a production Meteor application in the education domain with around 5000+ subscriptions. The application averages about 1,500 daily visitors (per Google Analytics) and has approximately 150 concurrently active users at peak (also from Google Analytics). This was handled by an app server running the Meteor app on a 2GB/1vCPU Droplet, and a DB server running MongoDB on a separate 2GB/1vCPU Droplet. They were in different subnets but in the same data center (Singapore), with the app server talking to the DB server over its private IP.
Everything was running fine and the servers were happily taking the load thrown at them. I had read about servers reaching capacity due to Meteor subscriptions, so I was actively monitoring things. There were occasional spikes but nothing alarming. Then about 6 weeks back, one fine day, both servers suddenly hit 100% CPU. I assumed some rogue process was running, so I restarted the Meteor app, but within a couple of minutes they went back up to 100%. Somehow that day went by, and the next day things returned to normal. While checking the graphs in the DO panel I noticed an alert about a network problem in Singapore, but I did not pay much attention since things were back to normal.
Things were fine for another two weeks. In India it was the Dussehra holidays, so usage was quiet. As soon as the system started getting traffic after the holidays, the app server Droplet started behaving strangely, now and then hitting 100% CPU and triggering alerts. Now we were in a fix: with as few as 50 active users the Droplet was hitting 100% CPU, while the DB server had hardly any load. I thought we might be reaching Meteor's limits for a single CPU, so I decided to load balance with nginx across two instances of the Meteor app, and increased the Droplet to 4GB/2vCPU. After this, things became normal.
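For anyone considering the same move, the load-balancing setup looked roughly like this. This is a sketch, not our exact config; the ports and `server_name` are placeholders:

```nginx
# Illustrative sketch -- ports and server_name are placeholders.
upstream meteor_app {
    # Two Meteor instances running on the same Droplet.
    # ip_hash gives sticky sessions, which Meteor's sockjs
    # fallback transports rely on.
    ip_hash;
    server 127.0.0.1:3000;
    server 127.0.0.1:3001;
}

server {
    listen 80;
    server_name app.example.com;  # placeholder

    location / {
        proxy_pass http://meteor_app;
        proxy_http_version 1.1;
        # Pass WebSocket upgrade headers through so DDP works.
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

Each Meteor instance is started separately with its own `PORT` (3000 and 3001 here), and nginx distributes clients between them.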
Then suddenly last Saturday the app server Droplet stopped responding: not accessible through the DO online console, not pingable, not reachable over SSH. We went into panic mode, as it went down at 7pm IST on a Saturday, exactly when folks access the app. Worse, we realised there was no way of contacting DO support; we tried raising tickets, but got no acknowledgement. In desperation, after two hours of silence from DO on the four tickets we had raised, we tried powering down the Droplet. Luckily it worked and we could log into the app server, but when we started the app it errored out because the DB server's private IP was not reachable. We then found that the public IP was reachable, so we changed the app server to use the public IP to reach the DB server, and somehow managed to save the day. From DO there was absolutely no response or acknowledgement of the tickets. The next morning we got a mail from DO saying the app server Droplet's underlying machine had problems and they were migrating the Droplet to a new host. Even after 12 hours there was hardly any update on the tickets; the first response came after 18 hours.
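The switch itself was just a matter of restarting the app with a MONGO_URL pointing at the DB server's public IP instead of the private one. A sketch of what that looked like; the addresses, credentials, and database name below are placeholders, not our real values:

```shell
# Before: private IP (unreachable after the incident)
# export MONGO_URL="mongodb://appuser:secret@10.130.0.5:27017/mydb"

# After: public IP (placeholder address and credentials)
export MONGO_URL="mongodb://appuser:secret@203.0.113.10:27017/mydb"
node main.js   # or however the bundled Meteor app is launched
```

Note that reaching MongoDB over its public IP means the port is exposed to the internet, so authentication and a firewall rule restricting the source IP are essential; this was an emergency measure, not something to leave in place.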
Let me summarise my learnings:
DO is good when things are good, but falls apart when there is a problem. They are neither reachable nor do they respond to queries; it is an utter black box, and one is left to fend for oneself.
A Meteor app can behave strangely if there is a problem in the connection between the DB and app servers.
We suspect we unnecessarily moved to a bigger Droplet, when the real problem was the Droplet's underlying machine.
Now I would be curious to know:
When a Meteor app is used with DB-as-a-Service, can network latency result in CPU spikes, forcing one to provision a higher-capacity app server?
Have you had bad experiences with DO?
Is it good to host a production app on DO?