For the last 3 years, I have been running a service for a client on Galaxy. It’s a simple B2B online ordering platform. Any significant downtime may cost my client extra staff shifts and lost working time. I’m a one-man band, so I decided to host it on Galaxy.
For reference, the Galaxy setup was always running at least 2, and often 3, Compact (512 MB) containers, or even the 1 GB containers, as I hoped this could fix the issues.
Over the last 4 months, I received repeated reports that the app’s performance was really slow. There are three pages (dashboard, ticket, client-admin) that all use different subscriptions/filters. The time for all subscriptions to become ready averaged between 4 and 6 seconds, but about 1 in 10 connections experienced a spike of around 40-50 seconds. A staff member often has to open these tickets while on the phone with impatient clients, and this was not acceptable.
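For context, the timings above are the time from page load until every subscription on the page reports ready. A minimal sketch of how that can be measured on the client (the publication names here are placeholders, not my actual publications):

```js
// Client side: time how long it takes for all of a page's subscriptions to be ready.
// The publication names ("tickets", "clients") are placeholders for illustration.
import { Meteor } from 'meteor/meteor';
import { Tracker } from 'meteor/tracker';

const start = Date.now();
const handles = [
  Meteor.subscribe('tickets'),
  Meteor.subscribe('clients'),
];

Tracker.autorun((computation) => {
  // ready() is reactive, so this autorun reruns as each subscription arrives.
  if (handles.every((h) => h.ready())) {
    console.log(`All subscriptions ready after ${Date.now() - start} ms`);
    computation.stop();
  }
});
```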
I initially thought this might be because I was hosting on a shared mLab server and that requests from other tenants might be delaying the processing.
I ported the database to MongoDB Atlas, where I have an M10 cluster for other clients.
The time to subscriptions being ready improved, with the normal subscriptions taking around 2-3 seconds instead, even though the database was now in a different AWS data centre. The spikes remained and did not seem to improve.
I then deployed the app on a different host (a 1 GB Vultr VPS) against the same Atlas DB so the client could test it internally. The normal subscriptions were now ready in under 1 second, and I have seen no evidence of spikes causing any noticeable delay.
Occasionally, one of the Galaxy containers restarted because the cluster was reported as unhealthy. The metrics usually showed that one of the containers had been sitting with no free memory for a few minutes. In rare instances this meant a client ticket was not completed, and when the page reloaded the client had to start over.
I had spent days trying to identify the cause of these memory spikes (which did not coincide with the subscription issues). They had been a major reason for me not to move this service to Vultr entirely, as there is a significant cost if it is not running and I am unable to restart it.
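To help correlate those spikes with what the Node process itself reports (rather than only Galaxy’s container metrics), a minimal sketch of the kind of periodic server-side logging that can help (the 30-second interval is arbitrary):

```js
// Server side: periodically log the Node process's memory usage so spikes
// can be matched against the hosting platform's container metrics.
import { Meteor } from 'meteor/meteor';

Meteor.startup(() => {
  Meteor.setInterval(() => {
    const { rss, heapUsed, heapTotal } = process.memoryUsage();
    const toMb = (bytes) => Math.round(bytes / 1024 / 1024);
    console.log(`memory rss=${toMb(rss)}MB heapUsed=${toMb(heapUsed)}MB heapTotal=${toMb(heapTotal)}MB`);
  }, 30 * 1000);
});
```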
- Except for deployment outages of 1-2 minutes, I have had 100% uptime on all 9 of my Meteor services that I run with mup on Vultr (for over 18 months; a sketch of the mup setup is below this list).
- This application, which has now been running on Vultr for 2 weeks, has not shown any signs of sudden CPU or memory usage, which makes me think whatever process is causing this is not part of my app but something in Galaxy.
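For reference, a minimal sketch of a typical mup.js for this kind of single-VPS deployment; the host, domain, app name, and Atlas connection strings below are placeholders, not my real values:

```js
// mup.js - sketch of a Meteor Up config for a single Vultr VPS.
// Every host, credential, and connection string here is a placeholder.
module.exports = {
  servers: {
    one: {
      host: '203.0.113.10',      // Vultr VPS IP (placeholder)
      username: 'root',
      pem: '~/.ssh/id_rsa',
    },
  },
  app: {
    name: 'ordering-app',        // placeholder
    path: '../',
    servers: { one: {} },
    buildOptions: { serverOnly: true },
    env: {
      ROOT_URL: 'https://app.example.com',
      // Atlas connection strings (placeholders); MONGO_OPLOG_URL lets Meteor
      // use oplog tailing instead of the slower poll-and-diff observers.
      MONGO_URL: 'mongodb+srv://user:pass@cluster0.example.mongodb.net/app',
      MONGO_OPLOG_URL: 'mongodb+srv://user:pass@cluster0.example.mongodb.net/local',
    },
    docker: { image: 'zodern/meteor:latest' },
  },
  proxy: {
    domains: 'app.example.com',
    ssl: { letsEncryptEmail: 'admin@example.com' },
  },
};
```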
Any suggestions or pointers would be welcome. The app runs Meteor 1.8.1, uses pub/sub as its data model, and is built in React with some remnants of Blaze.