Speed/Reliability issues on Galaxy?

For the last 3 years, I have been running a service for a client on Galaxy. It’s a simple B2B online ordering platform. Any significant downtime may cost my client extra staff shifts, so reliability matters. I’m a one-man band, which is why I decided to host this on Galaxy.

For reference, the Galaxy setup was always running on at least two, often three, Compact (512 MB) containers, or even the 1 GB ones, as I hoped more capacity would fix the issues.

Speed issues

Over the last 4 months, I had repeated reports that the performance of the app was really slow. There are three pages (dashboard, ticket, client-admin) that each use different subscriptions/filters. The time for all subscriptions to become ready was between 4 and 6 seconds on average, but about 1 in 10 connections hit a spike of around 40-50 seconds. A staff member often has to open these tickets whilst on the phone with impatient clients, so this was not acceptable.
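For reference, this is roughly how the “time to ready” numbers above can be measured on the client; a sketch only, with placeholder publication names and filters rather than my actual code:

```js
// Rough client-side timing of how long a page's subscriptions take to become
// ready. Publication names and the filter are placeholders.
const start = Date.now();

const handles = [
  Meteor.subscribe('ticketsForDashboard', { status: 'open' }),
  Meteor.subscribe('clientsForDashboard'),
];

const timer = Meteor.setInterval(() => {
  if (handles.every((h) => h.ready())) {
    Meteor.clearInterval(timer);
    console.log(`all subscriptions ready after ${Date.now() - start} ms`);
  }
}, 100);
```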

I initially thought this might be because I was hosting the database on a shared mLab server, where other tenants’ requests could delay processing.

I ported the database to MongoDB Atlas, where I already have an M10 cluster for other clients.

The time for subscriptions to become ready improved, with the normal case now around 2-3 seconds, even though the database was now in a different AWS data centre. The spikes remained and did not seem to improve.

I then deployed the app on a different host (a 1 GB Vultr VPS) against the same Atlas DB so the client could test this internally. The normal subscriptions were now ready in under 1 second, and I have seen no evidence of spikes causing any noticeable delay.
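In case it matters for the comparison, the Vultr test instance points at the same Atlas cluster purely via the standard environment variables. A mup-style sketch with placeholder hosts, credentials and database names (not my real config):

```js
// Excerpt of a mup.js config for the Vultr test deployment.
// Host, credentials and database names are placeholders.
module.exports = {
  servers: {
    one: { host: '203.0.113.10', username: 'root', pem: '~/.ssh/id_rsa' },
  },
  app: {
    name: 'ordering-app',
    path: '../',
    servers: { one: {} },
    env: {
      ROOT_URL: 'https://orders.example.com',
      // Same Atlas cluster the Galaxy deployment uses:
      MONGO_URL: 'mongodb+srv://appUser:secret@cluster0.example.mongodb.net/orders?retryWrites=true',
      // Oplog tailing keeps pub/sub latency down; omit if oplog access
      // has not been set up on the cluster.
      MONGO_OPLOG_URL: 'mongodb+srv://oplogUser:secret@cluster0.example.mongodb.net/local',
    },
    deployCheckWaitTime: 120,
  },
};
```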

(Un)healthy Containers

Occasionally, one of the Galaxy containers restarted because the cluster was reported as unhealthy. The metrics would usually show that one of the containers had been sitting with no free memory for a few minutes. In rare instances, this meant that a client ticket was not completed, and as the page reloaded the client had to start over.

I had spent days trying to identify the cause of these memory spikes (which did not coincide with the subscription issues). They had been a major reason for me not to move this service to Vultr entirely, as there is a significant cost if it is not running and I’m unable to restart it.
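For anyone wanting to watch for the same thing, a crude server-side logger along these lines will at least timestamp the spikes in the container logs; the interval and threshold here are arbitrary, not from my actual setup:

```js
// Server-side sketch: log RSS/heap every 30 seconds so any memory spike shows
// up in the container logs with a timestamp. Interval and threshold are arbitrary.
import { Meteor } from 'meteor/meteor';

Meteor.startup(() => {
  Meteor.setInterval(() => {
    const { rss, heapUsed, heapTotal } = process.memoryUsage();
    const mb = (bytes) => Math.round(bytes / (1024 * 1024));
    console.log(`mem rss=${mb(rss)}MB heap=${mb(heapUsed)}/${mb(heapTotal)}MB`);
    if (rss > 450 * 1024 * 1024) {
      console.warn('memory close to the limit of a 512MB container');
    }
  }, 30 * 1000);
});
```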

However:

  • Except for deployment outages of 1-2 minutes, I have had 100% uptime on all 9 of the Meteor services I run with mup on Vultr (for over 18 months).
  • This application, which has now been running on Vultr for 2 weeks, has not shown any signs of sudden CPU or memory spikes, which makes me think whatever is causing this is not part of my app but something in Galaxy.

Any suggestions or pointers would be welcome. The app is running Meteor 1.8.1, uses pub/sub as its data model, and is built in React with some remnants of Blaze.

We never had problems running on Galaxy, but I don’t recommend running on Compact containers; they’re very slow.
Do you have indexes?

The lowest-tier Galaxy is definitely slow. I first tried deploying my Meteor app there and was taken aback by how long the responses were taking. It was probably a combination of Galaxy + low-tier Atlas, but requests were taking 4-5 seconds each.

Switched over to self-hosting both Meteor and Mongo, and everything’s almost instant.

Are you currently using indexes in your MongoDB database? I had a similar problem a while ago and was getting very big delays in some data fetching. After adding indexes, everything changed. Currently we are using Galaxy with MongoDB Atlas M10 and we handle big data requests without any problems.

Yes, my database is fully indexed for every criterion used in any subscription. One such index is not working reliably, but it still responds in < 300 ms on the self-hosted setup.


Yes, every criterion used in subscriptions is backed by an index. None of the collections are large (< 100k records) or have complex indexes (except one that relies on an index on the date but cannot easily index a second value). In one case we may get about 1,000 records, and I did consider paging, but that case is rare and the issues occurred not only with 1,000 records but also when it returned just 25-100.
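For illustration, ensuring these indexes from the Meteor server would look roughly like this; collection and field names are placeholders, not the real schema, and the same indexes could equally be created directly in Atlas:

```js
// Server-side sketch of ensuring indexes on startup.
// Collection and field names are placeholders, not the real schema.
import { Meteor } from 'meteor/meteor';
import { Tickets } from '/imports/api/tickets';

Meteor.startup(async () => {
  // Compound index backing the dashboard/client-admin subscription filters.
  await Tickets.rawCollection().createIndex({ clientId: 1, status: 1 });

  // The date-based index mentioned above; the second value used in that
  // filter is computed at query time, so it cannot easily be indexed.
  await Tickets.rawCollection().createIndex({ createdAt: -1 });
});
```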

The only problem we had with Galaxy was when we were using Compact instances, even though the workload was small.