Critical: Galaxy network failures

Galaxy apps (at least mine) fail to load across regions. From my testing so far, there seem to be no errors, just hangs:

  • First, users reported issues from multiple regions (France, Vietnam, Canada)
  • Second, I verified by connecting to the apps through a VPN from other regions, and confirmed that the network requests hang
  • Third, to prove it wasn’t my app itself, I migrated to Railway yesterday, and network failures are not happening there.

I started having issues in us-east-1. I’m guessing some limit is reached in Galaxy’s piping when the app is sending many files at once? It’s not many users, so the load isn’t intense.

After a number of successful file transfers, the endpoint starts failing with 503 errors:

I just experienced the hung connections in us-east-1; the page just sits there, not loading:

I’ve received both 502 and 503 errors.
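
To tell these failure modes apart the next time it happens, a small probe could label each request as a hang (timeout), a 502/503 from the proxy, or a normal response. This is only a rough sketch using Python's standard library; the `probe` and `classify` names, the labels, and the 10-second timeout are illustrative assumptions, not anything Galaxy provides.

```python
import socket
import time
import urllib.error
import urllib.request


def classify(status, timed_out):
    """Map one probe result to a coarse label (labels are illustrative)."""
    if timed_out:
        return "hang"
    if status in (502, 503):
        return "gateway_error"
    if status is not None and 200 <= status < 400:
        return "ok"
    return "other_error"


def probe(url, timeout=10.0):
    """Fetch url once and return (label, http_status_or_None, seconds)."""
    start = time.monotonic()
    status, timed_out = None, False
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()  # read the body too, so slow transfers count as hangs
            status = resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive here; keep the status code.
        status = err.code
    except OSError as err:
        # URLError wraps the underlying cause in .reason; a read timeout
        # is raised directly without a wrapper.
        reason = getattr(err, "reason", err)
        timed_out = isinstance(reason, (TimeoutError, socket.timeout))
    return classify(status, timed_out), status, time.monotonic() - start
```

Calling `probe("https://your-app.example.com/")` (a placeholder URL) from each VPN region and logging the returned tuples with a timestamp would show whether the hangs and the 502/503 responses cluster by region or by time of day.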

There seems to be something that happened to other regions first, and now I’m finally experiencing it in US East.

For example, for one path, it takes 30 seconds to load a few MB:

Other websites are fine, so it isn’t my network speed. It’s the app on Galaxy.

Sometimes the HTML page itself returns a 503 with “No server is available to handle this request”:

If I keep trying and get lucky enough to have the app load, it still seems to always fail on the WebSocket connection:
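
One way to check the WebSocket upgrade separately from the page load is to send the opening handshake by hand and look at the status line that comes back. The sketch below does that with Python's standard library. The `/websocket` path is an assumption about where the app exposes its raw WebSocket endpoint, and `check_upgrade`, `build_request`, and `expected_accept` are hypothetical helper names.

```python
import base64
import hashlib
import os
import socket
import ssl

# Fixed GUID defined by the WebSocket protocol (RFC 6455).
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def expected_accept(key):
    """Sec-WebSocket-Accept value a compliant server returns for this key."""
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def build_request(host, path, key):
    """Build the HTTP/1.1 upgrade request for the opening handshake."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Key: {key}\r\n"
        "Sec-WebSocket-Version: 13\r\n\r\n"
    )


def check_upgrade(host, path="/websocket", port=443, timeout=10.0):
    """Return (upgraded, status_line); a timeout propagates as an exception."""
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    ctx = ssl.create_default_context()
    reply = b""
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(build_request(host, path, key).encode("ascii"))
            # Read until the end of the response headers or the peer closes.
            while b"\r\n\r\n" not in reply:
                chunk = tls.recv(4096)
                if not chunk:
                    break
                reply += chunk
    text = reply.decode("latin-1")
    status_line = text.split("\r\n", 1)[0]
    upgraded = " 101 " in status_line + " " and expected_accept(key) in text
    return upgraded, status_line
```

A result like `(False, "HTTP/1.1 503 Service Unavailable")` would suggest the proxy in front of the app is rejecting the upgrade, while a timeout would suggest the same kind of hang as the page loads.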

Hi @trusktr

Have you opened a support case with us for investigation? We have a large number of customers running Galaxy across different regions and continents and haven’t had reports of this, so it’s likely something specific to your organization and your app.

Reach out through our real-time chat and ask to be directed to Philippe. I’ll investigate your case closely. We receive a high volume of support requests and try to centralize everything through the chat so nothing gets lost.

Best regards,

By any chance, are you seeing this problem when using Meteor 3.5-beta.10?

I am also seeing this issue in one of the apps I am testing as part of the 3.5 beta, and the runtime logs show problems with Change Streams.

I ask because you have been participating in the Meteor 3.5 post and discussing specific experimental features in that beta version. If you are getting that error on 3.5-beta.10, reverting to the stable, recommended version (Meteor 3.4.1) may help.

It is important to follow up on the specific actions taken and provide context. If you decide to opt in to a beta version like 3.5-beta.10, which we really appreciate for the testing and feedback, could we ask you to continue the follow-up in that specific topic?

I have detected a similar problem, and I think yours could be related to 3.5-beta.10. I am researching it to understand what causes it and report any findings to the Change Streams and 3.5 specialists.

Together, we will make the next Meteor releases stable.

In one week, our app has stopped responding twice. We have several subdomains connected to Galaxy. Let’s say a.mydomain.com, b, c, d, etc.

Suddenly, at almost the same time on different days, none of our subdomains responded. The app itself was running in Galaxy, since the domain that Galaxy provides worked. This has never happened before, and it only affected the app that has been published to Galaxy metal. There were no errors anywhere, just an HTTP 502 response. All my domains are on Route 53 in AWS, and all my other domains kept working in the meantime.

I don’t know if this is the same issue as yours, but it is very strange anyway.

No, I had the same issues on earlier versions too.

I’m seeing the same problems on other services like Railway:

I’m guessing this is an Amazon AWS problem? Is it because Iran bombed two of their data centers in the Middle East in retaliation for US bombing? Has AWS changed something? What could it be?

Today the connections are all good again. Unfortunately, I didn’t get to investigate while the connection was bad (e.g. with traceroute while VPN’ed to other countries).

This seems to be something outside of the control of Galaxy or Railway. Maybe AWS, or maybe domain systems. I’m not sure what yet. I’ll debug when it happens again.