I’m hoping someone can shed some light on an issue I’m having (the screenshot below is from the second time it has happened). Scenario:
- APM shows a spike in total sessions - usually it’s at around 200-300, but this spike took it to over 1200 for 30-60 minutes
- There is basically no way this was a spike in actual app usage
- The version of the app running on Galaxy is stable and hasn’t been updated in 6-8 weeks, and this is just the second time this has happened (the other instance was about two weeks ago). I feel that if this were code-related, it would happen more frequently
- We are running 3 of Galaxy’s smallest containers to handle our constant 200-300 sessions
- The spike caused all 3 Galaxy containers to restart repeatedly, each reporting ‘Failed health check’. No console errors or error logs of any sort were reported
- I believe the corresponding CPU usage spike is a result of the servers restarting rather than the other way round
- Galaxy support say it has nothing to do with them. I don’t think it does in this case, though in the past they have brushed off an actual issue in the same way, so I’m not 100% confident in what they say
- Theory 1: The session count grew beyond what the 3 servers could handle, which caused one of them to restart and set off a domino effect.
- Theory 2: A bot was creating connections to my app, which would explain the spike. Is there any way of ruling this out?
Does that even make any sense at all? What am I missing? Any other theories or troubleshooting steps I can take to try and get to the bottom of this?
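One thing I could try for the bot theory is counting open connections per client IP on the server, so that a single address holding hundreds of sessions would stand out in the logs. This is only a rough sketch and I haven’t deployed it: `ConnectionTally` is a hypothetical helper I made up, and it assumes `connection.clientAddress` reflects the real client behind Galaxy’s proxy (which depends on `HTTP_FORWARDED_COUNT` being set correctly).

```javascript
// Hypothetical helper: counts open connections per client IP so that one
// address holding an unusual number of sessions is easy to spot.
class ConnectionTally {
  constructor() {
    this.counts = new Map(); // ip -> number of currently open connections
  }

  open(ip) {
    this.counts.set(ip, (this.counts.get(ip) || 0) + 1);
  }

  close(ip) {
    const next = (this.counts.get(ip) || 0) - 1;
    if (next > 0) {
      this.counts.set(ip, next);
    } else {
      this.counts.delete(ip);
    }
  }

  // Returns the n busiest addresses as [ip, count] pairs, busiest first.
  top(n) {
    return [...this.counts.entries()]
      .sort((a, b) => b[1] - a[1])
      .slice(0, n);
  }
}

const tally = new ConnectionTally();

// Server-side wiring. Meteor.onConnection passes a connection object that
// exposes clientAddress and onClose. Assumption: clientAddress is the real
// client IP, which behind a proxy depends on HTTP_FORWARDED_COUNT.
if (typeof Meteor !== "undefined" && Meteor.isServer) {
  Meteor.onConnection((connection) => {
    const ip = connection.clientAddress;
    tally.open(ip);
    connection.onClose(() => tally.close(ip));
  });

  // Log the five busiest addresses once a minute.
  Meteor.setInterval(() => {
    console.log("top connections by IP:", JSON.stringify(tally.top(5)));
  }, 60 * 1000);
}
```

If the next spike shows most sessions coming from a handful of addresses, that would point towards a bot. An even spread across many addresses would not rule one out, but it would make a single misbehaving client less likely.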
Thanks in advance!