Redis Oplog and AWS ElastiCache

I’ve been a long-time production user of redis-oplog, using it with a cloud Redis server from compose.io. Well, yesterday they had a 12+ hour outage which hosed my app (along with many others), and I had to scramble to switch over to scalegrid.com. I wanted to set up an AWS ElastiCache cluster, but the setup was too complicated to figure out on the fly. Now that the app is stable on ScaleGrid, I was wondering if anyone has experience with setting up ElastiCache (@ramez @diaconutheodor). I have a few questions about the settings.

  • Does it work with the replicas setting? I’ve only ever used redis-oplog with a stand-alone instance (which is why I want to use a replica for failover). Does failover actually work? It seems like when you use replicas you have a write URL and a read URL. How does that square with the single redis-oplog url parameter?
  • What’s a good cache node type? It should work with a pretty small VM, right?
  • Does multi-availability-zone (Multi-AZ) work?
  • Encryption at rest? Encryption in transit? Does anyone have these working with redis-oplog?

I’m assuming backups aren’t needed since it’s just a cache. Would love some feedback, best practices, and info on this.

I’ve set this up. It’s not too bad, but I never got the failover working properly. I’m not sure if the problem is with redis-oplog, my config in Meteor, or my AWS config. On the bright side, neither the production nor the staging cluster I set up has had an outage in over 18 months. I used a t2.medium and it was way too big, but it will depend on your usage. I’d advise setting up an entirely new cluster rather than trying to resize an existing one, and then switching over to the new one. Given that failover doesn’t work nicely, neither does multi-AZ availability, unfortunately. Happy to share my config privately if you like.


Thanks for the VM recommendation. Yeah, I think we’ll be able to get away with something pretty small too. Even with a lot of users, Redis doesn’t seem to have much trouble with throughput as a cache.

Did you get any of the encryption or SSL stuff working? So you’re only running a single stand-alone instance too? That scares me somewhat after what happened with compose.io, as our app relies on reactivity between users; it’s not just UI icing for a single user. If Redis goes down, our app becomes non-functional.

And sure, you can DM your config. I may have more questions. Thanks!

We have a second instance, but the failover doesn’t seem to work; you’d have to trigger it manually, so there’s not much point. Encryption in transit, yes; encryption at rest I didn’t bother with, no need. I’m away for the next week. I’ll get you the config when I get back.
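
If you do want to exercise a failover deliberately, AWS exposes a TestFailover API for replication groups. A rough sketch with placeholder IDs (it fails over the primary of the given shard so you can watch how redis-oplog reacts):

# Force a failover of shard 0001 in the replication group (IDs are placeholders)
aws elasticache test-failover \
  --replication-group-id my-redis-group \
  --node-group-id 0001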

Is it possible to access ElastiCache from Galaxy? A lot of the articles mention only being able to access it from within AWS, inside your VPC. It doesn’t seem to be accessible via a typical redis:// URL.

My Galaxy deployment is technically in the same region, but of course it’s owned by Galaxy and not me. Are you using ElastiCache with your own AWS hosting? Or have you managed to get it to work with Galaxy and/or hosting outside of your own AWS?

You can for sure access it over a redis:// URL. I can’t imagine it wouldn’t be possible; if I had to guess, you’d need to configure security groups to allow either all public IPs (a little risky) or the specific IPs of your Galaxy containers. If Galaxy uses Amazon VPCs too, in theory they could set up a VPC bridge (peering) to allow communication between their cluster and yours. I don’t know if they expose that to customers though.

I’ve only needed it within my own VPC.
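
For reference, the missing piece is usually just an inbound rule on the cache’s security group. A sketch with placeholder values (within the same VPC you would normally reference the app servers’ security group instead of a CIDR):

# Allow inbound Redis traffic (TCP 6379) from a specific IP range
# (group ID and CIDR below are placeholders)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 6379 \
  --cidr 203.0.113.0/24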


@evolross

Really sorry for taking so long to answer

We use Redis (ElastiCache) on a single t3.micro instance (reserved, to save money) and we barely use any capacity. Nothing fancy: same VPC as the Meteor instances (which scale on Elastic Beanstalk).

We didn’t need to use any encryption as all data transmission is within a private VPC.

We use the ElastiCache DNS name, <instancename>.*.cache.amazonaws.com, for the url (host) parameter of redis-oplog, with port 6379.
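
In Meteor settings that ends up looking roughly like this (the hostname is a placeholder for your own ElastiCache endpoint; a fuller example appears later in the thread):

"redisOplog": {
  "redis": {
    "port": 6379,
    "host": "<your-elasticache-endpoint>.cache.amazonaws.com"
  }
}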

@ramez You host Meteor yourself on AWS. You’re not using Galaxy, right?

@evolross, right. We use Elastic Beanstalk in the same VPC as the Redis instance.


Hi @ramez and @znewsham,

I have been trying to use redis-oplog on AWS, but every time I try to configure it I get an ETIMEDOUT error, without many logs to debug the issue.

Did you guys encounter this issue? Can you share some insights on how to get redis-oplog working with ElastiCache and Elastic Beanstalk?

=> Starting Waves Meteor App
2020/10/05 10:57:17
Monti APM: Successfully connected
2020/10/05 10:57:18
[1601913438803] - [RedisSubscriptionManager] Subscribing to channel: __dummy_coll_fruBFiLAHMDbKTphr
2020/10/05 10:57:18
[1601913438806] - [RedisSubscriptionManager] Unsubscribing from channel: __dummy_coll_fruBFiLAHMDbKTphr
2020/10/05 10:57:18
[1601913438815] - [RedisSubscriptionManager] Subscribing to channel: __dummy_coll_fruBFiLAHMDbKTphr
2020/10/05 10:57:18
[1601913438816] - [RedisSubscriptionManager] Unsubscribing from channel: __dummy_coll_fruBFiLAHMDbKTphr
2020/10/05 10:57:18
Monti APM: completed instrumenting the app
2020/10/05 10:57:19
[1601913439652] - [RedisSubscriptionManager] Subscribing to channel: users::XXX
2020/10/05 10:57:19
[1601913439706] - [RedisSubscriptionManager] Received event: "u" to "users::XXX"
2020/10/05 10:57:19
[1601913439731] - [RedisSubscriptionManager] Subscribing to channel: meteor_accounts_loginServiceConfiguration
2020/10/05 10:59:25
RedisOplog - Connection to redis ended
2020/10/05 10:59:35
RedisOplog - There was an error when re-connecting to redis {"delay":10000,"attempt":1,"error":{"errno":"ETIMEDOUT","code":"ETIMEDOUT","syscall":"connect","address":"xxx.xx.xx.xxx","port":6379},"total_retry_time":0,"times_connected":0}

@pmogollon

We faced this issue as well. What we found is that Meteor (Node) times out on the redis connection as it is too busy handling other requests or processing too many redis messages.

We fixed some of the underlying issues in our fork of redis-oplog, which may or may not fit your use case:

  1. Our fork reduces the number and size of Redis messages, which in turn reduces the load on the listening instance (we do this by diffing against the existing doc to make sure we only send real changes).
  2. We made sure our server doesn’t have functions or methods that take up too much CPU time in one shot; we used Meteor.defer and Meteor.setTimeout to break up computationally intensive functions (Node is single-threaded, as you likely know); see the sketch below.
  3. We don’t use protectAgainstRaceConditions: false in cultofcoders:redis-oplog, as it sends the whole document, which might be too big and requires processing at the receiving end (solved in our fork by removing this option).
  4. On AWS we stopped using burstable instances (e.g. t3.large) in favor of compute-optimized instances (e.g. c5.large).

#2 and #4 are likely the biggest contributors to solving the issue, but I have to admit we didn’t do much testing to find out where the highest impact was; it works now and that’s all that mattered to us.
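
To make #2 concrete, here is a minimal sketch of the chunking idea (processItem is a hypothetical stand-in for your own per-item work, not code from our fork):

import { Meteor } from 'meteor/meteor';

// Process a large array in chunks, yielding to the event loop between chunks
// so Node can keep servicing DDP traffic and the Redis connection.
function processInChunks(items, chunkSize = 200) {
  items.slice(0, chunkSize).forEach(item => processItem(item)); // processItem is assumed
  const rest = items.slice(chunkSize);
  if (rest.length > 0) {
    Meteor.setTimeout(() => processInChunks(rest, chunkSize), 0);
  }
}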


First, thanks for the quick reply.

I was actually reading the thread on GitHub today. I’m going to try your fork and check what kind of benefits we get.

It’s interesting that this ETIMEDOUT always happens just seconds after startup, and it never reconnects.

We don’t do many CPU-intensive tasks in our core app; those functions are in Lambdas, and we are already on c5.large servers.

I’m going to see if your fork works with my setup and let you know how it goes.

Thanks.

Thanks, please let us know how the fork works (the instructions are in the README). Enable caching for amazing performance.

The fact that it breaks on startup is a hint. Are you doing a lot of setup at startup, either taking up CPU cycles or putting strain on Redis?

Just tested it and I get the same issue. Most likely I’m doing something wrong. My config looks like this:

"redisOplog": {
    "redis": {
      "port": 6379,
      "host": "XXXXXX.cache.amazonaws.com",
      "prefix": "PROD:"
    },
    "overridePublishFunction": true
  },

I don’t think I’m doing anything CPU-intensive on startup, but it usually spikes CPU on deploy. I’ll investigate that side, or maybe it’s just that I’m not setting up the security group in AWS correctly. Do you have any insights on that part?

Most probably I’m missing something stupid there.

Are you using oplogtoredis?

I’m asking because you have a prefix in your config. Where is that coming from?
Also, overridePublishFunction no longer does anything if you are using the latest version of cultofcoders:redis-oplog.

What I would do is go into the virtual machine of your Meteor instance, install redis-cli, and then connect to your Redis instance. Call the monitor command to see the messages.
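
Something like this, assuming redis-cli is installed on that machine (the hostname is a placeholder for your ElastiCache endpoint):

# Quick connectivity check, then watch the commands redis-oplog sends
# (add --tls with a redis-cli 6+ build if in-transit encryption is enabled)
redis-cli -h XXXXXX.cache.amazonaws.com -p 6379 ping
redis-cli -h XXXXXX.cache.amazonaws.com -p 6379 monitor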

Thanks Ramez, I’ll do that and let you know my findings.

The prefix was because I used the same Redis for the BETA and PROD environments. I’ll remove those two settings as well and see if it works. And no oplogtoredis, just redis-oplog.

Thanks @ramez, I just got it working. It was a missing rule in the security group.

I’ve got a question about your fork. When you talk about caching of collections, is that caching of find, findOne, etc., or only caching for specific reactivity functionality?

I mean, if I call Collection.findOne(), will it hit the cache?

First of all, a big thank you for the fork, ramez! You guys did amazing work improving the original package and fixing pressing issues.

I have just one question (for now) regarding caching: as documented, the TTL is apparently configured via the Meteor setting "cacheTimer" for all collections marked cacheable by invoking collection.startCaching(). Would it not make sense to allow separate TTLs on a per-collection basis? Documents in some collections could be seen as semi-permanent and be given TTLs of, say, several hours, while others are expected to be cached only for short periods.

Also, if I may suggest, providing some API for cache invalidation, ideally at the collection level, or, even more luxuriously, at the document level (collection name + document id), would be a great feature :slight_smile:

Yes, findOne and find both hit the cache.
Which is really cool; this has been my dream from day one.

This way, I can write my finds in my code and not worry about db hits. I don’t have to optimize anything; I just get the data when I need it.
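
To make that concrete, a small sketch based on the names mentioned above (Posts is a hypothetical collection; the exact API is in the fork’s README):

import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';

export const Posts = new Mongo.Collection('posts');

if (Meteor.isServer) {
  // Mark the collection as cacheable; find/findOne are then served from the
  // in-memory cache where possible, with the TTL taken from the "cacheTimer"
  // Meteor setting discussed above.
  Posts.startCaching();

  // Later, anywhere on the server, a read like this is answered from the
  // cache instead of going to Mongo (assuming the document is cached):
  // const post = Posts.findOne(postId);
}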

I used Drupal in the past and hated it. If you look at the number of queries it took to build each page, it was astronomical. So many db hits wasted across the internet. We don’t have to be part of that trend :slight_smile: