Proposal for Scaling Meteor without Oplog Tailing

Postgres, RethinkDB and Redis all have some sort of pub/sub built in.

Stack Overflow uses a mere 9 web servers to support 260k+ concurrent users with tons of reactive data, all fed from Redis’ pub/sub. Their biggest issue is running out of sockets at the OS level.

It makes the silly Blaze thread quite mind-boggling. What exactly is the point of fretting over which front end to stick onto a back end that doesn’t really work once it’s actually asked to work?

@zimme says that Hansoft really wants to give their solution back to the community. Hopefully we’ll see it one day. As @msavin suggested, it involved inserting an intermediary between Meteor and the database.

For sure, many things wouldn’t be noticed if they weren’t sub-millisecond. It seems, though, that with Meteor everything is assumed to be critically reactive by default. I see people in IRC and on the forums using MongoDB and pub/sub for chat messages.

Perhaps that’s a function of the docs and the way they present things.

3 Likes

I’ve been looking into how Phoenix handles its realtime ‘channels’, and they approach it differently. This isn’t going to get Meteor to 2 million WebSocket connections, but it seems to relieve the livequery pressure.

It actually works without any database, if that were needed. Instead of trying to sync to the database and tail an oplog, the approach for a public chat room would go something like this (the last three steps are the most interesting; a rough sketch follows below):

  • server listens for a "lobby:join" event
  • client loads the JS library, connects with one socket, and multiplexes different ‘channels’ over it
  • client calls "lobby:join"
  • server handles the join event (and optionally authenticates)
  • server’s after:join hook fires, queries for recent chats, and returns them to the client to prime its data
  • client’s "on join" callback fires and uses the response data (for Blaze, store it in a Session variable for template helpers?)
  • any client subscribed to the channel sends "msg:new" to the server
  • server handles "msg:new" by taking the new message and pushing it down to each subscriber of the lobby
  • each client, on msg:new, pushes the data onto its existing array of data

The data gets pushed down as new data is sent up to the server. You’re only listening to users connected to that ‘channel’.
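
Purely as an illustration, here’s a rough Node sketch of that flow, using socket.io rooms as a stand-in for Phoenix channels. The event names ("lobby:join", "msg:new") come from the list above; loadRecentMessages() and everything else is made up:

```js
const { Server } = require('socket.io');
const io = new Server(3000);

io.on('connection', (socket) => {
  // server listens for "lobby:join"
  socket.on('lobby:join', async (ack) => {
    // (optionally authenticate here)
    socket.join('lobby');
    // "after:join": query the recent chats and hand them back to prime the client
    const recent = await loadRecentMessages(); // assumed helper
    ack(recent);                               // fires the client's "on join" callback
  });

  // any client in the channel sends "msg:new"; push it to every subscriber
  socket.on('msg:new', (msg) => {
    io.to('lobby').emit('msg:new', msg);
  });
});
```

On the client you’d connect once, emit "lobby:join" with a callback to receive the primed data, and append to a local array on each "msg:new".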

Also, if you’re using Rethink or Postgres, you can have the server listen for triggers on a query and then push down a "msg:new" itself, instead of relying on the client to send up a change event.
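
A minimal sketch of that trigger-driven variant, using the official `rethinkdb` driver together with the socket.io server from the previous sketch; the table name and connection details are assumptions:

```js
const r = require('rethinkdb');

async function watchMessages(io) {
  const conn = await r.connect({ host: 'localhost', port: 28015 });
  const cursor = await r.table('messages').changes().run(conn);
  // Every insert/update on the table shows up here, so the server itself
  // can push "msg:new" instead of relying on clients to send a change event.
  cursor.each((err, change) => {
    if (err) throw err;
    if (change.new_val) io.to('lobby').emit('msg:new', change.new_val);
  });
}
```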

One issue, however: using a DB without query triggers means you can’t directly do an insert/update on the client and get reactivity. IMHO, sending a "msg:new" instead of calling Message.insert() isn’t a big loss for me, and it also makes it trivial to use other databases.

1 Like

Why not simply use Redis to cache and publish all the events while they are written asynchronously to MongoDB?

Example:

  • The client calls the method ‘msg.insert’: the method inserts the message into MongoDB and into Redis.
  • The publication on the server subscribes to Redis for a particular room and sends an ‘added’ event (sketched below).

This approach can scale easily and uses Redis as it should be used.

(Of course, we’ll need to imitate the DDP protocol on top of Redis to add the other events, like ‘changed’ and ‘removed’.)
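
Here’s a rough sketch of that shape, assuming a node_redis (v3-style) client, a Messages collection, and a made-up per-room channel name; it’s just to show the wiring, not production code:

```js
import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';
import redis from 'redis';

const Messages = new Mongo.Collection('messages');
const pub = redis.createClient();
const sub = redis.createClient();

Meteor.methods({
  'msg.insert'(roomId, text) {
    // Write to MongoDB as usual...
    const _id = Messages.insert({ roomId, text, createdAt: new Date() });
    // ...and publish the same document on a per-room Redis channel.
    pub.publish(`room:${roomId}`, JSON.stringify({ _id, roomId, text }));
    return _id;
  },
});

Meteor.publish('roomMessages', function (roomId) {
  // Prime the subscriber with the existing messages from Mongo.
  Messages.find({ roomId }).forEach((doc) => this.added('messages', doc._id, doc));

  // Then forward Redis messages as DDP 'added' events (no oplog involved).
  const onMessage = Meteor.bindEnvironment((channel, payload) => {
    const doc = JSON.parse(payload);
    this.added('messages', doc._id, doc);
  });
  sub.subscribe(`room:${roomId}`);
  sub.on('message', onMessage);

  this.onStop(() => {
    sub.unsubscribe(`room:${roomId}`);
    sub.removeListener('message', onMessage);
  });
  this.ready();
});
```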

2 Likes

So, a standard observer pattern. However, one thing you’ll need to consider when scaling horizontally is having a single node (or cluster of nodes) handle all traffic for a particular channel. That complicates your front-end proxy, but only slightly.

Otherwise you end up with something more resembling an event bus, where all data is just dumped onto a shared bus. That works OK in a stateless REST environment, where the first node with time to act on a message removes it from the bus. It does not scale when you’re trying to sync state across nodes.
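
Just to illustrate the routing idea (this is hypothetical, not from the video): the front-end proxy can hash the channel name so that all traffic for a given channel always lands on the same node.

```js
// Hypothetical channel-affinity routing: hash the channel name to pick the
// node that owns it, so channel state never has to be replicated across nodes.
function nodeForChannel(channel, nodes) {
  let hash = 0;
  for (const ch of channel) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return nodes[hash % nodes.length];
}

// e.g. nodeForChannel('lobby', ['node-1', 'node-2', 'node-3']) always returns the same node
```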

Check out the above video at 21:00, where they describe using Socket.io and Redis in a similar manner. When they got to concurrent user 1001 the entire setup collapsed. Adding more nodes just made things worse. Saturation city!

3 Likes

That’s really interesting! And frightening! It seems like Node is really hard to scale in general :frowning: Luckily I haven’t needed web scale yet :laughing:

Yeah, I was wondering how that would work across nodes. One thing I found out is that they don’t require sticky sessions. I dug into this further, and they’re using the Erlang VM’s ETS in-memory database to handle this somehow. If you’re using Phoenix on Heroku, there’s a Redis pub/sub connector. I wish I could dive further into how they’re doing this, but I’m getting in over my head now :smile:

I also asked the Phoenix author about how you would handle getting out of sync if you’re only listening to those client messages:

[Me] How does Phoenix handle clients getting out of sync using channels? It seems like if you add a database to the chat example it’s possible for the client to get out of sync?

[Chris]: you’d keep a “last_seen_id” on the client
on first join to a channel (app boot), you pass down a list of messages to sync the client up
subsequent joins (dropped connection/channel crash/reconnection), you pass up the last_message_id you saw
then your join query only sends messages since that
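
In Node terms (reusing the hypothetical socket.io sketch from my earlier post; loadMessagesSince() is an assumed helper), that resync would look roughly like:

```js
// Server: the join handler takes the last message id the client saw
// (null on first boot) and only returns what the client is missing.
io.on('connection', (socket) => {
  socket.on('lobby:join', async ({ lastSeenId }, ack) => {
    socket.join('lobby');
    const missed = await loadMessagesSince(lastSeenId); // assumed helper
    ack(missed);
  });
});

// Client: remember the newest id after the join ack and after every "msg:new",
// then pass it up again on the next (re)join:
//   socket.emit('lobby:join', { lastSeenId }, (missed) => { /* append missed */ });
```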

2 Likes

I talked about something like that a while ago, but nobody answered or gave their opinion:

1 Like

I imagine something like Redis or RabbitMQ would be needed. I’ve heard good things about MQTT being used for this too.

I’m actually experimenting with something like this now… should have the experiment open source soon!

I’m wondering what the limits are for just opening up a publication (as in, how many can one server handle)? And are multiple publications multiplexed over a single socket? (I’d assume so.)

2 Likes

As long as your requests are totally stateless (REST), you can scale Node like crazy. Any particular node doesn’t know that any other nodes even exist. The problem with oplog tailing is that every node is de facto observing the total state. You end up saturating the CPU really quickly when there is a write.

Unfortunately, in the video he doesn’t mention the limit they reached when switching from Socket.io to SockJS and Redis pub/sub. The final solution in the video was a high-performance DB proxy written in C. I’m not sure what Hansoft’s DB proxy is written in, but he did harp on about how most of them are “C++ gods”.

Zimmy was, however, indicating that the Hansoft changes were working fine at up to 10k concurrent connections per Meteor node. While that’s nowhere close to running out of ports or file descriptors, I’d still be happy with that!

https://vimeo.com/113703576

Here’s a talk on how Stack Overflow uses Redis (mostly live coding!). He compares SQL with Redis at one point, with SQL handling 9k ops while Redis churned out 19k ops.

Redis is fast. It’s all in memory and literally hand-tuned to take into account things like chip and instruction caches and whatnot. I have no information on the eventual limits of Postgres’ or RethinkDB’s pub/sub, but they won’t be close to Redis.

The guy in the video sure answered your question. The answer was… err no! At least not with socket.io.

3 Likes

But it will at least be better than tailing the oplog or subscribing to Postgres or Rethink, right?
We’re not talking about 500M users, just something better than the current way we do things.
I’m interested in this approach because it would be easy to do and it uses Redis as it should be used.

Certainly anything is better than the dark O(n^2) hole that is Meteor oplog tailing. Even sans some sort of DB proxy, using the pub/sub mechanisms built into Redis/Rethink/Postgres would be a vast improvement.
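
For Postgres, for instance, the built-in mechanism is LISTEN/NOTIFY. A minimal sketch with the `pg` driver, where the channel name and payload shape are my own assumptions:

```js
const { Client } = require('pg');

async function listenForMessages(onMessage) {
  const client = new Client(); // connection details come from env vars
  await client.connect();
  await client.query('LISTEN new_message');
  // A trigger (or the inserting code) runs: NOTIFY new_message, '<json payload>'
  client.on('notification', (msg) => {
    onMessage(JSON.parse(msg.payload));
  });
}
```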

However, using socket.io to replicate state across nodes isn’t something I’d put into the ‘anything’ category, as it causes the exact same problem. Look what happens to your diagram when you add a 3rd or 4th node. Now extrapolate that to, say, 10 nodes…

Even Roto-Rooter couldn’t unclog Node’s event loop at that point. It will look exactly like the diagram in the Hansoft video.

1 Like

So here’s my question:

How does a giant service like iMessage work?

A solution like oplog tailing doesn’t work, because there are so many messages being sent every second that your servers couldn’t possibly keep up with the oplog.

After talking with a few people, I believe the solution for a chat service like this is to have a database to look up which server each user is connected to. That way, when you want to send a message to someone, you can send it to the specific server that serves that user rather than spamming the entire cluster the way the oplog does.
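
Something like this toy sketch, with a Redis hash as the “which server is this user on” table (the key and channel names are made up, and it assumes a node_redis v3-style client):

```js
const redis = require('redis');
const client = redis.createClient();

// When a user's socket connects, record which server they're on.
function registerConnection(userId, serverId) {
  client.hset('user_servers', userId, serverId);
}

// To deliver a message, look up the recipient's server and publish only to
// that server's channel instead of broadcasting to the whole cluster.
function deliver(toUserId, message) {
  client.hget('user_servers', toUserId, (err, serverId) => {
    if (err || !serverId) return; // user offline (or lookup failed)
    client.publish(`server:${serverId}`, JSON.stringify({ toUserId, message }));
  });
}
```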

So that’s interesting. But it gets more complicated when you take a more challenging scenario like this:

How does the like count on a Facebook post get updated reactively?

Along the same lines as the last solution, perhaps there’s a database that keeps track of every post that every user is looking at on Facebook. Then, when you like or unlike a post, you can see which servers are hosting the users currently looking at that post and send the message off. This is interesting, but I couldn’t imagine Facebook keeping track of every post that every user is looking at. That just seems insane! Not to mention, there would likely be a lot of race conditions that could cause the like count to get out of sync. So how does this scenario work?

I’m a huge fan of the idea that the database would just take care of all this, but I’m not sure that’s realistic in the near term.

How does a giant service like iMessage work?

Apple doesn’t disclose much about iMessage, but WhatsApp does. From what I remember, if you break it down, they only send the data once when the user connects; after that they just send the messages and the app stores the new ones.

They can hold open the connections more easily because that part runs on Erlang, which is basically made to hold open lots of connections (it was built for phone switches)… though Phoenix has proven that you can also run around 2 million WebSocket connections on a single box before hitting OS limits (it also uses the Erlang VM, in a CoffeeScript sort of way).

I think this video explains more:
https://www.youtube.com/watch?v=c12cYAUTXXs

2 Likes

Thanks! In our case we can’t separate the data into a different DB, but I’m planning on moving all of the chat to one.

1 Like

We’ll try to move what we can to methods instead of subscriptions, but I’m not sure there’s anything we can move without killing the user experience. Hope we can find something better, or that MDG publishes a fix if there is one.

1 Like

We’ll try to move what we can to methods instead of subscriptions, but I’m not sure there’s anything we can move without killing the user experience. Hope we can find something better, or that MDG publishes a fix if there is one.

I’m working on an experiment that should be ready this weekend (with screencasts of load testing on blitz.io). The basic gist is that you can opt some publications into using this without having all of them use it.

Meteor just passes data through (from a secondary app on the same server), so Meteor’s limit becomes the number of WebSockets it can hold open per instance. You’ll also be able to use publication/subscription the same as now (though I’m thinking ditching this could offer large gains). I’m aiming at getting a chatroom example app to handle several thousand concurrent users on a small $10 DigitalOcean box.

It should be very easy with a small helper package to glue the wiring together… and it will be open-sourced of course!

Does anyone know of a rough writes-per-second limit at which the oplog gets clogged?

Thanks! In our case we can’t separate the data into a different DB, but I’m planning on moving all of the chat to one.

@bitomule do you have a certain DB in mind? In my tests I’ll be using Mongo for one and RethinkDB for the other (the latter simplifies things greatly, and the new 2.2 release adds atomic changefeeds; it should be released in a few hours).

2 Likes

I’m not sure if we’ll move the chat to another Meteor + Mongo; it’s not planned yet. The issue we have is that our app is already built using publish-composite, so we need something that fits with that and doesn’t require changing the whole app.

Does the chat data need publish-composite, or was that the other app data? If my experiment is performant enough, you could still use Mongo and publish multiple collections. We’ll see how that goes though!

Edit: ah yeah, I think I understand now… if you moved chat to a different DB, the rest of the app couldn’t do a composite join with the chat data?

Chat could be separated into a different DB, or it could use what you’re working on when you release it :smile: Not sure how it will work with our app. If we don’t use a different DB, how would it fix the oplog tailing issue?

By not using it :smile:

There will be lots of ways to skin it, from least amount of work to most performance. I’ll be able to give you a better idea once I can try out different combos.

The easiest solution should just work with a simple adapter in the publication.

1 Like

Yeah, that’s pretty intense stuff. I don’t know half the stuff he’s talking about, though! I’d really like a one-size-fits-all scaling solution :slight_smile:

Specifically, I want to be able to scale a service like Facebook or Instagram: a follower network.

1 Like