CONVERSATION: In place of LiveQuery, how should Meteor build webserver tier subscriptions (i.e. for GraphQL/Relay)?

I have always wondered why Meteor chose to tail the oplog to monitor mutations for subscriptions, when something programmed more generically on the webserver tier could have been more easily expanded to other databases. Was it absolutely necessary to use the oplog, or is it simply more performant?

I see that the Postgres implementation uses triggers and stored procedures (again, on the database tier). It appears that for some queries it might even be impossible, or simply too costly, to maintain enough state on the webserver tier to properly capture new subscription data from mutations.

Or perhaps the challenge was that, when dealing with webservers in a cluster, each server would have to propagate all updates for subscribed queries to all other webservers. Or perhaps it was something more generally related to keeping a bunch of webservers in sync; after all, each would have to store the definition of every subscription in order to filter mutations.


In general, I feel like not enough people truly understand the ramifications and challenges at stake with this endeavor. And I think now, more than ever, is an important time to understand them.

HERE’S WHY:
GraphQL/Relay doesn’t provide subscriptions out of the gate. Coming from a Meteor perspective (if you are like me), at first I just assumed all queries were automatically kept up to date. That’s far from the case. GraphQL and Relay don’t currently even support subscriptions. When they eventually do, it will be up to you, on a query-by-query basis, to optimize your GraphQL types and resolvers so they are automatically kept up to date with the database. That’s a lot of work, and many of us won’t know where to start.

Reading between the lines, what becomes clear is this: Facebook found it impossible to make an entire database generically reactive at scale! So, instead, they likely only keep a few things up to date, and probably use unique means of doing so for each query/feature. This is the opposite of Meteor’s current behavior, where the entire database can generically be subscribed to!


So, facing a future where we will likely have to build our own GraphQL types and resolvers on the server, and/or where Meteor will provide abstractions to make this easier, I think it’s very important that we understand how we would go about this.

So, is it possible, solely on the webserver tier, to monitor queries and their datasets and determine when new data enters these subscribed datasets? Or do we always need the entire state of the database, and therefore need to maintain such subscription descriptions on the database tier? Or perhaps it’s just some queries that need access to the entire state.

How much of this can we build generically on the webserver tier irrespective of the database actually used? That way we can benefit from lots of code reuse between database implementations.

Here are some example queries to get your minds rolling:

Persons.find({ }, {sort: {updatedAt: -1}, limit: 10})
Persons.find({ age: { $gte: 21 }}, {sort: {updatedAt: -1}, limit: 10})
Persons.find({ $or: [ { age: { $gte: 21 } }, { sex: 'f' } ] }, {sort: {updatedAt: -1}, limit: 10})

It seems to me that in these examples, when a subscription is made for one of these queries, you perform the query against the database once, keep the 10 rows in webserver memory, and then filter future inserts/updates for rows that fit within that 10-row dataset.
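To make the idea concrete, here is a minimal sketch of what keeping that dataset on the webserver could look like for the second query. The matchesSelector helper and the subscription shape are made up purely for illustration; nothing here is an existing API.

// Hypothetical sketch only: keep the 10-row result set in memory and test
// incoming inserts against the subscription's selector, sort, and limit.
const subscription = {
  selector: { age: { $gte: 21 } },
  sort: { updatedAt: -1 },
  limit: 10,
  docs: [],                          // filled once by the initial query
};

function onInsert(doc) {
  if (!matchesSelector(doc, subscription.selector)) return;     // matchesSelector is hypothetical
  subscription.docs.push(doc);
  subscription.docs.sort((a, b) => b.updatedAt - a.updatedAt);  // apply { updatedAt: -1 }
  const fellOut = subscription.docs.splice(subscription.limit); // enforce the limit of 10
  // publish the added doc (and whatever fell out of the window) to subscribed clients
}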

What challenges do we face in accomplishing this?

The biggest one I’m currently seeing is that updates don’t pass the entire state of the mutated row through the webserver; what is passed along is only, for example, that you want to increment the age to 22 for a given ID. That’s all the webserver sees, not the values of all fields of the given row. That means the webserver has to perform a database read for every mutation to truly see whether the mutated row matches a subscribed query.
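Roughly, and purely as an illustration (the collection object, the activeSubscriptions registry, and matchesSelector are all hypothetical, not Meteor internals), the webserver would have to do something like:

// Rough sketch. All the webserver sees is something like
// { _id: 'abc', modifier: { $inc: { age: 1 } } }.
async function onUpdate(collection, _id, modifier) {
  await collection.updateOne({ _id }, modifier);      // apply the write
  const fullDoc = await collection.findOne({ _id });  // the extra read per mutation
  for (const sub of activeSubscriptions) {            // hypothetical subscription registry
    if (matchesSelector(fullDoc, sub.selector)) {
      // the mutated row now belongs in this subscription's dataset; publish it
    }
  }
}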

Perhaps the above example is automatically solved if you assume that users on the client can only update rows already in a subscription’s dataset. That’s often true, but it’s not true when you make batch updates directly from the server (e.g. within a Meteor method). In that case there are no rows already subscribed to. This seems to be the main reason a database-specific oplog approach was taken: it’s a lot more performant to execute all this additional querying on mutations within the database; you just have to send information about your subscription queries to the database and store it there to facilitate this decision-making. The same goes for the stored procedures and triggers used with Postgres.


Lastly, I think if we can collectively get an idea of how to implement this stuff, we can come up with a simplified interface on top of GraphQL that application developers can use.

Basically, what I imagine is the ability to “drop down” and use GraphQL for cases where you want to customize for performance and the like, but you can begin by prototyping your app via a far simpler interface that probably looks identical to what you have now. In addition, there is perhaps a third option in the middle, i.e. a simplified GraphQL that isn’t fully automatic like what we have now; perhaps this is just specifying a schema, similar to SimpleSchema or what you use in the Mongoose ODM that’s popular in the Mongo world outside of Meteor.

As far as subscriptions go, there should be a straightforward mechanism/switch to turn on certain queries to be subscription based. Perhaps the current mongo LiveQuery approach can optionally still be used behind the scenes (in conjunction with GraphQL), but you can turn it off and manually build more scalable subscriptions for the individual features/queries that really need it.
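Purely as a strawman for what that switch could look like (nothing like this exists today; liveQuery and the strategy option are invented for illustration):

// Strawman API, nothing like this exists: opt individual queries into
// reactivity and choose the backing strategy per query.
liveQuery(
  Persons.find({ age: { $gte: 21 } }, { sort: { updatedAt: -1 }, limit: 10 }),
  { strategy: 'generic' } // or 'oplog', 'triggers', or a hand-built subscription
);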

And as for Relay, or whatever Relay-inspired API we decide to give React, we provide a more opinionated and consequently simplified interface. For the most part, the API should simply be a query (or set of queries) paired to a React component. But we can get to this later, once we have the pubsub stuff down.

1 Like

It isn’t necessary, but it is WAY more performant to use the oplog to get updates. Prior to using the oplog, Meteor used poll-and-diff: LiveQuery would set up a loop to monitor and check for updated data by running the query again. The problem is that re-running each of those queries doesn’t scale to more than 50 users or so per CPU. Consuming the oplog for updates means that you can keep the set of queries you care about and look for matching data as it comes in.
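For reference, poll-and-diff boils down to something like this (a simplified sketch, not the actual LiveQuery code; runQuery stands in for re-executing the subscribed query):

// Simplified poll-and-diff sketch: re-run the query on an interval and diff
// the new result set against the previous one.
let lastResults = [];
setInterval(async () => {
  const current = await runQuery(); // hypothetical: re-executes the subscribed query
  const added   = current.filter(d => !lastResults.some(p => p._id === d._id));
  const removed = lastResults.filter(d => !current.some(c => c._id === d._id));
  // push added/removed (and changed fields) down to subscribed clients here
  lastResults = current;
}, 10 * 1000); // classic Meteor polled roughly every 10 seconds per observed query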

I believe LiveQuery is the piece sitting in the web tier, tracking the data that will be sent down to clients. I think this is the part you are saying you would want to expand more? Honestly, I understand how it all works, but I don’t have a ton of hands-on experience with the actual code. Maybe @avital or @nim would have a better idea here?

My assumption has been that “poll and diff” is unacceptable. So what true “push” solutions do we have that occur at the point of contact with new data, i.e. mutations?

Can we simply intercept mutations on the webserver tier before they hit the db? What are we getting from oplog and triggers that we can’t get purely in the webserver?

1 Like

@arunoda did some early experiments as a precursor to oplog tailing; pretty sure he used Redis. Perhaps he could chime in here a bit too.

I think broadcasting invalidations from the app server is the only way you can support arbitrary data backends; then you don’t need to build a live query implementation for every data store.
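A minimal sketch of that idea, in-process here for simplicity; across a cluster the same message would travel over something like Redis pub/sub. The writeAndInvalidate wrapper and the consumer are hypothetical:

// Minimal sketch of app-server invalidation broadcasting (nothing db-specific).
const { EventEmitter } = require('events');
const invalidations = new EventEmitter();

function writeAndInvalidate(collectionName, performWrite) {
  performWrite();                                          // the actual write, any backend
  invalidations.emit('write', { collection: collectionName });
}

// The dumbest possible consumer: refresh every live query touching that collection.
invalidations.on('write', ({ collection }) => {
  // refreshQueriesFor(collection);  (hypothetical)
});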

3 Likes

@sashko I guess the big question then is: why didn’t Meteor do that to begin with? What did they see in the oplog that was so beneficial that they passed up the opportunity to more easily plug in other databases?

I mean, you went straight for using triggers with Postgres, right? I believe it was you, or that you had some part in it. What do you see as the big obstacle, the big downside, of intercepting (and then broadcasting) writes at the webserver layer, pre-database?

There isn’t a single big obstacle, but it requires a lot of changes to some underlying workings. Meteor was built on the most direct path to achieving the current set of features, and that required tying some stuff to Mongo quite tightly. That got us to where we are now very quickly, but it means it’s a bit more work to architect something good around a different backend.

2 Likes

What about what I mentioned regarding updates? You basically gotta perform a read in addition to each update write to get missing data to see if the updated row matches any subscriptions.

Also why did you or whoever take the triggers-based approach with Postgres?

I think that was @slava, and that’s just the approach we wanted to validate, since it was similar to some community packages.

Also why did you or whoever take the triggers-based approach with Postgres?

Because that’s the only one we knew worked? A custom fork of PostgreSQL (like PipelineDB, which, btw, is not persistent and uses code from Redis bloom filters for aggregates) doesn’t work for people who just want to connect to an existing, running db.

Hey @slava. So then you are unsure about how to approach this purely on the webserver tier, i.e. without triggers?

Without triggers, how would you approach updates? Perform an additional read to get the entire row and then see if it matches any subscriptions?

How would you approach removes? I.e., if you remove a row in a subscription’s dataset, do you just re-run the subscription query and then update its dataset?

@faceyspacey You seem to be alluding to data invalidation similar to Chet Corcos’ any-db project when you speak about moving the data mutation checking to the webserver tier.

https://github.com/ccorcos/meteor-any-db

This type of approach requires the developer to maintain a mapping of updates to each query used by the application. By using a LiveQuery style interface, this burden is removed from the developer.

You single out the use of triggers as a pain point for building a LiveQuery interface for Postgres without reason: Postgres is well designed to handle triggers in these ways. The Slony replication tool used triggers in versions of Postgres from before replication had been implemented in the core.

The problem of knowing when to invalidate a query’s result set is much greater. In my pg-live-select and mysql-live-select packages, the invalidation conditions can be specified manually, but they can become very complicated depending on join conditions. The Postgres preview package available from MDG does not yet implement any such solution and simply refreshes each query that references any table that changes.

1 Like

Well, I feel like the current DDP is lacking in some ways, as it is quite coupled to add/change/remove.
Maybe some experiment with a protocol that has more of a “mutate” action and transfers an Immutable.js diff could be interesting.

And you would have to maintain one big Redux store server side, representing the part of the DB which is currently being watched, with listeners set for every unique publication. And a server-side query would need to be handled kinda like a temporary subscription.
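Very hand-wavy, but a plain reducer fed by generic “mutate” actions could look like this (the action shape is invented for the sketch, no real library involved):

// Hand-wavy sketch of a server-side store fed by generic 'mutate' actions.
function dbReducer(state = {}, action) {
  if (action.type !== 'mutate') return state;
  const { collection, _id, fields } = action;        // invented action shape
  const docs = { ...(state[collection] || {}) };
  docs[_id] = { ...(docs[_id] || {}), ...fields };   // merge just the changed fields
  return { ...state, [collection]: docs };
}
// Each unique publication would register a listener and diff only the slice it watches.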

If Datomic can kinda do it, there should be some tricks to keep memory consumption on normal levels :smiley:

Enough science fiction :smiley:

1 Like

I think you’re missing my overall point. Of course triggers are a “better” way, a better way for a specific single database. But what I’m referring to is making headway on a database-agnostic solution. I think if we can make something generic enough, we’ll make the greatest impact here, especially if it also has a “drop down” lower-level API as previously mentioned. I think this will have the greatest impact on GraphQL usage, on efficiently interfacing with the view layer (Relay etc.), and on Meteor in general, since we would then support basically every db.

…As initially described, one of the main goals is to offer multiple levels of interface:

  • A generic less optimized one (built purely on the webserver layer)
  • and an interface offering complete customization (perhaps where you could use triggers if you like)

In the latter case, someone like yourself or Chet could replace some of the more generic stuff with something based, for example, on triggers, or, in the case of Mongo, on the oplog. Perhaps only for the subset of queries that really need the optimization.

The real main idea is that real apps can create high-performance solutions for the minimal number of places that really need them, while relying on a “good enough” approach for everything else. However, if you wanted, for example, to optimize every possible query with something based on triggers, great. In that case, maybe it doesn’t even need to participate in the system I’m suggesting, but more than likely it will benefit from being plugged into it (i.e. its ecosystem, or various abstractions you also need).

…So triggers all the way, man. Trigger away. But let’s redirect the conversation back to the webserver tier. Can you describe what you know of the “mapping of updates to each query”? I would love to gain greater awareness of the various specifics of implementing this wholly on the webserver tier. Any info you can provide, possibly about Chet’s solution, would be much appreciated. @ccorcos, if you’re out there, I would love to hear from you as well. I’ll look into meteor-any-db too.

@faceyspacey just so I understand, you’re brainstorming ways to bring reactivity to GraphQL? And since GraphQL is db agnostic, you’ll need a reactivity solution that’s db agnostic as well.

2 Likes

See the last line in the example on the readme of the any-db package:

AnyDb.refresh 'messages', R.propEq('roomId', roomId)

The developer must send a refresh signal to each reactive query when an update to the data occurs.

The difficulty of maintaining accurate, efficient refresh signals grows with the complexity of an app.

Yea, that I guess is the point: through heuristics and lots of “tricks” we can optimize the F*#K out of this. And of course it’s possible. Good point about Datomic.

As far as the communication protocol goes, e.g. DDP, I’m less worried about this. After all, we’re using GraphQL, which doesn’t prescribe the client-server protocol but will have many plug-and-play solutions for websockets. So we have to focus on, basically, the idiosyncrasies of each type of query and the type of state that must be maintained to properly intercept when updates are needed. Removes are different from updates. Updates are different from inserts. Can we avoid having to make double the db requests, e.g. when we have to read an entire document on updates? Perhaps not. Here’s a crazy optimization to think about: if enough writes are coming in (i.e. multiple per second), it would make more sense in that case to fall back to a poll-and-diff strategy! Or at least throttle any re-reads we must make. My guess is that @ccorcos’ package handles all the base questions I’m asking. So then we gotta take it up a notch and optimize at this annoyingly fine-grained level, to squeeze out as much performance as possible for as many different unique usage scenarios as possible.
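For instance, the throttling idea could be as dumb as coalescing mutations into at most one re-read per second per query (a sketch; refreshQuery is a stand-in for re-running the subscribed query and diffing its dataset):

// Sketch: under a heavy write stream, coalesce many mutations into one
// refresh of the subscribed query instead of re-reading on every write.
function scheduleRefresh(sub) {
  if (sub.refreshPending) return;        // a refresh is already queued; coalesce
  sub.refreshPending = true;
  setTimeout(async () => {
    sub.refreshPending = false;
    await refreshQuery(sub);             // effectively poll-and-diff, but only on demand
  }, 1000);                              // at most one re-read per second per query
}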

Considering all that, you quickly realize that using the oplog or triggers is by far better. That’s why MDG didn’t take the more generic approach. The generic approach is just so shitty. But it’s not shitty if there are “escape hatches” to customize. For example, take RethinkDB. Using their subscription API will definitely be way more performant at this: it’s native subscriptions coming straight out of the database! They have one last patch they must make before they can play nicely with Meteor. Surely it will be very tempting to make a direct RethinkDB + GraphQL implementation; it’s ultimately what Meteor/MDG wish they had all along. What I think we should do instead is build this generic system, but with very specific APIs that allow you to plug into RethinkDB’s built-in subscription API. Ideally we apply the same thinking to SQL triggers and the Mongo oplog as well.

This system should both be able to easily take advantage of built-in database features that make subscriptions easier, and provide a full-featured generic API for databases that have no such support. It’s a serious undertaking. That’s why we start with the generic approach. If we can get that accomplished, many motivated developers dedicated to various DBs will join the cause just to get first-class support for their DB, which will ultimately lead to contributions to the overall system as well. It’s a similar concept to building something as generic as GraphQL to begin with; we are just focused on the generic subscriptions aspect.

What I imagine is a sort of ORM around the different types of mutations that can be made: removes, updates, inserts, etc. When a remove is made for a row in a currently subscribed dataset, a callback is fired that re-runs the entire initial query to make sure the dataset goes from having 9 documents back to 10 (i.e. the limit set). That callback will obviously be different per database. We could even take what @ccorcos made and turn it into a more generic abstraction like this (while also addressing more heuristics). Then we publish the API and let developers add support for other databases. At the very least, the documentation would serve as a great analysis of the problem domain.

There are a lot of different cases here. Certain queries will likely be addressed very differently than others, and that’s really what I’m getting at in this whole thread: I wanna find out the scope of the problem. Is it just different solutions for inserts vs. updates vs. removes vs. upserts, or is the scope more expansive, a matrix of different queries x inserts/updates/removes/upserts? At first, I’m not worried about clusters of multiple webservers each handling different clients and subscriptions; the assumption is one webserver handling all clients. The cluster discussion is about cached subscription definitions plus mutations broadcast to all servers in a cluster using Redis, and that’s not what this conversation is about, not yet. It’s just about discovering different heuristics around monitoring different queries.
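For the remove case just described, the callback could look something like this (a sketch; fetchTopN stands in for whatever db-specific query runner a driver would plug in, and the sub shape is the same invented one as above):

// Sketch of the per-mutation callback idea; fetchTopN is the db-specific part.
async function onRemove(sub, removedId) {
  const inDataset = sub.docs.some(d => d._id === removedId);
  if (!inDataset) return;                  // removal doesn't affect this subscription
  sub.docs = await fetchTopN(sub.selector, sub.sort, sub.limit); // refill from 9 back to 10
  // diff the old vs. new dataset and publish the difference to subscribed clients
}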

I see… so our goal would be to automate that. Do differing queries lead to a near-infinite set of solutions, or are there, say, 5 to 100 different manageable heuristics we can implement?

The goal of this thread is essentially to codify that set of heuristics.

Exactly. …I’d like to solve the subscription side of GraphQL. I’d like to build a deeper implementation than what GraphQL will eventually provide for subscriptions (they will only provide a very basic interface), but one generic enough that multiple DBs can make use of it.

In short, we should build subscriptions for SQL without using triggers. I’d like to discover all the heuristics to solving this (that’s what this conversation is first and foremost about). Priority after that is abstracting it into something more generic that multiple DBs can use.

1 Like

Very true. This is an issue I ran into when I built findashindig.com. I’m constantly changing my mind and trying to figure all this stuff out, but right now I’m starting to think that the big, complex ORM approach is the wrong way to go. The AnyDb stuff was interesting, and to make that refreshing stuff more automated and easy, I tried thinking of some sort of model to compute the refresh, but failed. The GraphQL “fat query” is interesting, but also has its shortcomings.

Lately, though, I’ve really been starting to think that Datomic and Samza are the right ways to think about all this. If you think of queries just as transformations over streams of data, then you can just create a pipeline that handles all the reactivity for you. The key here is that this stream of data shouldn’t consist of events, but facts. This is an interesting talk about Samza.
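In that framing, a live query is just a fold (reducer) over a stream of facts. A toy sketch, using the “adults, newest first, limit 10” query from earlier; the fact shape is invented for illustration:

// Toy sketch: each fact asserts or retracts a row, and the query result is
// just a reduction over the stream of facts.
function topTenAdults(resultSet, fact) {
  const next = resultSet.filter(d => d._id !== fact.doc._id); // retract any older version
  if (fact.assert && fact.doc.age >= 21) next.push(fact.doc);
  next.sort((a, b) => b.updatedAt - a.updatedAt);
  return next.slice(0, 10);
}
// factStream.reduce(topTenAdults, []) works the same on the server or in a client cache.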

Also, I think that caches on web clients ought to work the same way: simply a bunch of reducers over a stream of data from the websocket. Falcor and Relay have some interesting ways of thinking about caching data, but they’re quite complicated and it just doesn’t feel right to me.

3 Likes