Publishing aggregations is REAL!

Hi there!

I am happy to announce my package that allows you to publish aggregations.
Check it out here: https://atmospherejs.com/kschingiz/publish-aggregations

I am now preparing a beautiful example analytics application that uses this package.

I would be happy to receive your feedback, PRs, and issues on GitHub.

publish-aggregations documentation:

This package lets you publish aggregations with a pipeline and options in MeteorJS.

Requirements

  1. Meteor 1.7+

  2. Mongo 3.6+

Why do we need this

Because sometimes we need a more flexible solution than the original Meteor publications. For example, when developing a real-time analytics system, or when you need reactive joins, all of these can be achieved with aggregations, but aggregations are not supported in Meteor publications.

This package relies heavily on MongoDB change streams and will NOT work properly if your database or deployment doesn’t support them.

From MongoDB docs:

Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a single collection, a database, or an entire deployment, and immediately react to them. Because change streams use the aggregation framework, applications can also filter for specific changes or transform the notifications at will.

API

You can read about change streams here https://docs.mongodb.com/manual/changeStreams/
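For a quick feel of the underlying primitive, here is a minimal sketch of watching a collection through the raw driver handle that Meteor exposes (independent of this package; the collection name and event handling are illustrative):

import { Meteor } from "meteor/meteor";
import { Mongo } from "meteor/mongo";

const Items = new Mongo.Collection("items");

Meteor.startup(function () {
  // watch() takes an aggregation pipeline that filters change *events*;
  // the changed document itself arrives under `fullDocument`.
  const stream = Items.rawCollection().watch(
    [{ $match: { operationType: "insert" } }],
    { fullDocument: "updateLookup" }
  );
  stream.on("change", (event) => console.log(event.fullDocument));
});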

publish example:


import { Meteor } from "meteor/meteor";
import PublishAggregations from "meteor/kschingiz:publish-aggregations";

// MyCollection is assumed to be a Mongo.Collection defined elsewhere.
Meteor.publish("example", function () {
  return new PublishAggregations(MyCollection, [
    { $addFields: { newField: "this is my new field!" } }
  ]);
});
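On the client, the subscription is consumed like any other publication. A minimal sketch, assuming MyCollection is also defined on the client:

import { Meteor } from "meteor/meteor";

Meteor.subscribe("example", {
  onReady() {
    // The published documents arrive with the aggregated field attached.
    console.log(MyCollection.find().fetch()); // each doc includes `newField`
  }
});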

TODO

  1. Test coverage (HELP NEEDED)

  2. Deploy example app

Contributions

Contributions are welcome; feel free to create issues on GitHub.

LICENSE

MIT


This is huge and could finally replace the oplog with change streams. This package could get lost in the forum quickly, as examples and proper documentation aren’t provided along with the package. Great work!


Wow, that’s pretty neat! Thanks for this. Have you tested it for performance? What would be the impact on server / db performance?

@nickgudumac I am working on an example application with benchmarks, because I also want to know how it will impact DB performance. It will be ready as soon as possible; any help would be great.


Thanks for your feedback. This package was completed just today; I am now working on good documentation and example applications. They will be ready as soon as possible.


Nice work @kschingiz!

Other than the added benefit of having aggregations, how does this change overall performance? Are change streams more efficient than the oplog?

Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a collection and immediately react to them. Since the oplog only exists in a replica set, we cannot use change streams on a standalone MongoDB server, so we have to set up a replica set with a minimum of 3 MongoDB servers to see the real performance. So, yes! Change streams are more efficient than the oplog.

Can you update this to support regular MongoDB queries?


I think it’s not possible, because this package uses change streams, which only support the aggregation framework, not regular queries.
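That said, many simple queries can be expressed as a leading $match stage instead. A hypothetical sketch (the publication name and field are made up; pipelines here are written against plain document fields, which the package prefixes before handing them to watch()):

// Roughly equivalent to publishing MyCollection.find({ status: "active" }),
// expressed as an aggregation pipeline instead of a regular query.
Meteor.publish("activeItems", function () {
  return new PublishAggregations(MyCollection, [
    { $match: { status: "active" } }
  ]);
});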

@crownglasses this package doesn’t do long-polling or any other heavy operations, but it does use change streams. I have seen in the MongoDB JIRA that opening multiple change stream cursors can degrade your DB: https://jira.mongodb.org/browse/SERVER-32946

So I don’t know the real impact; we need to benchmark it. That will be done as soon as possible.


Maybe yes. But my current solution opens a change stream cursor FOR EACH publication, and this can be a problem, since some users reported that opening 10+ change stream cursors degrades your DB performance. I think we need to come up with a more efficient solution where all publications share just one global change stream cursor (no idea how to do this yet). Look at my reply to crownglasses; there is a link to an issue on the MongoDB JIRA.
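One possible shape for such a shared cursor, as a rough sketch (not part of the package; the handler contract and names are made up):

// One change stream per collection, fanned out to every publication.
const streams = new Map(); // collection name -> { stream, handlers }

function watchShared(rawCollection, handler) {
  let entry = streams.get(rawCollection.collectionName);
  if (!entry) {
    const stream = rawCollection.watch([], { fullDocument: "updateLookup" });
    entry = { stream, handlers: new Set() };
    stream.on("change", (event) => {
      // Multiplex each change event to all registered handlers.
      for (const h of entry.handlers) h(event);
    });
    streams.set(rawCollection.collectionName, entry);
  }
  entry.handlers.add(handler);
  return () => entry.handlers.delete(handler); // stop observing
}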

Hi @kschingiz, what are the limitations of this package? Change streams don’t support all pipeline operators, and your package passes the requested pipeline to the watch method with a prefix; if the pipeline contains (for example) a $group stage, will it still work? Similarly, your implementation adds documents returned by the aggregate command to the collection’s publication handle; will aggregations appear in client-side collection.find() calls? Finally, how did you get around the issue of change streams not notifying you of documents which no longer match your query but still exist within the collection?

Hi @znewsham, sorry for the late reply.

  1. If change streams don’t support all pipeline operators, then neither does my package, because it depends strongly on their functionality.
  2. Yes, your aggregations appear in collection.find() calls. I have checked this.
  3. I didn’t; I have not found a way around it.

I would continue to develop this package, but it seems to have no real production usage scenario, because AFAIK opening 100+ change streams can seriously affect your app/DB performance. It was just a good experiment, in which I proved to myself that MongoDB doesn’t want to build something like RethinkDB did. This package is no longer maintained. I am now working on another, more promising package. Thank you all.

@kschingiz fair enough. I’ve been playing with the same for a while; my implementation currently uses a single change stream per collection and supports any pipeline operators, but it still has to observe all changes on that collection and then basically rerun the aggregation, so it’s “kinda reactive”. I got pretty excited when you posted this package, as I thought you’d cracked it!

But what if you have 100+ aggregations? Does it rerun all of them? I fear that if you batch-insert 1000+ docs into the collection, your DB will fall down faster than a meteor falling to Earth.
I am just wondering why, in 2018, we still don’t have a DB with true real-time features like RethinkDB has.

That is the drawback. It’s handled trivially by throttling the rerun method to one run per 10 seconds, and we compare existing docs to returned docs, so the client only sees docs that have actually changed.
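A rough sketch of that throttle-and-diff loop (names are illustrative; `sub` is the publication handle):

// Rerun the aggregation at most once per interval, publishing only the diff.
function makeRerunner(sub, rawCollection, pipeline, intervalMs = 10000) {
  const published = new Map(); // _id string -> doc last sent to the client
  let last = 0;
  let timer = null;

  async function rerun() {
    last = Date.now();
    const docs = await rawCollection.aggregate(pipeline).toArray();
    const seen = new Set();
    for (const doc of docs) {
      const id = String(doc._id);
      seen.add(id);
      const prev = published.get(id);
      if (!prev) {
        sub.added(rawCollection.collectionName, doc._id, doc);
      } else if (JSON.stringify(prev) !== JSON.stringify(doc)) {
        sub.changed(rawCollection.collectionName, doc._id, doc);
      }
      published.set(id, doc);
    }
    for (const [id, prev] of published) {
      if (!seen.has(id)) {
        sub.removed(rawCollection.collectionName, prev._id);
        published.delete(id);
      }
    }
  }

  // Throttle: run now if the interval has passed, else schedule one run.
  return function requestRerun() {
    const wait = intervalMs - (Date.now() - last);
    if (wait <= 0) rerun();
    else if (!timer) timer = setTimeout(() => { timer = null; rerun(); }, wait);
  };
}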

There is a trivial optimization if the first stage of a pipeline is a $match and you can ensure that the query is over immutable fields: you can then use the $or of all those matches as the limit of the change stream. But it’s an optimization that only works in some cases (fortunately it works in most cases I’m interested in).
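A sketch of that optimization (the selector shapes and the fullDocument prefixing are assumptions about how the stream would be limited):

// Two example pipelines whose first stage is a $match over immutable fields.
const pipelines = [
  [{ $match: { type: "order" } } /*, ...rest of pipeline A */],
  [{ $match: { type: "invoice" } } /*, ...rest of pipeline B */],
];

// In a change stream event the document lives under `fullDocument`,
// so each selector is re-keyed before it can limit the stream.
const prefixed = pipelines
  .filter((p) => p[0] && p[0].$match)
  .map((p) =>
    Object.fromEntries(
      Object.entries(p[0].$match).map(([k, v]) => ["fullDocument." + k, v])
    )
  );

// One stream that only sees changes relevant to at least one pipeline.
const stream = rawCollection.watch(
  [{ $match: { $or: prefixed } }],
  { fullDocument: "updateLookup" }
);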

Another optimization would be to fetch the documents initially returned by the first-stage $match and limit the change stream to them; this doesn’t cover the cases where new documents are added to the set. I’m not convinced there is a generic, optimal solution here. If Mongo did support this, it would need to do something similar internally; it would just abstract the complexity out one layer.