Denormalisation vs reactive joins


#1

I am getting started with Meteor for an app I am working on. I have a data model which is quite normalised and for a few reasons I want to keep it normalised as much as possible. For doing reactive joins, I see two options:

The only big advantage of the second over the first is that I can use the “$in” operator in the latter, thereby reducing the number of hits to the database. I am not quite sure which is the better way. Or should I just re-think my data model and denormalise everything?
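To make the batching point concrete, here is a minimal sketch of the client-side join, using hypothetical `posts`/`authors` data in plain arrays in place of real collections. The `$in`-style lookup gathers all foreign keys first so one batched query replaces one query per post:

```javascript
// Hypothetical data; in Meteor these would come from minimongo collections.
const posts = [
  { _id: 'p1', title: 'First', authorId: 'a1' },
  { _id: 'p2', title: 'Second', authorId: 'a2' },
  { _id: 'p3', title: 'Third', authorId: 'a1' },
];
const authors = [
  { _id: 'a1', name: 'Ada' },
  { _id: 'a2', name: 'Grace' },
];

// One batched lookup -- the in-memory equivalent of
// Authors.find({ _id: { $in: authorIds } })
const authorIds = [...new Set(posts.map((p) => p.authorId))];
const matched = authors.filter((a) => authorIds.includes(a._id));

// Fold the two result sets together for display.
const byId = Object.fromEntries(matched.map((a) => [a._id, a]));
const joined = posts.map((p) => ({ ...p, author: byId[p.authorId] }));

console.log(joined[0].author.name); // prints Ada
```

The deduplicated `authorIds` list is what keeps this to one lookup instead of three.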


#2

Denormalisation is an optimisation technique that brings its own set of problems. I never use it unless I actually need to optimise.

The two options you describe are fine. Choosing the right one depends on your project. Usually, I use a client-side join (your second option) whenever possible (i.e. when I have all the required data client-side) and a server-side join (your first option) when a client-side join is not possible.

I don’t favor server-side joins because:

  1. I have the feeling it does not scale well (too much load on the server).
  2. When needing different data, I find it safer to tweak my client code rather than my publish functions.

#3

Hi Steve,

I am curious about the situations when a client-side join would not be possible. Can we not always just publish the required data to the client? Or are there scenarios where doing that is not recommended?


#4

In my experience, denormalization only works effectively for one-to-many relationships in terms of data modeling.


#5

Sometimes the required data is secret. For example, you might want to have joins on the user id field, but you usually don’t want to publish user ids to clients.
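One way to handle that case is to do the join server-side and publish only derived fields, so the foreign key itself never reaches the client. A minimal sketch with hypothetical `comments`/`users` shapes and a hypothetical helper name:

```javascript
// Hypothetical data; in a real app these would be Mongo collections.
const users = [{ _id: 'u1', displayName: 'Ada' }];
const comments = [{ _id: 'c1', text: 'Nice post!', userId: 'u1' }];

// The server resolves userId -> displayName and strips the secret field,
// so the client can render the comment without ever seeing a user id.
function publishableComments(comments, users) {
  const nameById = Object.fromEntries(users.map((u) => [u._id, u.displayName]));
  return comments.map(({ userId, ...rest }) => ({
    ...rest,
    authorName: nameById[userId],
  }));
}

const published = publishableComments(comments, users);
console.log(published[0].authorName); // prints Ada
console.log('userId' in published[0]); // prints false
```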


#6

Secret data is a valid reason for doing server-side joins. But I hope you aren’t relying on keeping user ids secret: the ids generated by Mongo are predictable and should never be assumed to be secret.

Make sure you always do proper authentication and authorization on resources, and never rely on a malicious person not knowing ids in order to protect those resources.
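The rule above can be sketched as an explicit ownership check on every access; all names here are hypothetical. Even a correctly guessed `_id` is rejected when the requesting user does not own the resource:

```javascript
// Hypothetical in-memory "collection" standing in for a Mongo collection.
const orders = [{ _id: 'o1', ownerId: 'u1', total: 42 }];

function fetchOrder(orderId, requestingUserId) {
  const order = orders.find((o) => o._id === orderId);
  // Authorization: a valid _id alone is never enough -- the caller must
  // actually own the document, otherwise the request is refused.
  if (!order || order.ownerId !== requestingUserId) {
    throw new Error('access denied');
  }
  return order;
}

console.log(fetchOrder('o1', 'u1').total); // prints 42
// fetchOrder('o1', 'u2') throws 'access denied', even though the id is valid.
```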


#7

FYI, relying on joins may make your application less scalable. Maybe not an issue for you, but I thought it should be mentioned.

So the way I think about MongoDB is that if you take MySQL and change the data model from relational to document-based, you get a lot of great features: embedded docs for speed, manageability, agile development with schema-less databases, and easier horizontal scalability because joins aren’t as important.

https://www.mongodb.org/about/introduction/


#8

Normalizing documents in MongoDB might be a bit like using Excel as a word processor.

Transitioning from Relational Databases to MongoDB - Data Models
http://blog.mongodb.org/post/72874267152/transitioning-from-relational-databases-to-mongodb


#9

@maxhodges: IMO the extracts you have copied are just marketing nonsense :-) More clarification here:


#10

This should help clear it up. It’s really quite common to denormalize data in Mongo. Without knowing the specifics of the data @ayushchd is trying to model, I can’t say much. But hearing that someone wants to keep their data as normalized as possible makes me wonder if they are applying RDBMS approaches to NoSQL tech.

With MongoDB, you may embed related data in a single structure or document. These schema are generally known as “denormalized” models, and take advantage of MongoDB’s rich documents.

Embedded data models allow applications to store related pieces of information in the same database record. As a result, applications may need to issue fewer queries and updates to complete common operations.

http://docs.mongodb.org/manual/core/data-model-design/
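As a concrete illustration of the embedded model those docs describe, here is a hypothetical example contrasting the two shapes (the data is made up; the embedded form is close to the patron/address example in MongoDB’s own documentation):

```javascript
// Normalised: two documents, two queries, joined by patronId.
const patron = { _id: 'joe', name: 'Joe Bookreader' };
const address = { patronId: 'joe', street: '123 Fake Street', city: 'Faketon' };

// Embedded (denormalised): the address lives inside the patron document,
// so a single read returns everything needed to render the page.
const embedded = {
  _id: 'joe',
  name: 'Joe Bookreader',
  addresses: [{ street: '123 Fake Street', city: 'Faketon' }],
};

console.log(embedded.addresses[0].city); // prints Faketon
```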

Distributed joins are hard to scale

http://www.odbms.org/blog/2011/02/distributed-joins-are-hard-to-scale-interview-with-dwight-merriman/


#11

I do understand the denormalised data model. However, in most real-world data there is a fair degree of relation among the data points, and denormalising everything is probably not the best way to go in all situations, given the redundancy and the increased complexity of update operations. I personally feel that any real-world app will have to perform joins to some degree or another.


#12

We are kind of mixing these models in e-commerce.
Import feeds are upserted into normalized collections, and all the data needed for the basic product view is combined into one collection of denormalized data.
Then further down the view are accessories, related products, etc., which are queried from the normalized collections, as they are too coupled and would be hard to keep in sync in that one big combined collection.
But as everything is still in development, maybe we will switch to something else later.
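The import-feed step described above can be sketched as an upsert keyed on a product identifier; the shapes and the SKU key are assumptions, and the function is the in-memory equivalent of something like `Products.upsert({ sku: row.sku }, { $set: row })`:

```javascript
// Upsert a feed row into a normalized product list keyed by SKU:
// update the existing product if present, otherwise insert it.
function upsertBySku(products, row) {
  const existing = products.find((p) => p.sku === row.sku);
  if (existing) {
    Object.assign(existing, row); // update in place
  } else {
    products.push({ ...row }); // insert a new product
  }
  return products;
}

const products = [{ sku: 'A1', price: 10 }];
upsertBySku(products, { sku: 'A1', price: 12 }); // updates A1
upsertBySku(products, { sku: 'B2', price: 5 });  // inserts B2

console.log(products.length); // prints 2
console.log(products[0].price); // prints 12
```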


#13

We are doing ecommerce in Meteor too (maybe we should compare notes, @shock). Of course we don’t have all our data in one big document, so we need to pull data from multiple subscriptions and fold things together in JavaScript when necessary. But a lot of our operational data is denormalized and we like it that way.

I think the key is to see your data through two different lenses, differentiating operational data, which is read and written by applications, from analytical data, which is used to provide business intelligence (BI).

The relational model is a beautiful thing. Schemas help avoid errors. Transactions free developers from worrying about inconsistent data. Secondary indexes let applications access data efficiently using different keys, giving developers more flexibility. All of these are good things.

But these benefits come at a cost. For example, it’s hard to spread data across many servers—something that’s required at large scale—and still provide all of these features. It can also sometimes be challenging for an application developer to map the objects in her application to relational tables. And schemas can make it difficult to deal with changing data. In situations like these, the people creating an application might instead choose to use a NoSQL solution.

So we find running on NoSQL has a lot of benefits, but then we push data from our operational system into other structures when needed, like pushing orders, payments and refunds into our accounting system (via its API). Or you might want to push data into Amazon Redshift in order to create a data warehouse which can be queried with a wide range of familiar SQL clients.

Here’s a decent article which discusses the differences at a high level.


#14

Or should I just re-think my data model and denormalise everything?

How about a hybrid approach? We built a large and complex application without those packages. We denormalize data when it makes sense. For example, at first we put Orders and LineItems into separate collections but later combined them (an order doc has an array of lineItems), because often when you want an order you need the lineItem data too. And in Shipments we persist a lot of data, like the customer address and lineItem details, which is denormalized, but it makes sense because it’s point-in-time data.
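The document shapes described above might look roughly like this; all field names and values are hypothetical. The shipment deliberately snapshots the address and item details, because the historical record must not change if the customer later edits their profile:

```javascript
// An order with its line items embedded -- one read gets both.
const order = {
  _id: 'ord_1',
  customerId: 'cust_7',
  lineItems: [
    { sku: 'BOOK-1', qty: 2, unitPrice: 15 },
    { sku: 'PEN-9', qty: 1, unitPrice: 3 },
  ],
};

// A shipment with denormalized, point-in-time copies of the address
// and item details, independent of later edits to the customer doc.
const shipment = {
  _id: 'shp_1',
  orderId: 'ord_1',
  shipTo: { name: 'Jane Doe', street: '1 Main St', city: 'Springfield' },
  items: order.lineItems.map(({ sku, qty }) => ({ sku, qty })),
};

const orderTotal = order.lineItems.reduce(
  (sum, li) => sum + li.qty * li.unitPrice,
  0
);
console.log(orderTotal); // prints 33
```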

So it’s a bit case-by-case, and you have to make a lot of decisions based on the unique nature of your application. But I’d encourage you to try to get more comfortable with the MongoDB way before imposing a more familiar SQL way of doing things on it and trying to put everything into fifth normal form.

I found docs like this to be very helpful with data modeling. Just read everything you can, and don’t be afraid to experiment and learn from experience.

http://blog.mongodb.org/post/87200945828/6-rules-of-thumb-for-mongodb-schema-design-part-1

Also, you can look at other projects to get a better idea of how they are doing things. For example, you can see the collection schemas for libreboard, a Trello-style kanban app in Meteor:
https://github.com/wekan/wekan/tree/master/collections
(maybe that’s a bit advanced, but there are also some great ideas in there)


#15

Thanks for the discussion, everyone! I will probably take a middle ground, with denormalisation in some places and joins in others.