Best-practices for denormalization?

I would like to ask the community what your best-practice approaches for denormalizing data in Meteor are.

Just a sample: In many social apps, you would have your users directly linked to target objects, like posts or comments. So if you are displaying a post, you would typically display the corresponding user’s name as well. For performance reasons, in such 1:n relationships it would be better to store the username directly in the post’s document, to save an additional round-trip to the users collection. AFAIR, this is also the proposed way mentioned in Mongo’s data modelling documentation.
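To make the trade-off concrete, here is a minimal sketch (plain JS, field names are made up) of what such a denormalized post document looks like: the post keeps the normalized `userId` reference *and* a denormalized copy of the username, so rendering touches only one document.

```javascript
// Hypothetical example of a denormalized post document.
const user = { _id: "u1", username: "alice" };

const post = {
  _id: "p1",
  title: "Hello world",
  userId: user._id,          // normalized reference (for joins and updates)
  authorName: user.username  // denormalized copy (for cheap reads)
};

// Rendering the post now needs no second query to the users collection:
console.log(`${post.title} by ${post.authorName}`);
```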

However, this also means that you would have to update each and every collection once the user changes their username. This wouldn’t concern me if Mongo had a concept like transactions, but since it does not, it (at least potentially) introduces the risk of corrupt data. This could be avoided by storing only the user ids and using something like reywood:publish-composite to publish joins instead. I also have no idea how Mongo actually performs if you have to update lots of data in several collections. Or, more generally speaking, how Mongo performs as collections grow larger, which is quite likely if you store lots of data denormalized and thus have a lot of redundancy in the data model.
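The fan-out problem described above can be sketched in plain JS (no Mongo, in-memory arrays, hypothetical field names). In Mongo this would be one multi-update per collection, e.g. `Posts.update({ userId }, { $set: { authorName } }, { multi: true })`; the point of the sketch is that without transactions, a crash halfway through the loop leaves stale copies behind.

```javascript
// Update every denormalized copy of a username across several
// "collections" (here: plain arrays of documents).
function propagateUsernameChange(collections, userId, newUsername) {
  let updated = 0;
  for (const docs of collections) {
    for (const doc of docs) {
      if (doc.userId === userId && doc.authorName !== newUsername) {
        doc.authorName = newUsername;
        updated++;
      }
    }
  }
  // Without transactions, a failure mid-loop leaves some copies stale.
  return updated;
}

const posts = [{ _id: "p1", userId: "u1", authorName: "alice" }];
const comments = [
  { _id: "c1", userId: "u1", authorName: "alice" },
  { _id: "c2", userId: "u2", authorName: "bob" }
];

propagateUsernameChange([posts, comments], "u1", "alicia");
```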

So, is there any best-practice discussion / tutorial / blog out there explaining good strategies for denormalization esp. in combination with Meteor? I’ve already read the recommendations of the MongoDB documentation, but did not find good Meteor-related recommendations so far.

2 Likes

Good question, and a follow-up on Useraccounts: Is it possible to use the username as the Mongo _id? I guess so. Besides the technical approach, don’t forget to consider the ease of development when comparing the options.

Also consider testing, you will have to mock the join out to do unit testing.

Yes, it’s some kind of follow-up. But the question is indeed broader. If it’s just the username to consider, it may be easier to store it on the target object right away. But what if I also want to display an avatar next to each user? OK, I could store the avatar’s URL in the target document as well. But where does this end? Having a copy of a whole user record in every object that belongs to the user does not seem like a clever idea to me.

Yes, ease of development is always tempting. But I also have to consider the overall architecture, from both a performance and a maintainability perspective.

Good point about testing. Haven’t thought about that. Since Velocity Jasmine is broken for 1.1, I cannot use it at the moment.

You can take a look at small, simple templates (widgets); React has a name for this concept. So create a little template with username, avatar, etcetera. Then you include that template in your other packages.

The template uses Template.autorun to load the user data. Then you have what you want, but you get lots of loads. Still, it might work without much hassle.

Yes, I am already using this widget-like approach. But the question of how to subscribe to the data still remains. If the widget does this (as you describe), it will cause lots of additional requests. But you could also do it on the higher-level object (the one that includes the widget), either using a reactive join or by storing the data directly with the target object. So, lots of options, but my main question was how the community typically handles this, and why.

I have looked at some of the larger open-source projects (like Telescope), and most of them seem to store at least some of the user-related info in denormalized form. One even stored extended info like the avatar URL there. I was interested in the rationale that led people to these decisions and in their experiences with either approach (normalized vs. denormalized) regarding performance, server load, and maintainability.

Normalized vs. denormalized storage of data is a concern in both the SQL and NoSQL worlds, regardless of the presence of a transactional facility (which you can also implement at the application level if the database does not support it), and it usually boils down to the read/write ratio.

A denormalization-based design may strategically be preferred if:

  • the number of reads of a piece of foreign data is expected to be significantly higher than the number of writes (updates) to that data
  • the state of the foreign data at a given point in time (e.g. the price of an item in an order) should be immutable
  • deletion of the foreign data is allowed but should not cascade to its references
  • the frequency of updates to the foreign data, regardless of its reads, is expected to be low
  • the foreign data does not possess significant transactional value
  • the foreign data is not critical to data/referential integrity

This is in fact not much different from a strategy based on caching (and cache invalidation).

In the case of avatars, if:

  • the avatar is always visible wherever user information is visible
  • user information is frequently displayed throughout the app

I’d go for denormalization, because the avatar would be read hundreds or thousands of times per update, and frankly, it would not do significant harm if the denormalized URL were out of sync and the image were broken.

4 Likes

What I do with avatars is merge them into other documents at publish time. For example, when publishing a Posts collection, I would dynamically add the corresponding author’s avatar to each post document.
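The merge logic itself, separated from Meteor’s publish machinery, might look like the sketch below (field names are hypothetical). Inside an actual publication, this shape would typically be produced in the callbacks of a collection observer before handing the document to the subscriber.

```javascript
// Attach the author's avatar URL to a post document at publish time.
function mergeAvatar(post, usersById) {
  const author = usersById[post.userId];
  return {
    ...post,
    authorAvatarUrl: author ? author.avatarUrl : null
  };
}

const usersById = { u1: { avatarUrl: "https://cdn.example.com/u1.png" } };

// The document that goes over the wire carries the merged avatar:
const published = mergeAvatar({ _id: "p1", userId: "u1" }, usersById);
```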

1 Like

Personally, in the case of avatars, I’m a fan of the “just name the avatar file after the username or ID of the user; this way you’re left with only a hasAvatar flag” approach. I mean, we don’t usually keep the history of avatars anyway, so we don’t need the old files kept on the server; we can simply replace them.
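A minimal sketch of that scheme (the base URL and `.png` extension are made-up examples): because the filename is derived from the user ID, no URL needs to be stored at all, only the flag.

```javascript
// Derive the avatar URL from the user ID; store only a hasAvatar flag.
const AVATAR_BASE = "https://cdn.example.com/avatars";

function avatarUrl(user) {
  return user.hasAvatar ? `${AVATAR_BASE}/${user._id}.png` : null;
}

console.log(avatarUrl({ _id: "u1", hasAvatar: true }));  // derived URL
console.log(avatarUrl({ _id: "u2", hasAvatar: false })); // no avatar set
```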

I’m all for denormalizing in cases which, like @serkandurusoy stated, won’t do harm to the consistency of the data. I’m doing it even with SQL. I denormalize not only things like the username or user ID, but also things that may change much more frequently, like the timestamp and user data of the most recent comment, sometimes even the text of the recent comment. That said, I can’t call that a best practice; it’s more of a personal one.

1 Like

Thanks for this good reply!

@Steve and @brajt: Thanks for your replies, too. Naming the avatar after the ID is a very good idea. I wouldn’t use the username, though, since I would have to change the filename whenever the username changes. And if the file is located on S3, this would take some time.

Check out PeerDB. This package was made to declaratively/reactively keep denormalized data in sync. You just define that you want to pull some fields (like username, avatar) from another collection, and it will keep things in sync.

2 Likes

Damn, I wish there were a PeerDB-style solution compatible with SimpleSchema and Collection2.

Decoupled from @waldgeist’s use case: What strategies/packages do you guys currently use for denormalizing data?
Are there any good ones that work with SimpleSchema & Collection2? Or do you avoid denormalization and use something like https://atmospherejs.com/peerlibrary/server-autorun?

This PeerDB is really cool. Unfortunately, I would also need support for schemata (currently, I am using SimpleSchema and Collection2). Are there any plans to add this in the near future?

There are tickets. Feel free to discuss it there and explain exactly what use case you need, i.e. what kind of schema support.

BTW, it is possible to use SimpleSchema with PeerDB, I believe. It is described in one ticket. But again, depends on what you mean by SimpleSchema support.

So, it would be great if you would explain your use case in a ticket in the repository.

1 Like

One option is to avoid denormalization entirely and to use Apollo to do joins across MongoDB collections.
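The idea behind the resolver-level join can be sketched without the Apollo server itself (the collection contents and the in-memory `db` object are made up; the resolver map shape follows GraphQL conventions): posts store only a `userId`, and the `author` field is resolved on demand per query.

```javascript
// In-memory stand-in for MongoDB collections.
const db = {
  users: [{ _id: "u1", username: "alice" }],
  posts: [{ _id: "p1", title: "Hello", userId: "u1" }]
};

// GraphQL-style resolver map: the join lives in the resolver,
// so no denormalized author data is stored on the post.
const resolvers = {
  Post: {
    author: (post) => db.users.find((u) => u._id === post.userId)
  }
};

const post = db.posts[0];
const author = resolvers.Post.author(post);
```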

1 Like

It is a broad question, and the broad answer is “it depends on how much your data changes over time”. Denormalizing a username into a collection of posts is reasonable, because a username usually either does not change, or changes very little. You need to ask yourself: Do I want my users to be able to change their username whenever they want to? If the answer is yes, then your job as a backend engineer just got a little bit harder, because now you’re trying to denormalize time-dependent data, which means you will have to update it. In most cases, usernames change very rarely, so it’s not a huge problem.

But that’s username data. We understand, more or less, how username data behaves. It changes very little over time.

But what if you tried to denormalize something like the author’s total number of likes? This makes a lot less sense. The author’s total number of likes is going to change very frequently over time. It’s probably better in the long run to just query the Author collection and get the up-to-date data every time you need it.
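A small sketch (plain JS, hypothetical fields) of why a fast-changing value is a poor denormalization candidate: the copy must be rewritten on every single like, and any missed update leaves it silently wrong.

```javascript
// Source of truth plus a denormalized copy of a fast-changing counter.
const author = { _id: "u1", likes: 0 };
const posts = [{ userId: "u1", authorLikes: 0 }]; // denormalized copy

function addLike(userId) {
  author.likes++; // update the source of truth
  // Every denormalized copy must be touched on EVERY like:
  for (const p of posts) {
    if (p.userId === userId) p.authorLikes = author.likes;
  }
}

for (let i = 0; i < 3; i++) addLike("u1");
```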

The answer depends on your data. I usually try to normalize as much as possible. I only denormalize if I know that the denormalized data will NOT CHANGE over the lifetime of the parent object.

Thanks, will do that! :slight_smile:

Do you have any sources that can expand on your comment about Apollo?