On multiple collections VS embedded documents

megaleon · March 24, 2018, 2:30pm

Hi, I’ve got a Meteor app that is going to hit alpha soon. Looking back at what I have built so far, I am starting to have doubts about which would be the best approach for the app.

In short, the app is a project manager where users can create projects and then set parts, expenses and events for each project (keeping the description simple for the sake of length). I have two approaches in mind:

Multiple collections (What I have now):

Projects, parts, expenses and events each have a separate collection.
Parts, expenses and events have a ‘project’ field which references the _id of the project they belong to.
On a typical page of the app, I have a lower level template which subscribes to the chosen project, and higher level templates which subscribe to the parts, expenses, or events for that project (based on which route the user is on).

Embedded documents:

Have just one collection for the projects
A project’s parts, events and expenses are nested inside the project (e.g. project: { parts: [{...},{...}], expenses: [{...},{...}], events: [{...},{...}] }).
On each page, I will just subscribe to the chosen project and render the different data based on which route the user is on.
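For readers comparing the two layouts, here is a rough sketch of what each might look like as plain JavaScript data, with arrays standing in for collections. Only the `_id` and `project` fields come from the description above; the sample values, the `name` and `amount` fields, and the helper function names are assumptions made up for illustration.

```javascript
// Hypothetical sample data: plain arrays stand in for Mongo collections.

// Approach 1: separate collections. Children point at their project by _id.
const projects = [{ _id: "p1", name: "Boat restoration" }];
const parts = [
  { _id: "a1", project: "p1", name: "Mast" },
  { _id: "a2", project: "p1", name: "Rudder" },
];
const expenses = [{ _id: "e1", project: "p1", amount: 120 }];

// Reading a project's parts means filtering on the reference field.
function partsForProject(projectId) {
  return parts.filter((part) => part.project === projectId);
}

// Approach 2: one collection. Children are nested inside the project document.
const embeddedProjects = [
  {
    _id: "p1",
    name: "Boat restoration",
    parts: [{ name: "Mast" }, { name: "Rudder" }],
    expenses: [{ amount: 120 }],
    events: [],
  },
];

// Reading a project's parts means loading the whole project first.
function embeddedPartsForProject(projectId) {
  const project = embeddedProjects.find((p) => p._id === projectId);
  return project ? project.parts : [];
}
```

In the first layout each kind of data can be fetched on its own; in the second, everything about a project travels together.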

On one side I don’t want to fall into the ‘death by planning’ trap and would rather just ship what I have. On the other side, this seems like a common situation in Meteor projects that many developers have likely faced, so I ask: in a production environment, which approach is best suited to the usage my app would have? How would performance compare between the two?

Thanks in advance

illustreets · March 24, 2018, 6:04pm

Personally I think the multiple collections approach is the cleanest and most future-proof. It also helps with decoupling modules and packages nicely, in case you envisage some form of vertical separation later. Since Redis Oplog and Grapher have been released, you also don’t need to worry about scaling later.

We have built a large application, involving user collaboration in a geospatial context, and we went for a multiple collections approach.

vooteles · March 24, 2018, 8:36pm

Although using Mongo generally directs users towards nested documents, when using Meteor’s pub/sub for your data layer there’s an additional argument to take into account. This is described in the Meteor documentation:

The client will see a document if the document is currently in the published record set of any of its subscriptions. If multiple publications publish a document with the same _id for the same collection the documents are merged for the client. If the values of any of the top level fields conflict, the resulting value will be one of the published values, chosen arbitrarily.

Currently, when multiple subscriptions publish the same document only the top level fields are compared during the merge. This means that if the documents include different sub-fields of the same top level field, not all of them will be available on the client. We hope to lift this restriction in a future release.

So you might run into issues when your data is in a deeply nested structure and you want to expose bits and pieces of it to the user via multiple concurrent subscriptions. I think this alone is an important argument for separating data into multiple collections and avoiding nesting (within reason, of course), and for accepting the overhead that results (client-side joins, which usually mean several round trips to the server to get all the necessary data).
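To make the quoted restriction concrete, here is a simplified model of a merge that only looks at top-level fields. This is not Meteor’s actual mergebox code; `mergeTopLevel`, the `budget` field and the sample documents are hypothetical. The model lets the last published value win, whereas the documentation only says the choice is arbitrary.

```javascript
// Simplified model of a top-level-only merge. NOT Meteor's real implementation.
function mergeTopLevel(...publishedVersions) {
  const merged = {};
  for (const doc of publishedVersions) {
    for (const [field, value] of Object.entries(doc)) {
      // The whole top-level value is replaced; nested objects are never combined.
      merged[field] = value;
    }
  }
  return merged;
}

// Two hypothetical subscriptions publish different sub-fields of `budget`.
const fromSubA = { _id: "p1", budget: { planned: 1000 } };
const fromSubB = { _id: "p1", budget: { spent: 250 } };

const clientDoc = mergeTopLevel(fromSubA, fromSubB);
console.log(clientDoc);
```

In this model `clientDoc.budget` ends up with only one of the two sub-fields, which is the kind of missing data the documentation warns about.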

But then again, I’m not the first poor soul trying to fit a relational peg into a non-relational hole, so take it for what it’s worth.

mikkelking · May 19, 2018, 1:37pm

Separate collections mean that you continually need to look up related data in the database. These operations are cheap, but you have to write a whole bunch more code, and you are repeatedly doing this fetching.

Separate collections are better for reporting, and possibly for scaling, especially if the sub-collections could get large enough to bloat the master document.

There are a couple of things you can do to address this.

Store the sub-records in the main document, and also write them to the separate collection
Make sure that only a certain number of sub-records are kept in the main document (i.e. recent history)

This way you get the best of both worlds: the main document is complete and has a cache of the recent activity, and the related collection has all the data, so you can do reporting across the full data set.
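A minimal sketch of that hybrid idea, using in-memory arrays rather than a real database. The names `RECENT_LIMIT`, `recentExpenses`, `expensesCollection` and `addExpense` are assumptions for illustration. In MongoDB itself, the capping step could be done with `$push` combined with the `$each` and `$slice` modifiers.

```javascript
// Hypothetical cap on how many sub-records the main document keeps.
const RECENT_LIMIT = 3;

// Stands in for the separate collection that holds the full history.
const expensesCollection = [];

// Stands in for the main project document with its embedded cache.
const project = { _id: "p1", recentExpenses: [] };

function addExpense(project, expense) {
  // 1. Always write the full record to the separate collection.
  expensesCollection.push({ ...expense, project: project._id });

  // 2. Also embed it in the project, keeping only the most recent entries.
  project.recentExpenses.push(expense);
  if (project.recentExpenses.length > RECENT_LIMIT) {
    project.recentExpenses = project.recentExpenses.slice(-RECENT_LIMIT);
  }
}

// Add five expenses: 10, 20, 30, 40, 50.
for (let amount = 10; amount <= 50; amount += 10) {
  addExpense(project, { amount });
}
```

After the loop the separate collection holds all five expenses for reporting, while the project document carries only the latest three.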

doctorpangloss · May 25, 2018, 6:47pm

Performance will almost certainly be better with “embedded documents” (i.e., denormalized data), as that’s what Mongo was optimized for.

From the point of view of your expected usage, it’s probably not going to matter.

The impact of bugs will outweigh this architectural decision, and if you authored something that works right now, you don’t want to change things and introduce regressions.

Generally, you should denormalize as much as possible, and hew as closely as you can to a 1:1 correspondence between a Mongo document and what the client needs in order to render at any one time.

copleykj · May 25, 2018, 7:18pm

I’m sorry, but this is all patently false as far as MongoDB use within Meteor applications is concerned.

doctorpangloss · May 25, 2018, 7:37pm

I think you’re right about getting awesome Meteor features like a table that reactively adds another row when a different user, in a different place, adds a thing that corresponds to that row.

But his question was about performance. Maybe that question was premature, but I just want to answer it.

I guess it would be more accurate to say that:

The closer your Meteor application, as a whole, operates like a cache, the faster it will be.

At once, that’s a tautology and a little too general to mean anything. Based on what I know of Meteor’s and Mongo’s architecture, the statement above concretely means a “1:1 correspondence between a Mongo document and what the client needs in order to render at any one time.”

It’s a little pesky, running your application like a cache. But it may illuminate why people report that packages like redis-oplog help their applications “run faster.” The architectural changes they make to use the package (or the flexibility they don’t know they’re losing) make their application operate more like a cache.

I think this is more of a consensus opinion than it may seem. From MongoDB:

Favor embedding unless there is a compelling reason not to

https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-3

They give a lot of examples of situations when not to embed, which are very helpful. But if you read the statement closely, “1:1 correspondence between a Mongo document and what the client needs in order to render at any one time,” you’ll realize that almost all the reasons not to embed in MongoDB don’t apply.

For example, while the actual expenses array that @megaleon has might “grow without bound,” (as per the MongoDB article for reasons not to embed), the number of expenses that a human being can read on a page is in fact limited to probably the small tens.

In this example, the thing the client needs to render is a page with tens of expenses. So make a document with tens of expenses, give it to your end user exactly, and render it very quickly. You’re moving cache invalidation (the idea that one metaphorical action like updating an expense may require updating dozens of “documents” corresponding to views at once) to the server, where it will be the most performant and easiest to ensure to be correct.

If someone needs to review thousands of expenses at once, the only way they do that is in a table with little scrollbars. So make a document that corresponds exactly to that table, with the columns it needs. It’s hard to write a spreadsheet in a website! Nothing about your architecture can make that easier. If your end-user needs this, consider giving them an Excel spreadsheet (which oftentimes, is what they ask for anyway).
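One way to picture the “document per view” idea is a function that turns a long list of expenses into one document per page. This is only a sketch under assumptions: `buildPageDocuments`, `PAGE_SIZE`, and the `_id` format are hypothetical, and the page size is kept tiny so the example is easy to follow.

```javascript
// Hypothetical page size; the post suggests something in the small tens.
const PAGE_SIZE = 2;

// Builds one "view document" per page of expenses for a project.
function buildPageDocuments(projectId, expenses, pageSize = PAGE_SIZE) {
  const pages = [];
  for (let start = 0; start < expenses.length; start += pageSize) {
    const pageNumber = pages.length;
    pages.push({
      _id: `${projectId}:expenses:${pageNumber}`, // made-up _id scheme
      project: projectId,
      page: pageNumber,
      // Exactly the rows the client needs to render this page.
      expenses: expenses.slice(start, start + pageSize),
    });
  }
  return pages;
}

const allExpenses = [10, 20, 30, 40, 50].map((amount) => ({ amount }));
const pageDocs = buildPageDocuments("p1", allExpenses);
```

The trade-off described above still applies: whenever an expense changes, the server has to rebuild every page document that contains it.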

copleykj · May 25, 2018, 8:56pm

Exactly, and the way Meteor’s data layer is designed is precisely that compelling reason. For example…

Meteor’s mergebox only diffs on top level fields, so data may be sent over the wire that doesn’t need to be.
Top level diffing can lead to unexpectedly missing fields in documents when the same document is published subsequently with differing deeply nested fields.
You can’t paginate nested fields efficiently, which means publishing ever-growing document fields.
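The first point can be illustrated with a simplified model of top-level diffing. Again, this is not Meteor’s real code: `changedTopLevelFields` and the sample project are hypothetical, and comparing via `JSON.stringify` is just a shortcut for the sketch.

```javascript
// Simplified model: report every top-level field whose value changed.
// NOT Meteor's actual mergebox implementation.
function changedTopLevelFields(oldDoc, newDoc) {
  const changed = {};
  for (const field of Object.keys(newDoc)) {
    if (JSON.stringify(oldDoc[field]) !== JSON.stringify(newDoc[field])) {
      // The entire top-level value is included, not just the nested change.
      changed[field] = newDoc[field];
    }
  }
  return changed;
}

const before = {
  _id: "p1",
  name: "Boat restoration",
  expenses: [{ amount: 10 }, { amount: 20 }, { amount: 30 }],
};

// Only the second expense is edited.
const after = {
  ...before,
  expenses: [{ amount: 10 }, { amount: 25 }, { amount: 30 }],
};

const payload = changedTopLevelFields(before, after);
console.log(payload);
```

In this model a single edited expense causes the whole `expenses` array to be resent, and the cost grows with the length of the array.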

In Meteor, if you use Mongo in a completely denormalized manner such as you describe, the performance of your application will suffer. You may not notice it on the server, which will probably hum along just fine, but you will notice it in the part that matters: between the server and your users’ perception.

If you’re not using Meteor’s data layer, then please, by all means use embedding. It will probably work out just fine.

doctorpangloss · May 25, 2018, 9:36pm

I guess it all comes down to what your profiling actually says. Mergebox is kind of a small piece of the puzzle…

copleykj · May 25, 2018, 9:43pm

You’re right, what do I know