What's wrong with just one collection?

paryguy · June 1, 2015, 1:11am

I’m new to the whole NoSQL game and it seems I can’t get my mind out of many collections and distributed data models, so I thought I’d just ask a real basic question: what’s wrong with using one collection for the bulk of my data? Say I have a ToDo app with appointment reminders and events. Why not store all tasks, events, and reminders in one collection with a type field to distinguish the data? Is there any gain? If there are no schema restrictions, can this work? Thoughts, opinions, and anecdotes welcome!

This example from eBay offers a take on my example from above. If modeling a blog, one collection is for each blog post that has child documents for comments, tags, and categories. So if you took the example above and boiled it down to a single item like an event, say a due date for example. Each event would have child documents for associated tasks and reminders on the parent doc of an event. Even if you wanted to relate the event to say a project which can have many events, couldn’t you still model that on one collection?

Reference: NoSQL Data Modeling

cstrat · June 1, 2015, 1:31am

Someone with more experience will surely have a better response.

However, from what I have read, doing this would greatly reduce the efficiency of your pubs and subs.
Something about small changes triggering larger chunks of the document being sent across the wire. Not sure if this is accurate though.

From my experience, this will also make life harder with queries as it is easier to search top level documents, rather than objects within arrays… Also updating complex documents is harder than top level documents. You end up needing to use dollar notation for arrays and it gets unnecessarily complex IMO.
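
As a rough sketch of the difference, here are the selector and modifier you would hand to an update call in each model. This is plain JavaScript that only builds the objects; the field names (tasks, done) and the ids are made up for illustration, and the "$" is MongoDB's positional operator.

```javascript
// Nested model: tasks live in an array on the event document.
function nestedTaskUpdate(eventId, taskId) {
  return {
    // Match the event AND the array element with the given task id.
    selector: { _id: eventId, "tasks._id": taskId },
    // "$" stands for the first array element matched by the selector.
    modifier: { $set: { "tasks.$.done": true } }
  };
}

// Flat model: each task is its own top-level document.
function flatTaskUpdate(taskId) {
  return {
    selector: { _id: taskId },
    modifier: { $set: { done: true } }
  };
}

console.log(JSON.stringify(nestedTaskUpdate("event1", "task7")));
console.log(JSON.stringify(flatTaskUpdate("task7")));
```

The nested version needs both ids and the positional path, and it only gets more awkward once arrays are nested inside arrays.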

paryguy · June 1, 2015, 11:43am

Thanks for the info @cstrat . Do you have any use-case where a mixed schema collection comes in handy? I guess technically if you have an events collection and one event doc doesn’t have all the fields of another, that is considered mixed schema? I’ve tried to find some reading on this within the meteor ecosystem but either the books aren’t complete or the info is out of date. Guess I should finish reading the Mongo docs! Thanks again for your time

Steve · June 1, 2015, 3:02pm

There is nothing wrong. It depends on your app. You got to have a solid reason, though, because it will add one parameter (and the associated overhead) to all your queries.
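
To make the "one extra parameter" concrete, here is a minimal sketch in plain JavaScript. The items array stands in for a single shared collection, and withType and find are hypothetical helpers, not Meteor or MongoDB APIs; find only understands simple equality.

```javascript
// Hypothetical single "items" collection, modeled as an in-memory array.
var items = [
  { _id: "1", type: "task", title: "Buy milk", done: false },
  { _id: "2", type: "event", title: "Team meeting", date: "2015-06-02" },
  { _id: "3", type: "reminder", title: "Call mom", date: "2015-06-03" },
  { _id: "4", type: "task", title: "Write report", done: true }
];

// Every selector against the shared collection has to carry the type field.
function withType(type, selector) {
  return Object.assign({ type: type }, selector || {});
}

// Equality-only matcher standing in for a real query engine.
function find(docs, selector) {
  return docs.filter(function (doc) {
    return Object.keys(selector).every(function (key) {
      return doc[key] === selector[key];
    });
  });
}

var openTasks = find(items, withType("task", { done: false }));
console.log(openTasks.map(function (doc) { return doc.title; }));
```

With separate collections the type part of the selector disappears, which is the overhead being traded against having fewer collections.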

corvid · June 1, 2015, 3:11pm

Was wondering this myself actually. I am having a problem now where I have two collections, Meteor.users and a “detached” profile state.

Meteor.users.attachSchema(new SimpleSchema({
  // basic generic stuff up here.
  profiles: {
    type: Object
  },
  "profiles.player": {
    type: String,
    allowedValues: function () {
      return Players.find().map(function (doc) {
        return doc._id;
      });
    }
  },
  "profiles.admin": {
    type: String,
    allowedValues: function () {
      return Admins.find().map(function (doc) {
        return doc._id;
      });
    }
  },
  roles: {
    type: Object,
    optional: true,
    blackbox: true
  }
}));

Then I have collections to manage admins, players, etc.

The problem seems to be it makes for a lot of redundant data because I basically have to attach the roles object to the “profile” object as well as the user.

Is it better to just attach the profile directly?

Steve · June 1, 2015, 3:57pm

Try both schemes and determine:

- the one that has less code
- the one that has more efficient queries

paryguy · June 1, 2015, 4:16pm

I guess that’s the real question, is there any benefit in keeping just one collection for your data? Performance? Load time? The MongoDB docs say to “Do joins while write, not on read…” Clarity of code though would be an issue. I can see easily getting lost in loops and object iterations trying to sort or filter data. So any benefit would need to outweigh a little more time spent coding I guess…

I’m just a little cloudy about the NoSql concept. It seems there are a lot of questions, I’ve asked one myself, about handling relationships in Mongo. I don’t get why, to be honest. If it’s because Mongo does better pulling from fewer collections, then it seems one master collection, if structured right, is the ticket as opposed to many collections replicating relational data models…

Apologies if I’m going in circles, since starting Meteor I feel like I’m just doing what works without seeing the clarity behind “why” it works or “why” something is the way it is. Like learning calculus by just learning how to use a calculator. I’ve read several articles about big companies moving to or utilizing NoSql in some way but I still can’t see the forest for the trees because it seems like every app I start has some relational data in it.

copleykj · June 1, 2015, 5:32pm

Aside from the rare case where you only need like 2 collections and they are very similar, I think there are numerous reasons not to use just one collection. One of the biggest is indexing. Say you fold numerous collections into one, each of which needs an index (possibly multiple/compound) on a different field. This may be okay when reading from the database but your write performance is going to be severely affected… On the other hand you could neglect the indexes and writing would be very fast, but as your collection grows (and at a much higher rate due to only using one collection for everything) the read performance will become almost unbearable for anything more than the simplest query.

Steve · June 1, 2015, 5:34pm

Every data model is relational. NoSql is about optimizing database access, not about changing data models. It is not a feature but a constraint that you are willing to accept in exchange for better performance.

yasinuslu · June 1, 2015, 6:09pm

We had a bidding app in the past. We had an auctions collection where we stored most things related to an auction. All the bids were stored in an array on that document. Actually, it was a hash like bids.{{bidId}}.createdAt. Performance was a disaster because mongodb has atomicity based on documents. When multiple users placed a bid at the same time, mongodb was blocking the next updates until the previous one completed. We had huge performance improvements after we split the frequently updated parts of that collection into new collections.

My advice is: don’t try to store so many things in one huge document unless you have a very good reason to do it.
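
A sketch of the two shapes, under assumptions: the post only mentions the bids.{{bidId}}.createdAt path, so the amount field, the ids, and both helper names below are invented for illustration. The code is plain JavaScript and only builds the objects that would be written.

```javascript
// Embedded model: every bid is a $set on the same auction document,
// so concurrent bids all compete for that one document.
function embeddedBidUpdate(auctionId, bidId, amount, createdAt) {
  var fields = {};
  fields["bids." + bidId] = { amount: amount, createdAt: createdAt };
  return { selector: { _id: auctionId }, modifier: { $set: fields } };
}

// Split model: each bid is its own document that points back at the auction,
// so concurrent bids become independent inserts.
function bidDocument(auctionId, bidId, amount, createdAt) {
  return { _id: bidId, auctionId: auctionId, amount: amount, createdAt: createdAt };
}

console.log(JSON.stringify(embeddedBidUpdate("auction1", "bid42", 150, 1433181000000)));
console.log(JSON.stringify(bidDocument("auction1", "bid42", 150, 1433181000000)));
```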

muaddib · June 1, 2015, 7:25pm

From a theoretical point of view there’s no difference.

You could write a program that simulates a SQL database inside mongo, in the same collection with some classic NoSQL data.

Even the difference between NoSQL and SQL exists only in our heads, the truth being that NoSQL is a superset of SQL.

Data is data, and at the end of the day it’s all serialized.

From a performance point of view it can make a big difference. But if you don’t have a high load it makes no difference.

About sending data over the wire: I thought about that, but shouldn’t Meteor send only differential changes? I don’t load all the collection, and if the code is written correctly I should have/receive only the data I need?

What changes, in my humble opinion, is readability and maintainability. Debugging is harder when you have written obscure code.

shock · June 1, 2015, 7:48pm

Just open the mongodb documentation and look at projections.
Try it on examples and you will see how much data is returned. For an array, for example, you return the whole array or 1 element from it, but then you need to know that element’s position in the array.

So yes, more collections is better, and you can always keep a list of _id’s to relate documents to each other.
Or you can place the whole document at the top level of 1 collection and also copy it directly into all the places where you need it.
Not very effective from a storage perspective, but if you need to return the whole tree at some point, feel free to go for it.
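
A minimal sketch of the "list of _id’s" approach in plain JavaScript. The projects and events arrays stand in for two collections, and eventIds, relatedSelector, and findByIds are made-up names; only the { _id: { $in: [...] } } selector shape is real MongoDB syntax.

```javascript
// Two hypothetical collections, modeled as in-memory arrays.
var projects = [{ _id: "p1", name: "Launch", eventIds: ["e1", "e3"] }];
var events = [
  { _id: "e1", title: "Kickoff" },
  { _id: "e2", title: "Unrelated" },
  { _id: "e3", title: "Release" }
];

// Selector a second query would use to fetch the related documents.
function relatedSelector(ids) {
  return { _id: { $in: ids } };
}

// In-memory stand-in for running that selector against the events collection.
function findByIds(docs, ids) {
  return docs.filter(function (doc) {
    return ids.indexOf(doc._id) !== -1;
  });
}

var related = findByIds(events, projects[0].eventIds);
console.log(JSON.stringify(relatedSelector(projects[0].eventIds)));
console.log(related.map(function (doc) { return doc.title; }));
```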

lai · June 2, 2015, 3:35am

If you decide to nest data you will find yourself having a hard time writing queries to update or delete nested fields especially if you nest more than one level.

As a suggestion, DO NOT nest any more than one level or you’re screwed. What I mean is, if you decide to nest data, have an array of basic types or objects, but don’t do an array of objects with arrays in them.

Also one con of having nested fields is that you will not get fine grained reactivity in your pub sub. When you edit a nested field, the entire top level field along with its siblings will get sent down the wire.
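
Here is a simplified model of that behaviour in plain JavaScript. It is not Meteor’s actual diffing code, just an illustration of comparing documents one top-level field at a time; changedTopLevelFields and the sample documents are made up, and removed fields are ignored to keep it short.

```javascript
// Compare two versions of a document one top-level field at a time and
// return every top-level field whose value changed, in full.
function changedTopLevelFields(oldDoc, newDoc) {
  var changed = {};
  Object.keys(newDoc).forEach(function (key) {
    if (JSON.stringify(oldDoc[key]) !== JSON.stringify(newDoc[key])) {
      changed[key] = newDoc[key];
    }
  });
  return changed;
}

var before = {
  _id: "e1",
  title: "Launch",
  tasks: [{ name: "Design", done: false }, { name: "Build", done: false }]
};
var after = {
  _id: "e1",
  title: "Launch",
  tasks: [{ name: "Design", done: false }, { name: "Build", done: true }]
};

// Only one nested boolean changed, yet the whole tasks array is the payload.
var payload = changedTopLevelFields(before, after);
console.log(JSON.stringify(payload));
```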

ralof · June 3, 2015, 6:15pm

I guess it depends a lot on whether you ever want to access the sub objects separately or not. For instance, if you have a main object and an array of comments attached to it (which in turn might be nested) and you want to display “latest comments” somewhere, then it might give you some tricky/heavy queries.
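
A sketch of the “latest comments” case with nested data, in plain JavaScript. The posts array and latestNestedComments are hypothetical; the point is that every parent document has to be walked before anything can be sorted.

```javascript
// Hypothetical posts with comments nested inside each document.
var posts = [
  { _id: "p1", comments: [
    { text: "First", createdAt: 1 },
    { text: "Third", createdAt: 3 }
  ] },
  { _id: "p2", comments: [
    { text: "Second", createdAt: 2 },
    { text: "Fourth", createdAt: 4 }
  ] }
];

// Gather every comment from every post, then sort newest first and trim.
function latestNestedComments(allPosts, limit) {
  var all = [];
  allPosts.forEach(function (post) {
    (post.comments || []).forEach(function (comment) {
      all.push({ postId: post._id, text: comment.text, createdAt: comment.createdAt });
    });
  });
  all.sort(function (a, b) { return b.createdAt - a.createdAt; });
  return all.slice(0, limit);
}

console.log(latestNestedComments(posts, 2).map(function (c) { return c.text; }));
```

With comments in their own collection, the same list would be a single find with a sort on createdAt and a limit.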

Another thing to consider is that if you do a find() it might return a lot of data that slows down publish and whatnots, so you would probably have to mess with what it returns in various situations.

So, as many have said, sometimes it’s good, sometimes not. I think that a mindset saying “I WILL manage with only one collection!” could easily lead to trouble. Combine some things and let others stay separate.

paryguy · June 8, 2015, 3:46am

I appreciate everyone’s time in responding. It’s clarified a lot on my end and helped me better structure a few pain-points. Thanks again everyone!

serkandurusoy · August 3, 2015, 6:18pm

“Every data model is relational.”

I have to respectfully disagree with this.

There are many different data modelling paradigms, only one of which is relational. But there are countless other cases where data is not relational. For example, time series data, logs, graphs (eg friendship on a social network), hierarchical models like documents, file systems etc.

You do not have to force yourself to think in terms of relationships since there are far better modelling patterns for so many different use cases.

Nosql, when compared to relational models, does have pros and cons. It is also not fair to define it as being about optimizing database access. It is far more than that (not in a better than relational sense, but in a this is very different sense), addressing big data concerns where the bulk of the data is more important than a unit of data that constitutes the bulk.

But yes, in the web/mobile apps world, the relational model seems one that’s needed more than others. Although mongodb is not built to cater to such needs, the volume and sensitivity of the kind of data most of our apps work with do not gain nor lose much by being on either kind of database since we can offload much of that work to our app design and think about such optimizations or even database changes much later down the road when we hit tens, even hundreds of thousands of users and millions of rows of data. But then it means the app is successful and it deserves fine grained tech stack refactors.

Edit: it is indeed very interesting how and why discourse has brought this thread up at the top of my new topics list, although it seems this is more than a month old. I feel like I’ve woken up the dead…

Steve · August 3, 2015, 7:37pm

Have you just said that graphs and hierarchical models are not relational?!! :- )

There is a confusion here about logical data model vs. physical data model.

So let’s clarify:

1. Any logical data model is relational, and, as far as I know, there is only one logical data modelling paradigm: relational modeling.
2. When it comes to implementing a logical model into a physical system, you always need to adapt it.
3. This adaptation is easier if your physical layer natively supports the relational paradigm. That is precisely why SQL has been designed in the first place. But, in real life, an adaptation is always required: the optimized physical model of a large SQL database is quite different from its logical model.
4. NoSQL is a physical layer that doesn’t natively support the relational paradigm. It means adapting your logical model (which, again, is relational) is not as easy as with SQL. Still, there are various ways to achieve it. The mongodb manual has a nice introduction.
5. Denormalization is an optimization technique that has been around for decades. It has been used before SQL, then with SQL, it is used now with NoSQL, and it will still be used with NotQuiteSQL after we are all dead.
6. Like any manual optimization technique, denormalization should be used only when necessary, otherwise it just means more work (until there is an automatic way of applying it).
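
To illustrate the “more work” part, here is a small sketch of denormalization in plain JavaScript. The users and posts arrays, the copied authorName field, and renameUser are all invented for the example; no database is involved.

```javascript
// Hypothetical data: each post keeps a copy of its author's name so that
// reading a post never needs a second lookup.
var users = [{ _id: "u1", name: "Ann" }];
var posts = [
  { _id: "p1", authorId: "u1", authorName: "Ann", title: "Hello" },
  { _id: "p2", authorId: "u1", authorName: "Ann", title: "Again" }
];

// The price is paid on write: renaming a user must also fix every copy.
function renameUser(userId, newName) {
  users.forEach(function (user) {
    if (user._id === userId) { user.name = newName; }
  });
  var touched = 0;
  posts.forEach(function (post) {
    if (post.authorId === userId) {
      post.authorName = newName;
      touched += 1;
    }
  });
  return touched; // number of denormalized copies that had to be rewritten
}

var rewritten = renameUser("u1", "Anna");
console.log(rewritten, posts.map(function (post) { return post.authorName; }));
```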

serkandurusoy · August 3, 2015, 8:49pm

@Steve if you are talking about differences between a nosql database (eg mongodb) and an sql database (eg mysql) then you are implicitly talking about a physical data model, wherein lies the argument “not everything is relational”.

If we are talking logical, even conceptual models, relational is still an opt-in paradigm where the business use case and the business model may or may not need relationships, eg time-series data.

Regarding graph and hierarchical models, I don’t think we should trivialize them by saying they are relational. In a relational model, the relations are mere static enumerations of the link between two data points whereas in a graph model the relations are first class dynamic building blocks of the model. Hierarchical models are another story where a relation is a mere encapsulation.

I get your point, especially with the reference to mongodb’s introduction which I believe to be no more than a marketing document which says “hey look, come back, we can do relational, too”. Don’t get me wrong, I have no beef against mongodb and I love its API, and I feel perfectly fine and comfortable designing schema structures to hold relational data models.

So we can say I mostly agree with your points 2-6. We just cannot agree on the semantics of point 1.