MongoDB Best Practise

cxoncloud · August 7, 2020, 4:40pm

Hello meteorites,

What in your opinion is the best strategy to design the MongoDB collections?

Multiple collections with smaller fields normalised using referencing or
Consolidated Collections denormalised with larger fieldsets

We keep getting this error when we push our app
Query Targeting: Scanned Objects / Returned has gone above 1000
The ratio of documents scanned to returned exceeded 1000.0 on <>, which typically suggests that un-indexed queries are being run.

And when checked in the profiler, this points to a consolidated collection with larger fieldsets.

Kindly advise.

Thanks
Jay Thattai

znewsham · August 7, 2020, 5:25pm

We’ll probably need more information on this - having a single large collection isn’t really an issue (as opposed to many small ones) - in fact, Mongo has some limitations about the number of collections you can use. Is your question related to the number of documents in the collection, or the number of fields in each document (or possibly the size of each document)? In general, compound indexes can work wonders with improving your scanned to returned ratio.

copleykj · August 7, 2020, 6:16pm

The simple answer here is that Meteor’s native data layer works best with shallow data. You can choose to implement this via normalization if you wish, and I find a mixture works best.

As far as your error goes, what it is telling you is that some query that is being run is scanning a very large dataset to return a very small one. I’m not sure exactly what it means by ratio exceeding 1000 but it could very well it scans more than 1000 documents to return just one. This should be easily fixed with an index.

cxoncloud · August 7, 2020, 11:06pm

Thanks Zack and Copleykj.

Our app contains a bookings collection and also creates a separate collection for individual businesses that holds their specific bookings collection data.

Say, we have 250 customers with 4 branches each and there are about 1000 collections and each booking collection has about 100 fields as it contains not just bookings but also customer information, their feedback et Al.

Should we strip off having individual collections and work off the generic bookings collection or

Break down the bookings collection to smaller collections to reduce the 100 fields to say 25 fields each keeping it shallow?

Appreciate your time and effort to respond. Thanks.

Jay

znewsham · August 7, 2020, 11:37pm

Yeah, this is for sure not a good approach - Mongo has a hard 24000 limit for namespaces (you’re a way off yet!) but before you get close to that limit, you’ll start having performance issues with the number of collections.

There are two schools of thought for this:

move to a single tenant environment (every customer gets their own DB)
move to a single collection per “collection” and have a companyId or similar indexed. I’m a little intrigued about how you got meteor to play nice with 1 collection per customer, as it doesn’t generally like this!

Which of these you choose will likely depend on your business plan and the size of your customers - single tenant is easy to reason about, but it gets quite complicated to maintain 1 deployment per customer. Multi tenant with a single collection is easier for devops, but you have to be absolutely sure your code doesn’t leak information between customers.

Unless you choose to go single tenant, I’d just start with merging all those collections (assuming the schema is the same across them all) - no need to split out the collections horizontally 1:1 mappings between collections are rare - and so long as your documents aren’t close to the 16MB cap, there aren’t a lot of advantages to splitting out (there are some, but they’re pretty specific).

The errors you’re seeing can likely be resolved by adding appropriate indexes, if you’re using Atlas you get a profiler for free that will tell you which queries are ran most frequently, which are slow, and which do/don’t use indexes.

cxoncloud · August 7, 2020, 11:55pm

Thanks for the rational Zack.

We also have a merged single multi-tenant collection that we only use for group analytics…The profiler points this collection to be the black sheep.

We aren’t indexing the individual customers data retrieval from the single multi-tenant collection. We went ahead with the 1-collection per customer to ensure speed & data seclusion…

But, we may have a few big orders coming our way and am a little nervous about our data architecture for scalability!