Speeding up filtering on associated documents

andregoldstein · May 29, 2019, 1:52pm

Hi guys, I’m running into a bit of an architecture quandry. I think I know the answer but was just interested in seeing how other people go about the same scenario.

I have a billing part of my application, this comprises of various collections, Items, Bills, Invoices, Payments etc. In the goal of trying to avoid data duplication and also errors, things such as whether the invoice is paid or late depend on using helpers to query across collections (ie: paid is when the total payments associated with this invoice equal the total price).

I originally thought this was the best way of ensuring the data doesn’t get compromised and also minimises the amount of columns needed to store in the database but am realising it does make querying/filtering this data really quite slow.

Do most people therefore to decide to store everything directly in the collection (and just set/update this data) on every insert/update? Or are there other ways of keeping things as they are but increasing the speed upon filtering?

Thanks in advance

jkuester · May 29, 2019, 4:24pm

I think the normalized data model is a good way to go. Do you use indexing on these specific fields?

If not you should try it out, it speeds up things a lot. Creating an index can be achieved using the native mongo driver which you can access using rawCollection or rawDatabase.

Another interesting factor is using hint instead of sort if you want revert the traversal from the newest to the oldest: https://docs.meteor.com/api/collections.html#Mongo-Collection-find (read the hint option)

andregoldstein · May 29, 2019, 4:56pm

Thanks so much @jkuester for the reply and the helpful links. I’m definitely using the normalized approach here. I’m not entirely sure which fields I should be indexing to be honest but I will definitely look into using more of them. I also think that perhaps reads are more common than writes here so perhaps storing more data in the database as opposed to computing them in JS might be more performant. But I worry about the risk of data getting out of sync.

jonlachlan · May 30, 2019, 1:37am

You could create an index on dueDate on Invoices as well as the invoiceId on Payments. If your query includes multiple fields, then you want a compound index on exactly the fields that will be on that query. If you have another query with another set of fields, create another index on that too.

Note that you can’t make indexes that span multiple collections, so you can’t make an index that includes dueDate on Invoices and also invoiceId on Payments.

If you want to find the sum of payment amounts (in order to compare them to the invoice amounts), you could do an aggregation on Payments, which is designed to be faster as a Mongo internal than doing the sum after a find().fetch():

async function getPaymentSums (listOfInvoiceIds) {
  return await Payments.rawCollection().aggregate([
    // in this scenario, invoiceId should be indexed
    { $match: { invoiceID: {$in: listOfInvoiceIds} } }, 
    { $group: { _id: "$invoiceId", total: { $sum: "$amount" } } }
])

Thinking of architecture, if you can run a query at the time of insert/update (as you mention) by which you can find out an invoice is paid, isn’t that what would be the result every subsequent time that query runs, and wouldn’t you want to save that information on Invoices…?

andregoldstein · May 30, 2019, 11:39am

Thanks for a brilliant reply, I’ll confess I haven’t had to spend too much time focussing on indexing my data until now as the need hasn’t really been there, I’ve never noticed any real lack of speed until performing these operations.

There are a lot of new concepts in your query above which is really interesting but I’ll clearly have to spend time examining them, $match, $group and rawCollection() even. Up til now I’ve mainly got by using simple find() queries with $and and $or

You are right, if I compute the associated values (eg: if an invoice is paid) when performing the write operation then the reads will be much faster, this may well be the way to go.

Currently a LineItem is storing the price, quantity etc. The associated Bill and its associated Invoice aren’t actually storing any information other than being referenced by association. So for an invoice to get its amount gross/net, it’s performing calculations as opposed to them being stored in its own record, if that makes sense?

jonlachlan · May 30, 2019, 4:02pm

It sounds like you’d want to add an index to the association fields, for starters, for example LineItem.rawCollection().createIndex({ invoiceId: 1 }). I do think an aggregation (sorry I forgot to reference the docs on that) might help you out, since you’re getting the sum on a field. You could at least see if it runs faster than doing a find query.

The rawCollection() is where Meteor exposes the db.collection that would be standard Mongo outside of the Meteor context (but with a bit more complexity).

You’re definitely running into a situation where the cost (i.e., computation time) of running your analysis queries could be high, as you’ve experienced. Figuring out the right fit for indexing should help. If the speed is still an issue, front-loading that computation cost by storing it somehow is another approach.

bradzo · May 31, 2019, 1:59am

Have you had a look at Grapher? https://github.com/cult-of-coders/grapher
Good intro here: Grapher - Collection Relationship Manager + GraphQL Like Data Fetcher
Theodor and his team at Cult Of Coders have done a fantastic job with that package.

Just to give you some idea of the speed, a query of mine has pulled 280 main records, while also getting info from 3 other associated collections, and assembling them into a result set that I then use as a data source for Amy Slagle’s reactive-table https://github.com/aslagle/reactive-table, and from begin query to external mongo db (not on same dev machine) to finish table render takes 0.406 seconds.

Grapher does take some time to become familiar with, but for querying data, it’s brilliant.

They also provide a link to a web app called Graphical Grapher https://github.com/Herteby/graphical-grapher, which, after you’ve set up all the links between your collections, allows point-and-click query generation and execution, right there in the browser. That allows you to see the results of the query instantly, so if you’re happy, you just copy the query and use it in your app. Very easy and a great tool.

Have a look too, into reducers in Grapher as a way of being able to calculate things during execution of the query.

Hope that helps!

andregoldstein · May 31, 2019, 1:43pm

Thanks for the heads up on Grapher. At the moment I’m using pub/sub on my table views and updating the query using various filters. I suppose in of itself using Meteor methods may be more performant (I don’t really need the reactivity on my tables.

Would methods be faster due to lack of reactivity?

Only thing there would be losing the “collection-helpers” I have on my collections, enabling me to call (for example), bill.invoice.amount in template code.