Optimizing joining two collections into a third collection

maxfi · March 3, 2016, 4:08pm

Hi all! I’m hoping that someone might be able to point me in the right direction on how best to essentially JOIN two collections into a new collection.

I’m looking to do this in the fastest possible way, whilst keeping resource utilisation in check, and am happy to entertain using a different stack to achieve this if required.

What I have so far:

In MongoDB (compose.io):

Collection | Foreign Key | Documents | Size on Disk
-----------|-------------|-----------|-------------
Invoices   | -           | 118882    | 556 MB
LineItems  | `invoiceId` | 1384724   | 1.47 GB

These two collections are automatically created every few hours by importing two CSV files into MongoDB using mongoimport ... --drop --headerline --type csv --file FILENAME.csv

Meteor micro service (server only) to do the processing:

Invoices = new Mongo.Collection 'Invoices'
LineItems = new Mongo.Collection 'LineItems'
ProcessedInvoices = new Mongo.Collection 'ProcessedInvoices'

process = ->
  startTime = new Date()

  # Ensure indexes every time (could have dropped/inserted DB since last run)
  LineItems._ensureIndex invoiceId:1

  # Add LineItems to Invoice and save as ProcessedInvoice
  Invoices.find().forEach (invoice) ->
    invoice.lineItems = LineItems.find(invoiceId: invoice._id).fetch()
    ProcessedInvoices.insert invoice

  # Finish
  console.log 'time', (new Date()) - startTime

process()

result:

{
  _id: "invoice-1",
  someProp: "",
  lineItems: [
    {
      _id: "line-item-1",
      invoiceId: "invoice-1",
      someProp: ""
    },
    ...
  ]
}

This works but is PAINFULLY slow! Read: it takes hours!

Questions I have:

Could I optimise this by coding it differently?
Am I using the wrong stack (meteor/mongodb)?

Bonus point questions:

Can I make this run reactively when there are changes in the Invoices and LineItems collections? Note: these two collections will only ever be modified with the automated mongoimport ... --drop --headerline --type csv --file FILENAME.csv.
What should I be looking for in the compose.io Monitoring section to see if I’m overutilising the database?

robfallows · March 3, 2016, 4:17pm

You should read up on the MongoDB Aggregation Pipeline and refer to the $lookup operation, which effectively gives you joins (MongoDB 3.2).

Basically, let the database do the work.

maxfi · March 3, 2016, 7:22pm

Thanks @robfallows. I was under the impression that MongoDB 3.2 wasn’t supported by Meteor? Otherwise that’s definitely how I would have approached this problem.

maxfi · March 24, 2016, 4:31am

Just following up on my question, MongoDB 3.2 does work with Meteor 1.2.1, at least I’ve tested the Compose.io offering. Check out their blog post about connecting. You’ll also want to check out their post about connecting to the oplog.

TLDR;

Use one host (Mongos/Haproxy) in the URI. Still provides high availability! E.g, mongodb://<user>:<password>@aws-us-east-1-portal.7.dblayer.com:10764/examples?ssl=true
Don’t use the oplog (as per the article it might not even be worth it for your app).
MongoDB 3.2 $lookup is awesome alongside the rest of the aggregation framework. And super performant!
- Use meteorhacks:aggregate rather than messing around with native-mongodb-driver.

Hope this helps someone!