Iterate over a large mongo collection

csiszi · February 11, 2020, 6:45am

Hi,

what’s the best way to iterate over a large mongodb collection? I have two approaches in mind:

const ids = Collection
  .find(selector, { fields: { id: 1 } })
  .map(item => item._id)
const chunks = _.chunk(ids, 100)
chunks.forEach(chunk => {
  const batch = Collection.find({ _id: { $in: chunk } }).fetch()
  doSomethingWith(batch)
})

Or:

const limit = 100
const count = Collection.find(selector).count()
let skip = 0
while (skip <= (count + limit)) {
  const batch = Collection.find(selector, { sort: _id: 1, limit, skip }).fetch()
  doSomethingWith(batch)
  skip += limit
}

I think I read somewhere that skip in mongo is expensive. Is there another way I’m not thinking about? Maybe the cursor has some method for this?

tomsp · February 11, 2020, 4:23pm

I think it depends what you wanna do with your docs. Maybe you can describe the use case some more and the bottlenecks you have encountered given your initial approach?

deank · February 11, 2020, 8:51pm

I would suggest to use mongodb aggregation framework

csiszi · February 12, 2020, 7:48am

The collection items are reservation data. I have to call multiple external APIs for every item (create an invoice, send the data to the government, etc). I haven’t reached any bottlenecks yet, I’m just curios what’s the most efficient way to iterate over the items in a large collection without hurting the server too much.

AFAIK aggregation framework can “reduce” the collection data, e.g. when I want to sum some fields.

I ended up with this helper function:

import _ from 'lodash'

/**
 * Query a collection in batches
 * @param {Collection} Collection meteor+grapher collection
 * @param {*} args args passed to createQuery
 * @param {*} batchSize number of items per batch
 * @yields {*[]} documents
 */
export default function* chunkQueryGenerator(
  Collection,
  args,
  batchSize = 100
) {
  // include arguments like $filter and $options
  const specialArgs = _.pickBy(args, (v, key) => key.startsWith('$'))
  // options for the ids query
  const idsOptions = { ...specialArgs, _id: 1 }
  const ids = Collection.createQuery(idsOptions)
    .fetch()
    .map(i => i._id)
    // create chunks from the array of ids
  const chunks = _.chunk(ids, batchSize)
  for (const chunk of chunks) {
    const $filters = {
      $and: [{ ...(args.$filters || {}) }, { _id: { $in: chunk } }],
    }
    const chunkArgs = { ...args, $filters }
    yield Collection.createQuery(chunkArgs).fetch()
  }
}

When I want to do something with a large collection of items, I can simply:

const itemsGenerator = chunkQueryGenerator(Items, { foo: 'bar' })
for (const items of itemsGenerator) {
  // items.length is 100
  doSomethingWith(items)
}

mrrk · February 12, 2020, 11:11am

The ids approach is a little unconventional but it’s probably your only option if the query you’re going to run on the docs is going to modify the order in which they’d be returned from the parent query.

Apart from that, in terms of what would be faster, it’d depend on the size of your collection, db indexes, how much of a factor network latency would be, etc. if you have an environment that looks something like what you think your prod environment will you could just run some tests…

But also it’s often a useful approach to just put good monitoring and logging in place and keep an eye on things over time. Early optimisation can be a real time killer.

drollins · February 12, 2020, 10:42pm

How about mongo db change stream - process each as an event fired when they reach where ever in their life cycle u need to do this work.

dimadev · February 17, 2020, 12:01am

Hi! We have a platform and use Amazon SQS when we want to process each item of a collection without hurting the server. Let me know if you are interested and I can tell you our approach!