Some learnings about working with large data

macrozone · February 25, 2020, 7:28pm

I wanted to share some learnings from an older application we have on Meteor.

Don’t use cursor.observe or cursor.observeChanges to trigger actions

It’s tempting to use observe or observeChanges to trigger some action (e.g. sending an email or doing some denormalization).

E.g. if you have a collection Activities that contains the activities of your users, you might want to do:

Activities.find().observe({
  added: () => {
     // doing something when entry is added
  }, 
  changed: () => {
     // doing something when entry is changed
  }
})

This would be nice, because it would also trigger when data is added to your database from some other source. But:

It will trigger on every running instance of your application. If you do something like that, designate one of your instances as a “worker”, e.g. via an environment variable, and let only that instance register these triggers (and cronjobs or similar).
It will trigger added for every document in the selection whenever that instance starts.
When the observed set is big, starting up takes a lot of time and memory, and the instance never gives that memory back. I think that’s the biggest design flaw here.
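
A minimal sketch of the worker idea above, assuming a hypothetical environment variable WORKER_INSTANCE and a hypothetical helper startActivityTriggers (neither is part of Meteor). It also skips the initial added calls with a flag, which assumes that observe delivers those initial calls before it returns, as it does on a classic Meteor server:

```javascript
// Hypothetical helper, not a Meteor API.
// `cursor` is anything with an observe() method, e.g. Activities.find().
// `env` is usually process.env; WORKER_INSTANCE is an assumed variable name.
function startActivityTriggers(cursor, env, onNewActivity) {
  if (env.WORKER_INSTANCE !== 'true') {
    return null; // not the worker: register nothing
  }

  let initializing = true;
  const handle = cursor.observe({
    added(doc) {
      // Ignore the documents that already existed when the observer started.
      if (initializing) return;
      onNewActivity(doc);
    },
  });
  initializing = false;
  return handle;
}
```

On the server you would call it once at startup, e.g. startActivityTriggers(Activities.find(), process.env, sendEmailBasedOnActivityAdded). This only addresses the first two points; the startup time and memory cost of a big observed set remain.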

Use hooks or methods and dedicated functions for actions and denormalization

Instead of observe, use hooks (matb33:collection-hooks) or, better, be very specific about what happens when something is mutated. E.g. put everything that should additionally run when a new entry is added inside the method or mutation that adds this entry to the db. Even better, put the whole body of your method inside another function and work only with these functions. Never use a Collection directly outside of these functions.

E.g.

const addNewActivity = (data) => {
   Activities.insert(data);
   sendEmailBasedOnActivityAdded(data);
   updateSomeDenormalizedStats(data);
   // whatever else should happen
}

These functions are also easier to test and reason about. They are part of your business logic. You can import them into your methods or call them from anywhere on your server.
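
To illustrate the testing point, here is one possible way to structure such a function so it can be exercised without a database. The factory makeAddNewActivity and the injected names insert, sendEmail and updateStats are hypothetical; in the app they would wrap Activities.insert and the helpers above:

```javascript
// Hypothetical factory: the side effects are passed in, so a test can stub them.
function makeAddNewActivity({ insert, sendEmail, updateStats }) {
  return function addNewActivity(data) {
    const id = insert(data); // in the app: (data) => Activities.insert(data)
    sendEmail(data);
    updateStats(data);
    return id;
  };
}

// In a test, plain functions stand in for the collection and the mailer.
const calls = [];
const addNewActivity = makeAddNewActivity({
  insert: (data) => {
    calls.push(`insert:${data.type}`);
    return 'id1';
  },
  sendEmail: (data) => calls.push(`email:${data.type}`),
  updateStats: (data) => calls.push(`stats:${data.type}`),
});
const newId = addNewActivity({ type: 'login' });
```

The stubbed run records the three steps in order in calls, so a test can check both what happened and in which sequence.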

When using hooks or calling these functions inside of your methods, everything will run on whatever instance received the request. This has the advantage that you don’t need a dedicated worker for that. On the other hand, if you trigger some CPU-intensive (CPU-bound) action, the CPU of this instance will be blocked and it will probably be unable to process other requests. So be careful with that.

If you run migrations or generate CPU-intensive exports from your data, you should do that in a cronjob and only on a worker instance that won’t receive public requests.

Be careful when iterating over large datasets (e.g. in migrations)

When you need to iterate over a collection, you might write:

Activities.find().forEach(
   activity => {
     // do some complicated stuff 
   }
)

This is not very memory consuming, because Activities.find() won’t return an array, but a cursor instead. BUT:

Because most functions in Meteor are sync and blocking, these iterations run one at a time and in sequence, so the whole loop might take a long time to finish. If the “do some complicated stuff” part takes a lot of time, the cursor might time out and you will receive: MongoError: Cursor not found, cursor id: <someId> (see https://stackoverflow.com/questions/46423442/mongodb-cursor-not-found)

In most cases it’s better to do:

const activities = Activities.find().fetch(); // will return the full array

// alternatively use something like
// const activityIds = Activities.find().map(a => a._id); 
// if you are only interested in the id of the documents and want to save memory

activities.forEach(
   activity => {
     // do some complicated stuff 
   }
)

This will take more memory, in particular if you do a fetch, but it will be more stable (and sometimes even faster), especially if your server is busy.
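
If the full fetch is too large, a middle ground between the two variants is to load only the ids up front and then fetch the documents in small batches, so that no cursor stays open while the slow work runs. A sketch, where processInBatches, loadBatch and the batch size are assumptions for illustration:

```javascript
// Hypothetical helper: walks a list of ids in fixed-size batches.
// loadBatch(ids) should return the documents for these ids, in the app e.g.
//   (ids) => Activities.find({ _id: { $in: ids } }).fetch()
function processInBatches(ids, batchSize, loadBatch, handle) {
  if (!(batchSize > 0)) {
    throw new Error('batchSize must be a positive number');
  }
  let processed = 0;
  for (let start = 0; start < ids.length; start += batchSize) {
    const batchIds = ids.slice(start, start + batchSize);
    // Each batch is a short query that is fully fetched before the slow work starts.
    loadBatch(batchIds).forEach((doc) => {
      handle(doc);
      processed += 1;
    });
  }
  return processed;
}
```

With the id list from the commented-out line above, the call could look like processInBatches(activityIds, 500, loadBatch, processActivity). Documents deleted between the two steps are simply missing from their batch, so handle never sees them.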

Bonus: logging progress of migrations, cronjobs, etc.

If you have something that takes some time to execute (e.g. a migration or an export), add a good log function right from the start that tracks the progress. Everything that can go wrong, will go wrong, so you will be thankful if you know where exactly it went wrong.

If you don’t want too many log entries, you can also log only every n processed entries. E.g. use something like this:


const timeElapsed = {};
const progressLog = (message, index, count, func) => {
  if (index === 0) {
    console.log(`${message} | started, ${count} entries total`);
    timeElapsed[message] = new Date().getTime();
  }

  func();

  // only log every 1 percent
  const steps = Math.ceil(count / 100);
  if ((index + 1) % steps === 0) {
    const progress = (index + 1) / count;
    const now = new Date().getTime();
    const elapsed = Math.round((now - timeElapsed[message]) / 1000);
    const estimated = (elapsed / (index + 1)) * count;
    const left = Math.round(estimated - elapsed);
    const percent = Math.round(progress * 100);
    const eta = new Date(now + left * 1000).toLocaleTimeString();
    console.log(
      `${message} | ${percent}% (${index + 1} / ${count}), elapsed ${elapsed}s, ~${left}s left until ${eta}`
    );
  }
  // log "finished" last, after the final progress line
  if (index >= count - 1) {
    console.log(`${message} | finished`);
  }
};

export default progressLog;


// and then

const activities = Activities.find().fetch();
activities.forEach((activity, index) => {
  progressLog(`processing activities`, index, activities.length, () => {
    processActivity(activity);
  });
});

It does not handle error cases, but this could be added easily.
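
One way to add that, sketched here as a separate hypothetical wrapper (runSafely is not part of the code above): it catches whatever func throws, records which entry failed, and lets the loop continue.

```javascript
// Hypothetical wrapper: runs func and records failures instead of aborting the loop.
function runSafely(label, index, func, failures) {
  try {
    func();
    return true;
  } catch (error) {
    // Non-Error values can be thrown too, so fall back to String().
    const reason = error instanceof Error ? error.message : String(error);
    failures.push({ index, reason });
    console.error(`${label} | entry ${index} failed: ${reason}`);
    return false;
  }
}

// Usage with plain data; in the loop above, the call to runSafely
// would go inside the callback that is passed to progressLog.
const failures = [];
[1, 2, 3].forEach((value, index) => {
  runSafely('processing numbers', index, () => {
    if (value === 2) throw new Error('cannot process 2');
  }, failures);
});
```

After the loop, failures tells you exactly which entries to look at or retry.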