Accessing Meteor Packages from Child Process

Hello,

I have several server-side components of my application that retrieve very large, gzip'd files from the Internet on a periodic basis; decompress them; parse them with JSON.parse; organize, normalize, and enrich the data; and then insert it into MongoDB. The issue I'm having is that I frequently exceed the JavaScript maximum heap space, even with TOOL_NODE_FLAGS=--max-old-space-size=12288.

Therefore, I’ve decided to move each of these components to child processes.

const { fork } = require('child_process');
const path = require('path');

// Resolve proper application path based on dev or prod
let root = path.resolve('../../../..');
if (path.basename(root) === '.meteor') { // development
  root = path.resolve(`${root}/..`);
}

// Import data
const child = fork(`${root}/imports/path/to/script.js`);
const handleMessages = message => console.log(message);
child.on('message', handleMessages);

This works fine. The script loads and executes as a child process. However, none of my Meteor packages can be resolved. If I use import { Meteor } from 'meteor/meteor'; in script.js, I receive the following error:

/mnt/code/project/imports/path/to/script.js:1
(function (exports, require, module, __filename, __dirname) { import { Meteor } from 'meteor/meteor';
                                                              ^^^^^^

SyntaxError: Unexpected token import
    at createScript (vm.js:80:10)
    at Object.runInThisContext (vm.js:139:10)
    at Module._compile (module.js:616:28)
    at Object.Module._extensions..js (module.js:663:10)
    at Module.load (module.js:565:32)
    at tryModuleLoad (module.js:505:12)
    at Function.Module._load (module.js:497:3)
    at Function.Module.runMain (module.js:693:10)
    at startup (bootstrap_node.js:191:16)
    at bootstrap_node.js:612:3

If I change that line to const { Meteor } = require('meteor/meteor'); instead, I receive this error:

module.js:549
    throw err;
    ^

Error: Cannot find module 'meteor/meteor'
    at Function.Module._resolveFilename (module.js:547:15)
    at Function.Module._load (module.js:474:25)
    at Module.require (module.js:596:17)
    at require (internal/module.js:11:18)
    at Object.<anonymous> (/mnt/code/project/imports/path/to/script.js:1:82)
    at Module._compile (module.js:652:30)
    at Object.Module._extensions..js (module.js:663:10)
    at Module.load (module.js:565:32)
    at tryModuleLoad (module.js:505:12)
    at Function.Module._load (module.js:497:3)

I’ve tried wrapping the code with Meteor.bindEnvironment like so, but that results in no errors and no code executing at all:

// Import data
Meteor.bindEnvironment(() => {
  const handleMessages = message => console.log(message);
  const child = fork(`${root}/imports/path/to/script.js`);
  child.on('message', handleMessages);
});

Any ideas on how to resolve Meteor packages from within a child process, or any alternative method for solving my problem? Thanks!

Unfortunately, the problem you're describing is universal to all programming languages and frameworks: parsing very large files by loading them entirely into memory will never work.

Creating workers may just hide the underlying problem behind another one. If a Node subprocess can correctly process these files in a way that your existing Node process can't, and the rest of your application's memory usage is negligible, then the only reason the worker succeeds is a lucky difference in configuration (e.g., a differently sized heap).

On the one hand, I don't think the Node heap is somehow being mismanaged, but maybe it is.

On the other hand, if you can somehow rejigger your procedure to work in a streaming manner, you'll be golden. Streaming means that you process one chunk at a time, never holding the entire entity in memory, and helpfully "yield" time in your Meteor process after every chunk so that everything else it's doing doesn't get bogged down.

It's very rare that you'll have a multi-hundred-megabyte single JSON blob with no cleavable structure (i.e., no line breaks between logical "records"). If you do, you'll have a very hard time using a streaming process. My guess is that you actually have many JSON "records." Extract the JSON records to a directory, or parse each record up to the line break (or whatever the consistent record separator is), and walk them one at a time. Clean up the temporary directory every time you start the process, and use a document in Mongo to indicate the last successful processing.
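For the checkpoint, something as simple as an upserted document works. A minimal sketch; the collection and field names are hypothetical, and db is a native-driver database handle:

// One document per source file records the last successful import.
const checkpoints = db.collection('importCheckpoints');

function markProcessed(source, callback) {
  checkpoints.updateOne(
    { _id: source },
    { $set: { lastProcessedAt: new Date() } },
    { upsert: true },
    callback
  );
}

function alreadyProcessed(source, callback) {
  checkpoints.findOne({ _id: source }, (err, doc) => callback(err, !!doc));
}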

For example, suppose I have a giant multi-gigabyte JSON blob that looks like this:

[
  { "id": "record1", ...},
  { "id": "record2", ...},
  { "id": "record3", ...}
]

Observe that the line breaks helpfully fall at the record separators. You can streaming-decompress the huge, single file into the file system. Then open (but don't fully read) the JSON file, and read it one line at a time. Then parse, then enrich, then upload to your database or whatever. If you need to enrich in a way that requires awareness of all the data, use MongoDB's aggregation pipeline, which is architected for exactly this purpose.
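A rough sketch of that streaming pass in Node, assuming the decompressed data has one record per line as in the blob above; the batch size, the records collection handle (native driver), and the line-cleanup regex are just for illustration:

const fs = require('fs');
const zlib = require('zlib');
const readline = require('readline');

function importDump(gzPath, records, done) {
  // Decompress on the fly and read the result one line at a time.
  const rl = readline.createInterface({
    input: fs.createReadStream(gzPath).pipe(zlib.createGunzip()),
    crlfDelay: Infinity,
  });

  let batch = [];
  rl.on('line', line => {
    // Strip the surrounding array punctuation ('[', ']', trailing commas).
    const record = line.replace(/^\[|,$|\]$/g, '').trim();
    if (!record) return;
    batch.push(JSON.parse(record)); // only a small batch lives in memory
    if (batch.length >= 1000) {
      const toInsert = batch;
      batch = [];
      rl.pause();                   // back-pressure while the insert runs
      records.insertMany(toInsert, err => {
        if (err) return done(err);
        rl.resume();
      });
    }
  });

  rl.on('close', () => {
    if (!batch.length) return done();
    records.insertMany(batch, done); // flush the final partial batch
  });
}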

Also, to answer your actual question:

That's not going to work with a Node subprocess, ever, and subprocesses have weird interactions with Node/Fibers, on which Meteor depends. If it does work, it's due to dumb luck. So you can keep hacking at this procedure, but you will still have problems. And even if you don't have problems now, you'll eventually get a file that's too large for the child process's heap, and the problems will start again.

Thanks a lot @doctorpangloss! I don't run out of heap space when parsing one component at a time; it's the collective asynchronicity that causes me to hit limits. Basically, when the application starts for the first time, it needs to populate multiple collections from disparate sources. This all happens simultaneously (and it should), but a single process can't always handle that. I've seen the heap-space-exceeded messages after the process pegs away at 100% (on just one core) without coming anywhere close to the configured heap limit (I think the garbage collector is taking a crap in this case). And I believe each child process has its own heap, right? That's why I'm working on this approach.

I did try streaming, but the source data is not consistent about newlines (records are spread across lines inconsistently). That said, you did give me an interesting thought: I may be able to load a source JSON file into a new Mongo collection first, then modify/enrich it there.

Or I could spawn child processes that just don't use Meteor libraries at all. I think my import scripts only use Meteor libraries for the db handles and simpl-schema. I could just create a new connection with the Mongo library directly and do things the "old-fashioned way." :)
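Something like this is what I have in mind for the child script (just a sketch using the node mongodb driver's 3.x-style API: the collection name and placeholder insert are made up, and it assumes MONGO_URL is passed through to the child's environment):

const { MongoClient } = require('mongodb');

async function main() {
  // Connect with the plain driver; no Meteor packages involved.
  const client = await MongoClient.connect(process.env.MONGO_URL, {
    useNewUrlParser: true,
  });
  const records = client.db().collection('records'); // hypothetical collection

  // ...fetch, decompress, parse, and enrich here...
  await records.insertMany([{ id: 'record1' }]);     // placeholder insert

  await client.close();
  if (process.send) process.send({ done: true });    // report back to the parent
}

main().catch(err => {
  if (process.send) process.send({ error: err.message });
  process.exit(1);
});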

Thanks again.

My suggestion, then, is to not fire off too many things at once, which I think you already know. "Collective asynchronicity" is pretty elegant, though.

I think this is the right idea, in particular because it basically gives you something that works like a queue.
