Processing large CSVs in Meteor JS with PapaParse

I’ve seen lots of discussions about PapaParse and large files, but not one that solves my situation. Appreciate any advice you have.

### Goals

  1. User uploads a CSV from the client and then creates a field map (columns in their CSV mapped to system fields)
  2. The file is uploaded to Amazon S3
  3. A process is kicked off on the server to grab the file from S3, parse it, and process each row.

The whole process works, until I get to about 20,000 rows. Then I get:

FATAL ERROR: invalid table size Allocation failed - process out of memory

It seems like the memory crash happens when I try to grab the file from S3 and then store it locally via fs.writeFileSync. I think I can stream the file from S3 via s3.getObject(params).createReadStream() but that doesn’t return rows, just chunks.

Here’s my code as it stands. I would like to skip the fs.writeFileSync() step and just read from S3, but when I try that via PapaParse I get an empty result ([]), and BabyParse only works with local files.

Can I get rows from the chunks being returned by s3.getObject(params).createReadStream() and parse those?

S3.aws.getObject( getS3params, Meteor.bindEnvironment( function ( error, response ) {
  if ( error ) {
    console.log( 'getObject error:' );
    console.log( error );
  } else {
    console.log( 'Got S3 object' );

    let s3file      = response.Body,
        csvFile     = 'path/to/file.csv',
        writeFile   = fs.writeFileSync( csvFile, s3file ), // write CSV to local server -- this seems really silly. Want to just read from S3
        parsed      = Baby.parseFiles( csvFile, { // Note: using BabyParse not PapaParse
                        header: true,
                        step: function ( results, parser ) {
                          let thisItem = results.data[0];
                          // process this row
                        }
                      }),
        deleteFile  = fs.unlinkSync( csvFile ); // remove local CSV
  }
})); // end S3.getObject

Any ideas? Thanks!

Sure you can. Create a local buffer and store the chunks as they’re received from the stream. You can use an actual Buffer object if it’s binary data; otherwise you can just put it in an array. Check whether each chunk contains a delimiter character (,), which marks a new column, or a newline character (\r or \n), which marks the end of a row. When you hit the end of a row, process that row. That said, processing CSV streams has been solved before, and you can probably find a package to do it.
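To make that concrete, here’s a rough, untested sketch of the manual buffering approach against the S3 stream you already have (getS3params is your existing params object; processRow is a placeholder for your per-row logic):

// Sketch: buffer chunks from the S3 read stream and emit complete rows.
// Note the naive comma split breaks on quoted fields -- which is exactly why
// a CSV package is the better route.
let leftover = ''; // holds any incomplete line between chunks

S3.aws.getObject( getS3params ).createReadStream()
  .on( 'data', ( chunk ) => {
    const text  = leftover + chunk.toString( 'utf8' );
    const lines = text.split( /\r\n|\n|\r/ ); // a newline ends a row
    leftover    = lines.pop();                // the last piece may be a partial row
    lines.forEach( ( line ) => {
      if ( line.length ) processRow( line.split( ',' ) ); // placeholder per-row handler
    });
  })
  .on( 'end', () => {
    if ( leftover.length ) processRow( leftover.split( ',' ) );
  })
  .on( 'error', ( error ) => console.error( error ) );

For comparison, this is how I chunk large CSVs on the client with PapaParse itself: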

handleUpload(event) {
  this.setState({loading: true});
  const stopLoading = () => this.setState({loading: false});

  Papa.parse(event.target.files[0], {
    header: true,
    chunk: function(results, parser) {
      console.log("Chunk:", results.data);
      parser.pause();
      const resumeParsing = () => parser.resume();

      Meteor.call('utility.parseUpload', results.data, function(error, response) {
        if (error) { parser.abort(); Bert.alert(error.reason, 'danger'); return; }
        Bert.alert('chunk done!', 'success');
        resumeParsing();
      });
    },
    complete(results, file) {
      console.log('done!');
    }
  });
}

This is what gets called in the file input’s onChange handler. I have a React loader showing (hence this.setState). Maybe you have more columns than I do, but for me this code breaks my 200k-row spreadsheets into chunks of 50-80k rows so the browser doesn’t crash.
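The matching 'utility.parseUpload' method on the server just takes the array of row objects from each chunk and processes them. The body below is only an assumed sketch (the per-row handling is yours to fill in):

// Assumed sketch of the server-side method the client calls per chunk.
// The method name comes from the client code above; the body is a placeholder.
import { Meteor } from 'meteor/meteor';
import { check } from 'meteor/check';

Meteor.methods({
  'utility.parseUpload'(rows) {
    check(rows, Array); // with header: true, each element is one row object

    rows.forEach((row) => {
      // Replace with your own per-row validation / inserts.
    });

    return rows.length; // tells the client how many rows were handled
  }
});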

EDIT:
I just read that you are doing this on the server. I couldn’t get BabyParse to use the chunk code above for some reason (even though it claims to be an exact clone of PapaParse and just points to the PapaParse docs instead of having its own).


Really grateful for the feedback! @a.com, I had seen your earlier posts wrestling with CSVs and learned a lot.

@efrancis: Thanks for pointing me in the right direction! Here’s the code I ended up using: it streams data from S3 to a local file, then processes the local file. I’m sure there’s a way to simplify and just process the remote file from S3 (there’s a rough sketch of that idea at the end of this post), but I need to move on.

This is called from a client-side event that parses just the first 3 rows of the file so users get a preview and can map the fields. The client-side event also uploads the file to S3 and returns a URL (using the AWS-SDK via lepozepo:s3).
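For reference, the preview part can lean on PapaParse’s preview option, which stops parsing after a set number of rows. This is just a sketch of that piece (onPreviewReady is a placeholder callback; file is the File object from the input):

// Client-side sketch: parse only the first 3 rows for the field-mapping preview.
// onPreviewReady is a placeholder; `file` comes from the file input.
function previewCsv( file, onPreviewReady ) {
  Papa.parse( file, {
    header: true, // first row supplies the field names
    preview: 3,   // stop after 3 data rows, enough for the mapping UI
    complete( results ) {
      onPreviewReady( results.meta.fields, results.data );
    }
  });
}

The server-side code that picks up the file from S3 and processes it follows: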

var baby    = require( 'babyparse' );
var fs      = require( 'fs' );
var path    = require( 'path' );

let getS3params   = {
      Bucket: 'bucket_name',
      Key: file_relative_path
    },
    // Get an absolute path to the local Meteor installation. Regex matches a forward slash / at the end of the string ($)
    basePath      = path.resolve('.').split('.meteor')[0].replace(/\/$/, ''),
    // Use dir name w/ leading '.' so Meteor does not rebuild when files are added. NOTE: Folder must be present for this to work. In my case, added a startup function to create the folder if it doesn't exist.
    importDir     = basePath + '/.imports/',
    csvFile       = importDir + csv_file_name,
    writeFile     = fs.createWriteStream( csvFile ),
    getFile       = S3.aws.getObject( getS3params ).createReadStream().pipe( writeFile );

writeFile
  .on( 'error', ( error ) => {
    console.warn( 'Write file error.' );
    console.warn( error );
  })
  .on( 'finish', Meteor.bindEnvironment( () => { // 'finish' fires with no arguments once the write stream is done
    console.log( 'CSV writes complete.' );
    
    baby.parseFiles( csvFile, {
      header: true,
      step: ( results, parser ) => {
        try {
          // ADD PROCESSING TO MANIPULATE / INSERT ROW
        } catch( error ) {
          console.warn( 'parse step error:' );
          console.warn( error );
        }
      }, // End step function
      complete: ( results, file ) => {
        console.log( 'CSV file parse complete' );

        fs.unlinkSync( csvFile ); // Post-processing cleanup - delete the local file
      }
    }); // end Baby.parseFiles
  })); // end on.Finish

return true;
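And here’s the rough, untested sketch of the simplification I mentioned above. As far as I can tell, PapaParse (unlike BabyParse) can consume a Node readable stream directly, which would skip the local file entirely; I haven’t verified this end to end:

// Untested sketch: feed the S3 read stream straight to PapaParse on the server,
// skipping fs.createWriteStream entirely. Assumes the papaparse npm package is
// installed and imported server-side.
const Papa = require( 'papaparse' );

let s3Stream = S3.aws.getObject( getS3params ).createReadStream();

s3Stream.on( 'error', ( error ) => {
  console.warn( 'S3 stream error:' );
  console.warn( error );
});

Papa.parse( s3Stream, {
  header: true,
  step: Meteor.bindEnvironment( ( results, parser ) => {
    // ADD PROCESSING TO MANIPULATE / INSERT ROW (same as the step function above)
  }),
  complete: Meteor.bindEnvironment( () => {
    console.log( 'CSV stream parse complete' );
  })
});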