Simplest way to import 15M+ records from CSV to MongoDB? (mongoimport vs. loop and insert)


#1

I am looping over each row right now, building a a specific object, and then inserting that object into the collection-- for 16M documents. Is there a better way?

I’ve read mongoimport is faster, but no examples showing how to map the CSV to the object structure I need and then mongoimport. I also have seen a number of threads about how mongoimport can only do ObjectIds and meteor doesn’t support them?

How do folks do this (locally or into something like mongolab)


#2

I haven’t quite tackled 16m records in one go, but I have had great results with mikowals:batch-insert and harrison:papa-parse


#3

Relevant utilities and scripts

Notes

  • Take a look at schema.js and variety.js for examples on how to write shell extensions to the mongo database. For big data imports, it’s generally more stable to run the mapping scripts from the mongo shell than through the Meteor device driver.
  • Sometimes it’s easier to just import everything from CVS to JSON without mappings, and then run schema migration scripts.
  • db.copyDatabase() can be a great friend. Import the files into a staging database, run mutations and transformations and generate secondary databases, and then copy the results to where they will eventually be used.
  • Agreed that mikowals:batch-insert and harrison:papa-parse are both great if you’re going to do everything through a Meteor app. Also check out clinical:csv which includes papa-parse and other utilities.

But generally, yeah… importing 16M documents can take awhile. Sometimes the best solution is to just iterate over the rows, and accept that it’s going to be a batch run of a few hours.


#4

Thanks!

What about something like https://docs.mongodb.com/v3.2/reference/method/Bulk.insert/ ?


#5

CollectionName.rawCollection()…
Batch insert. Maybe make a limit counter and batch insert in groups of 1,000 etc?


#6

this is taking like an hour for every 50,000 rows/object. The method receives an array called data from clinet-side papaparse:
Meteor.methods({
‘utility.parseBusinesses’: function(data){
check( data, Array );

	let docs = [];

    for ( let i = 0; i < data.length; i++ ) {
      	let item   = data[ i ];
      	let companyNameLowerCase = item['Company Name'].toLowerCase();
      	let cityNameToLowerCase = item.City.toLowerCase();
       	let exists = Companies.find( { title: companyNameLowerCase, 'address.city': cityNameToLowerCase, 'address.state': item.State}, { limit: 1}).fetch();

      if ( exists.length === 0 ) {
      	let naicsArray = [];
      	let revenue = !Match.test(item['Annual Revenue'], Number) || !Match.test(item['Annual Revenue'], String) ? 0 : Number(item['Annual Revenue']);
      	naicsArray.push(item['NAICS Code']);
      	let doc = {
      		revenue: revenue,
      		title: companyNameLowerCase,
      		description: item.Desc.trim(),
      		naicsCodes: naicsArray,
      		industry: item.Industry.trim(),
      		address: {
      			address: item.Address,
      			address2: item.Address2,
      			county: item.County.trim(),
      			city: cityNameToLowerCase,
      			state: item.State,
      			zip: item.Zip,
      		},
      		modelType: 'company',
      		website: item['Company Website'],
      		employeeCount: item.Employees
      	}
        Companies.insert( doc );
        let deci = i/data.length
        console.log(deci*100);
      } else {
        console.warn( 'Rejected. This item already exists.' );
      }
    }

},

});


#7

It looks like you are doing an insert to the database on each item. This isn’t efficient. After doing the parse, either format/clean data on client like you have, or move that to the server also.

Instead of doing the insert, push all of the results to an array, then make a server side only meteor method, and pass the results array to the server.

In the server side meteor method,

Get the Companies.rawCollection() assigned to a variable, then do a batch/bulk insert

https://docs.mongodb.com/v3.2/core/bulk-write-operations/

Make sure the array is in the correct format, you can also break it into a few pieces if you think the entire array will be too large.

Also, remove the unique find by name on the client. Just change the insert to an upsert on the server. Mongodb will handle that logic for you, and will either update the company file for you, or create the record if it is new. You can’t upsert on client by non unique _id anyway.


#8

Also, just another important note. You are probably subscribing to the entire Companies collection on the client, since I see you are doing the
Companies.find( { title: companyNameLowerCase, ‘address.city’: cityNameToLowerCase, ‘address.state’: item.State}, { limit: 1}).fetch();

Remove that subscription/pub completely, and let mongodb handle that in the upsert logic.
Having that subscription running/find means that as the method is inserting documents, your server/client are downloading results, up to 15m? That will only make it much worse


#9

The above code is actually in a meteor.method (I should have mentioned this).

I’m testing it now without the lookup/mapping to the object and it takes like 1 minute per 50k

Thanks for your response, going through it now.


#10

I have a lot of duplicate data in the spread sheet, so I need to either clean it now or later right? I want to make sure I only have no more than one business with the same exact name in the same city and same state.

Should I just import all the CSVs, then migrate them to the final schema, than run a job over the 15M documents to delete duplicates?


#11

if there are duplicates, are they exactly the same? or do you want to combine them to get the most complete data?

Or if they are the same, do you want to just use one and eliminate the remaining? (in the spreadsheet, not in the database)

What field do you want to compare on, company name?


#12

I want to make sure I only have one AMCE company per city/state. But it is okay to have one ACME company in toledo, California and one in toledo, Arkansas. This is where I decided on:

let exists = Companies.find( { title: companyNameLowerCase, 'address.city': cityNameToLowerCase, 'address.state': item.State}, { limit: 1}).fetch();

I have one CSV per state that I’m uploading individually. They each have 100k-500k

if there are duplicates, are they exactly the same?

Yeah they’re basically exactly the same. My code above does exactly what I need, it’s just unfeasibly slow… I was saving everything in lowercase so I can check everything by lowercase.


#13

You could technicaly return the object in the correct mongodb bulkWrite format instead of doing a second update of the array. I also used map and created new arrays mostly to show how the data moves around. If the data is large, it may make sense to optimize by doing each/recursive updates to the original array, and modifying the cody to do one proper update to put everything in the correct format. Also, the lo_.uniqBy may not be the fastest, there is uniqWith and other js methods to search an array for unique values, also lo_.find etc.

import lo_ from 'lodash';

Meteor.methods({
  'utility.parseBusinesses': function(data) {

    check(data, Array);
    const docsArray = lo_.map(data, (item) => {
     
      const companyNameLowerCase = Match.test(item['Company Name'], String) && item['Company Name'].toLowerCase();
      const cityNameToLowerCase = Match.test(item.City, String) && item.City.toLowerCase();
      const naicsArray = []
      const revenue = !Match.test(item['Annual Revenue'], Number) || !Match.test(item['Annual Revenue'], String) ? 0 : Number(item['Annual Revenue']);
     
      naicsArray.push(item['NAICS Code']);

      return {
        revenue: revenue,
        title: companyNameLowerCase,
        description: Match.test(item.Desc, String) && item.Desc.trim(),
        naicsCodes: naicsArray,
        industry: Match.test(item.Industry, String) && item.Industry.trim(),
        address: {
          address: item.Address,
          address2: item.Address2,
          county: Match.test(item.County, String) && item.County.trim(),
          city: cityNameToLowerCase,
          state: item.State,
          zip: item.Zip,
        },
        modelType: 'company',
        website: item['Company Website'],
        employeeCount: item.Employees
      };

    });
    
    const uniqueDocs = lo_.uniqBy(docsArray, 'title');
    
    if (lo_.isArray(uniqueDocs) && uniqueDocs.length) {
      const CompaniesRaw = Companies.rawCollection();
      if (lo_.isFunction(CompaniesRaw.bulkWrite)) {
        
        // build the bulk array in the correct format for mongodb
        const bulkArray = lo_.map(uniqueDocs, (_doc) => {
          return {
            updateOne: {
              "filter": {"title" : _doc.title, 'address.state': lo_.get(_doc, 'address.state'} , // use lo_.get to safely access nested keys.
              "update": { $set : _doc }
              "upsert": true
            }
          }
        });

        //insert the bulk array into mongodb. Mongo limits bulk operations to 1,000 operations. 
        //It will split them up internally. 
        //So if it's too slow to insert the docs all together, 
        //split the array into smaller chunks that mongo can process.

        if (lo_.isArray(bulkArray) && bulkArray.length) {
          try {
            const bulkResult = CompaniesRaw.bulkWrite(bulkArray, {
              ordered: false // unordered writes are faster generally
            });
            console.log('bulkResult: ', bulkResult);
          } catch (e) {
            console.log('err in bulkWrite mongodb operation:', e);
          }
        }
        return;
      }
    }
  },
})

#14

I didn’t check this code, i just typed it quickly. There may be an error in the bulkWrite mongodb code, honestly haven’t used it much.

also, i am using lodash installed via npm. this is server side, imported as lo_

link:
https://docs.mongodb.com/v3.2/reference/method/db.collection.bulkWrite/#bulkwrite-write-operations-updateonemany


#15

I know you need to make sure the insert is unique, but here are some additional notes,


#16

Thanks so much for taking the time out to help.


#17

No problem, let me know how it works out, the benchmark you get, definitely interested.

Another thing, put an index for title/city/state fields, and any other fields you need to find/update on, and probably a combined index on title/state, in mongodb. You can do this by doing Collection.rawCollection().createIndex, check their docs for the options. You can put that code in a meteor.startup.


#18

I kept getting an error when I tried to put multiple indexes… I tried a couple variations and moved on. None of these seemed to work:

if (Meteor.isServer) {
  Companies._ensureIndex({
    "title": "text",
     "address.city: "text"
  });
}
if (Meteor.isServer) {
  Companies._ensureIndex({
    "title": "text",
     "address.city: 1
  });
}
if (Meteor.isServer) {
  Companies._ensureIndex({
    "title": "text",
  });

  Companies._ensureIndex({
     "address.city: "text",
  });
}

#19

I stay away from the built in Collection methods, since they are not updated.

(your code also had error with quotes on the address.city key, if it matters)

_ensureIndex probably does work, but it is being deprecated and turning into an alias
https://docs.mongodb.com/v3.2/reference/method/db.collection.createIndex/

Meteor.startup(function() {
  Meteor.isServer && Companies.rawCollection().createIndex({title: 1, 'address.city': 1, 'address.state': 1});
})

since the text for state city is pretty much a full text match, not searching within text, you shouldn’t need a text type index, you can just use 1/-1
stackoverflow link

if you do need text/multiple text, see here for compound text fields etc
https://docs.mongodb.com/v3.2/core/index-text/
and:
https://docs.mongodb.com/v3.2/tutorial/create-indexes-to-support-queries/

EXAMPLE
The following index on a collection:

{ x: 1, y: 1, z: 1 }
Can support queries that the following indexes support:

{ x: 1 }
{ x: 1, y: 1 }

#20

To see how the indexes work and analyze the query performance, you can download MongoDb Compass (or beta)

I have a compound index like above, with three fields, you can see the index still hits only with partial filter keys.