I am looping over each row right now, building a specific object, and then inserting that object into the collection, for 16M documents. Is there a better way?
I’ve read that mongoimport is faster, but I can’t find examples showing how to map the CSV to the object structure I need before running mongoimport. I’ve also seen a number of threads saying mongoimport can only do ObjectIds, and Meteor doesn’t support them?
How do folks do this (locally, or into something like mongolab)?
Take a look at schema.js and variety.js for examples of how to write shell extensions for the mongo database. For big data imports, it’s generally more stable to run the mapping scripts from the mongo shell than through Meteor’s driver.
Sometimes it’s easier to just import everything from CSV to JSON without mappings, and then run schema migration scripts.
db.copyDatabase() can be a great friend. Import the files into a staging database, run mutations and transformations and generate secondary databases, and then copy the results to where they will eventually be used.
Agreed that mikowals:batch-insert and harrison:papa-parse are both great if you’re going to do everything through a Meteor app. Also check out clinical:csv which includes papa-parse and other utilities.
But generally, yeah… importing 16M documents can take a while. Sometimes the best solution is to just iterate over the rows, and accept that it’s going to be a batch run of a few hours.
This is taking about an hour for every 50,000 rows/objects. The method receives an array called data from client-side papaparse:
Meteor.methods({
  'utility.parseBusinesses': function (data) {
    check(data, Array);
    for (let i = 0; i < data.length; i++) {
      let item = data[i];
      let companyNameLowerCase = item['Company Name'].toLowerCase();
      let cityNameToLowerCase = item.City.toLowerCase();
      let exists = Companies.find(
        { title: companyNameLowerCase, 'address.city': cityNameToLowerCase, 'address.state': item.State },
        { limit: 1 }
      ).fetch();
      if (exists.length === 0) {
        let naicsArray = [];
        // coerce revenue to a number; fall back to 0 if it is neither a number nor a string
        let revenue = Match.test(item['Annual Revenue'], Number) || Match.test(item['Annual Revenue'], String)
          ? Number(item['Annual Revenue'])
          : 0;
        naicsArray.push(item['NAICS Code']);
        let doc = {
          revenue: revenue,
          title: companyNameLowerCase,
          description: item.Desc.trim(),
          naicsCodes: naicsArray,
          industry: item.Industry.trim(),
          address: {
            address: item.Address,
            address2: item.Address2,
            county: item.County.trim(),
            city: cityNameToLowerCase,
            state: item.State,
            zip: item.Zip,
          },
          modelType: 'company',
          website: item['Company Website'],
          employeeCount: item.Employees
        };
        Companies.insert(doc);
        let deci = i / data.length;
        console.log(deci * 100);
      } else {
        console.warn('Rejected. This item already exists.');
      }
    }
  },
});
It looks like you are doing an insert to the database for each item. This isn’t efficient. After doing the parse, either format/clean the data on the client like you have, or move that to the server as well.
Instead of doing the insert, push all of the results to an array, then make a server-side-only Meteor method and pass the results array to the server.
In the server-side Meteor method, get Companies.rawCollection() assigned to a variable, then do a batch/bulk insert.
Make sure the array is in the correct format; you can also break it into a few pieces if you think the entire array will be too large.
Also, remove the unique find by name on the client. Just change the insert to an upsert on the server. MongoDB will handle that logic for you, and will either update the company document or create the record if it is new. You can’t upsert on the client by a non-unique _id anyway.
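The suggestion to break the array into a few pieces can be sketched with a plain chunking helper. This is a minimal sketch; chunkArray is a hypothetical helper name, and 1,000 matches MongoDB’s internal bulk batch size:

```javascript
// Split a large ops array into fixed-size chunks so each bulkWrite stays small.
function chunkArray(items, chunkSize) {
  const chunks = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    chunks.push(items.slice(i, i + chunkSize));
  }
  return chunks;
}

// Usage sketch (hypothetical names): send each chunk to the raw collection in turn, e.g.
// chunkArray(bulkArray, 1000).forEach(batch =>
//   Promise.await(CompaniesRaw.bulkWrite(batch, { ordered: false })));
```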
Also, just another important note: you are probably subscribing to the entire Companies collection on the client, since I see you are doing the
Companies.find({ title: companyNameLowerCase, 'address.city': cityNameToLowerCase, 'address.state': item.State }, { limit: 1 }).fetch();
Remove that subscription/publication completely, and let MongoDB handle that in the upsert logic.
Having that subscription/find running means that as the method is inserting documents, your server and client are downloading results, up to 15M of them? That will only make it much worse.
I have a lot of duplicate data in the spreadsheet, so I need to either clean it now or later, right? I want to make sure I have no more than one business with the same exact name in the same city and same state.
Should I just import all the CSVs, then migrate them to the final schema, then run a job over the 15M documents to delete duplicates?
I want to make sure I only have one ACME company per city/state. But it is okay to have one ACME company in Toledo, California and one in Toledo, Arkansas. This is what I decided on:
I have one CSV per state that I’m uploading individually. They each have 100k–500k rows.
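Since uniqueness here is per name + city + state (not name alone), the duplicate check can also be done in memory before the bulk upsert. A minimal sketch, assuming the rows are already mapped to the doc shape used in this thread; dedupeDocs is a hypothetical helper:

```javascript
// Keep only the first doc seen for each (title, city, state) triple, compared in lowercase.
function dedupeDocs(docs) {
  const seen = new Map();
  for (const doc of docs) {
    const key = [doc.title, doc.address.city, doc.address.state]
      .map(v => String(v).toLowerCase())
      .join('|');
    if (!seen.has(key)) seen.set(key, doc);
  }
  return Array.from(seen.values());
}
```

With this keying, ACME in Toledo, California and ACME in Toledo, Arkansas produce different keys, so both survive, while two ACMEs in the same city/state collapse to one.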
If there are duplicates, are they exactly the same?
Yeah, they’re basically exactly the same. My code above does exactly what I need; it’s just unfeasibly slow… I was saving everything in lowercase so I can check everything by lowercase.
You could technically return the object in the correct MongoDB bulkWrite format instead of doing a second pass over the array. I also used map and created new arrays mostly to show how the data moves around. If the data is large, it may make sense to optimize by doing each/recursive updates on the original array, modifying the code to do one proper pass that puts everything in the correct format. Also, lo_.uniqBy may not be the fastest; there is uniqWith, and other JS methods to search an array for unique values, also lo_.find, etc.
import lo_ from 'lodash';

Meteor.methods({
  'utility.parseBusinesses': function (data) {
    check(data, Array);
    const docsArray = lo_.map(data, (item) => {
      const companyNameLowerCase = Match.test(item['Company Name'], String) && item['Company Name'].toLowerCase();
      const cityNameToLowerCase = Match.test(item.City, String) && item.City.toLowerCase();
      const naicsArray = [];
      // coerce revenue to a number; fall back to 0 if it is neither a number nor a string
      const revenue = Match.test(item['Annual Revenue'], Number) || Match.test(item['Annual Revenue'], String)
        ? Number(item['Annual Revenue'])
        : 0;
      naicsArray.push(item['NAICS Code']);
      return {
        revenue: revenue,
        title: companyNameLowerCase,
        description: Match.test(item.Desc, String) && item.Desc.trim(),
        naicsCodes: naicsArray,
        industry: Match.test(item.Industry, String) && item.Industry.trim(),
        address: {
          address: item.Address,
          address2: item.Address2,
          county: Match.test(item.County, String) && item.County.trim(),
          city: cityNameToLowerCase,
          state: item.State,
          zip: item.Zip,
        },
        modelType: 'company',
        website: item['Company Website'],
        employeeCount: item.Employees
      };
    });
    const uniqueDocs = lo_.uniqBy(docsArray, 'title');
    if (lo_.isArray(uniqueDocs) && uniqueDocs.length) {
      const CompaniesRaw = Companies.rawCollection();
      if (lo_.isFunction(CompaniesRaw.bulkWrite)) {
        // build the bulk array in the correct format for mongodb
        const bulkArray = lo_.map(uniqueDocs, (_doc) => {
          return {
            updateOne: {
              filter: { title: _doc.title, 'address.state': lo_.get(_doc, 'address.state') }, // use lo_.get to safely access nested keys
              update: { $set: _doc },
              upsert: true
            }
          };
        });
        // insert the bulk array into mongodb. Mongo limits bulk operations to 1,000 operations;
        // it will split them up internally. So if it's too slow to insert the docs all together,
        // split the array into smaller chunks that mongo can process.
        if (lo_.isArray(bulkArray) && bulkArray.length) {
          try {
            // rawCollection().bulkWrite returns a Promise; Promise.await resolves it inside a Meteor method
            const bulkResult = Promise.await(CompaniesRaw.bulkWrite(bulkArray, {
              ordered: false // unordered writes are generally faster
            }));
            console.log('bulkResult: ', bulkResult);
          } catch (e) {
            console.log('err in bulkWrite mongodb operation:', e);
          }
        }
        return;
      }
    }
  },
});
No problem, let me know how it works out and what benchmark you get; definitely interested.
Another thing: put an index on the title/city/state fields and any other fields you need to find/update on, and probably a compound index on title/state, in MongoDB. You can do this with Collection.rawCollection().createIndex; check their docs for the options. You can put that code in a Meteor.startup.
Since the match on state/city is pretty much a full exact match, not a search within text, you shouldn’t need a text-type index; you can just use 1/-1. stackoverflow link