Most efficient sync of CSV file to MongoDB collection?


#1

Hi all. I’m trying to find the most efficient way to sync a CSV file that is updated every 30 mins with a MongoDB collection that is used for live queries and reporting in my meteor app.


Requirements

  • Import around 5 CSVs (of varying sizes) every 30 mins.
    • Small CSVs only a few hundred rows with 2-3 fields / row.
    • Large CSVs over a million rows with 10-15 fields / row.
  • Run reports on imported collections using MongoDB’s aggregation framework.
  • Publish imported collections to clients.

What I’m currently doing

When the CSV file is updated an external service to the main app uses mongoimport --drop ... to upload the data. This works fine for smaller CSVs/collections but is slow and quite resource intensive on the DB; I’m having to allocate 4.98GB of RAM on my compose.io deployment to ensure that I don’t get memory failures. Whilst the larger CSVs files are imported, and there is other load on the database, it sometimes happens that the main meteor app throws query errors on open cursors.

The reason I’m using the --drop option instead of just upserting the CSVs is to make sure that any deletions in the CSV are reflected in the DB collection.

What I’m currently considering to improve efficiency

  1. Import CSVs into a separate MongoDB and mirror to the main DB. I have no idea if this is something that MongoDB supports?
  2. Only import deltas between the CSV file and the collection. Possible approaches:
  3. On the external service, diff the previously imported CSV against the latest version and do appropriate INSERT/UPDATE/DELETE on DB collection. I’m thinking this could result in the CSV and DB collection falling out of sync though, right?
  4. On the external service, create a check/hash sum of all rows in the CSV. Then compare these against the previously saved check/hash sum for each doc in the DB collection and appropriate CRUD ops.

Any ideas/advice/guidance would be greatly appreciated! Thanks in advance!