Storing and fetching GBs of data

What I’m trying to do
I want to store 50-60 GB of data per user and fetch information when the user requests it. Users will upload their CSV files, and Meteor will parse them and generate JSON data.
I want to store this JSON data and access it whenever I want, with the lowest server resource usage possible.

What I tried

  • MongoDB Document

I tried to insert the data into a MongoDB document, but the document size limit (16 MB) made it impossible.

  • CollectionFS

As the MongoDB docs suggest, I tried CollectionFS to bypass the document size limit. I can upload the CSV file, but I couldn't upload the JSON data. I'm fairly new to Node, so I couldn't manage to stream the data to CollectionFS.

Update: I can now stream data to CollectionFS and upload it as a JSON file by using GridFS.
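
For anyone curious, here is a rough sketch of the kind of streaming upload I mean, using the raw Node MongoDB driver's GridFSBucket rather than the CollectionFS wrapper (bucket, field, and file names are placeholders, not my actual code):

```js
// Rough sketch: stream the parsed JSON into GridFS with the raw Node driver.
// `db` is a connected database handle; names are placeholders.
const { GridFSBucket } = require('mongodb');
const { Readable } = require('stream');

function saveJsonToGridFS(db, userId, jsonData) {
  const bucket = new GridFSBucket(db, { bucketName: 'userData' });

  // Turn the JSON into a readable stream so GridFS can store it in chunks.
  // (For really big data you would stream row by row instead of stringifying
  // everything in memory like this.)
  const source = Readable.from([JSON.stringify(jsonData)]);

  return new Promise((resolve, reject) => {
    const upload = bucket.openUploadStream(`${userId}.json`, {
      metadata: { userId },
    });
    source
      .pipe(upload)
      .on('error', reject)
      .on('finish', () => resolve(upload.id));
  });
}
```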

My questions

  1. Even if I manage to stream JSON to CollectionFS, it will be stored as a .json file. Doesn't this mean opening a huge file to fetch maybe one line of it? I assume this will crush my server or generate a lot of scaling costs when even 100 users do this.
    Update: Since I managed to upload the data as a JSON file, a new problem appeared: the JSON file is much bigger than the parsed file (a 15 MB CSV became a 73 MB JSON file).

  2. What is the most efficient way to do what I’m trying to do?

  3. New Question: Should I use S3 instead of MongoDB for user data storage?

Use a microservice to do this conversion, maybe with some purpose-built lib?

http://papaparse.com/
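
For what it's worth, a minimal sketch of row-by-row parsing with Papa Parse's `step` callback, assuming a Papa Parse version that accepts Node read streams (function and field names are just examples):

```js
// Sketch: parse the CSV one row at a time with Papa Parse's `step` callback,
// so the whole result never sits in memory as one giant JSON blob.
const Papa = require('papaparse');
const fs = require('fs');

function importCsv(filePath, onRow) {
  return new Promise((resolve, reject) => {
    Papa.parse(fs.createReadStream(filePath), {
      header: true,                              // first CSV line becomes the object keys
      skipEmptyLines: true,
      step: (results) => onRow(results.data),    // called once per parsed row
      complete: () => resolve(),
      error: reject,
    });
  });
}
```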


Thank you for your answer. I already use Papa Parse for parsing the user's file. Papa Parse converts CSV data to JSON data. I just don't know how to store that JSON data and fetch individual items from it (e.g. a specific column and row) efficiently.

You would have to handle the data by redirecting streams so you don't blow up the server's memory.

But I don't think Node.js is the best option for big I/O operations like that.

Have you thought about using some service, like uploading directly to S3? You could have a background job in Meteor download the file after the upload, and let S3 handle the upload itself.
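
Something like this sketch, assuming the aws-sdk package (bucket and key names are placeholders):

```js
// Sketch: push the converted data straight to S3 and read pieces of it back
// later with byte-range requests. Bucket/key names are placeholders.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

function storeUserData(userId, readableStream) {
  return s3.upload({
    Bucket: 'my-user-data-bucket',
    Key: `users/${userId}/data.json`,
    Body: readableStream,              // aws-sdk streams the body for you
    ContentType: 'application/json',
  }).promise();
}

function fetchBytes(userId, start, end) {
  return s3.getObject({
    Bucket: 'my-user-data-bucket',
    Key: `users/${userId}/data.json`,
    Range: `bytes=${start}-${end}`,    // partial read, no full download
  }).promise();
}
```

Keep in mind that Range is a byte offset, not a line number, so fetching "line N" would still need some record of where each line starts.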

Thanks for your suggestion. I'm not familiar with S3; I know it's for object storage, but I don't know what exactly I can do with it for my product. My biggest concern is not being able to fetch data in chunks. I want to be able to get a specific line's worth of data when it's requested. I will read the S3 docs on this. Do you know anything about it?

I don't understand what could make JSON 60 GB in size. If you have pictures in there, save them as pictures; if it's some raw data, save it as raw data and store just pointers to it.

Mongo is a data store, but it's 16 MB per document; that's why we have collections, to filter data based on usage/schema, etc.

Billions of rows can make a 60 GB JSON; there are no pictures, it's plain text. The worse part is that the 60 GB of data isn't even JSON yet. I haven't converted anything that big, but I parsed a 15 MB CSV file with Baby Parse (Papa Parse for the server) and saved the JSON result in MongoDB using GridFS, and that 15 MB of data became a 73 (SEVENTY-THREE) MB JSON file. Either I'm doing something mind-numbingly stupid, or MongoDB is definitely not suitable for my data.

Maybe you can consider uploading your data in batches (1,000 rows per batch). I don't think the browser can handle data that large. Or use a MongoDB bulk insert.
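
A minimal sketch of the batch-insert idea, assuming the rows are already parsed into objects (collection and field names are examples):

```js
// Sketch: insert parsed rows in batches of 1,000 so no single insert has to
// carry the whole data set. (Collection/field names are examples.)
async function insertInBatches(collection, userId, rows, batchSize = 1000) {
  for (let i = 0; i < rows.length; i += batchSize) {
    const batch = rows
      .slice(i, i + batchSize)
      .map((row) => ({ userId, ...row }));          // tag each row with its owner
    await collection.insertMany(batch, { ordered: false });
  }
}
```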

Thank you for your suggestion. The browser doesn't do anything with that data except upload the user's file to Mongo for parsing on the server. So basically Meteor takes the user's file and stores it in a collection, then parses the data and saves the parsed (now JSON) data to a collection with GridFS (I'm using CollectionFS for this). The original user file is then deleted. No problem so far. But at this point, I have two problems:

  1. Huge increase in file size when it’s stored by GridFS.

  2. I don't know how to fetch data in chunks when searching for something, so I don't end up filling the server's memory with unnecessary data.

I believe the new version of MongoDB (3.2) improves compression; check this thread.

If I’m reading your post correctly, it sounds like you need to normalise your uploaded data…

If the data is a CSV with millions of rows, why not just store every row as a document? I somehow don't understand what type of data could be storable in CSV but not in MongoDB natively, without the need to upload it as raw data.

If I didn't need to know what's in these files, I would just store a kind of header for each in Mongo: the user's ID and a link to cloud storage. Then, if the files are public, I can reach them directly from the browser. If they are private, I would create a Picker route which would serve these files as if they were on my server, but it would just be proxying and handling auth to the cloud.
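
A rough sketch of that proxy idea, assuming the meteorhacks:picker and aws-sdk packages (the UserFiles collection, the auth check, and all names are placeholders, not a real implementation):

```js
// Sketch of the "header document + proxy route" idea.
// Header documents in Mongo look roughly like:
// { _id, userId, bucket, key, filename }
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// With meteorhacks:picker added, Picker is available on the server.
Picker.route('/user-files/:fileId', (params, req, res) => {
  const header = UserFiles.findOne(params.fileId);  // add your own auth check here
  if (!header) {
    res.writeHead(404);
    return res.end();
  }
  res.writeHead(200, { 'Content-Type': 'application/json' });
  s3.getObject({ Bucket: header.bucket, Key: header.key })
    .createReadStream()
    .pipe(res);                                     // proxy the file to the client
});
```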

Thank you for your suggestion. Storing rows as documents sounds extremely unorganized to me. I know I can create an ID for every user so the server will know which data belongs to which user, but there should be a better way, in my opinion. Maybe I'm wrong, I don't know; I'm pretty new to this.

My problem was streaming the data to MongoDB; I wasn't implying that such a thing was impossible. In fact, I managed to create a JSON file and upload it to MongoDB via GridFS. But the file size got a lot bigger: 15 MB of file data became a 73 MB file. Let's assume I manage to minify this JSON data, or am happy with the size at least in the initial phase; I still have my other crucial problem: how do I read this JSON data in chunks (or by a similar process) so I don't end up blowing up my server?

I don't necessarily need to know the contents of the files, but I do need to read them. The whole purpose of the app is reading that data. Since you mentioned cloud storage, what do you think about storing that data in S3?

All I want is to store user data and fetch a couple of lines of it when the user needs them, and I want to do this without reading all of the user's data (as you would expect).

Well, maybe, but right now all I want is to read portions of a user's data when the user requests something, and I want to do this without reading the whole data set. I want to go to specific sections of the file and fetch a couple of lines.

That is the correct way to use Mongo though (or pretty much any database). Without storing the rows as individual documents/normalising the database, your queries will be needlessly slow, because Mongo will have to look up the entire document when searching for that portion of user data.

I'm using GridFS, so my guess is that it stores files in chunks and each chunk is a document. I'm still having problems with Node streams, though; I can't get them to work yet. What I understand is that when the client requests data from the server, the server goes to the file's chunks and streams chunk by chunk until it finds the requested data (I hope it's more efficient than this). But you said that when rows are stored as documents, Mongo will not look up the entire file. How can it do this? (This is not an argumentative question; I'm really wondering how it works that way.) Doesn't it at least start from the top and read document by document until it finds the data?

https://docs.mongodb.org/v3.0/indexes/
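
In short: with one document per row and an index on the fields you query, Mongo walks the index straight to the matching documents instead of reading the collection from the top. A rough sketch, with example collection and field names:

```js
// Sketch: one document per CSV row plus a compound index, so a lookup walks
// the index instead of scanning everything. (Names are examples.)
async function fetchLines(db, userId, from, to) {
  const rows = db.collection('userRows');

  // Create once; it's a no-op if the index already exists.
  await rows.createIndex({ userId: 1, rowNumber: 1 });

  // Only the matching documents are read, not the whole collection.
  return rows
    .find({ userId, rowNumber: { $gte: from, $lte: to } })
    .toArray();
}
```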


Hi, @meteorusername!
At fastfred.de we work with data in MongoDB.
We go through the whole file (XML, CSV) and split it into separate nodes.
We use Node.js (with the cluster module) + a Redis-based queue.
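
A rough sketch of that pattern, using the bull package for the Redis-backed queue (the package choice, queue name, and job fields are assumptions here):

```js
// Sketch: a Redis-backed job queue so CSV imports run in background workers
// instead of the web process. (Uses the `bull` package as an assumption;
// queue name and job fields are placeholders.)
const Queue = require('bull');

const importQueue = new Queue('csv-import', 'redis://127.0.0.1:6379');

// Web process: enqueue a job when a user uploads a file.
function enqueueImport(userId, filePath) {
  return importQueue.add({ userId, filePath });
}

// Worker process(es): pick up jobs one at a time and do the heavy parsing.
importQueue.process(async (job) => {
  const { userId, filePath } = job.data;
  // parse filePath row by row here and store the rows for userId
});
```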