Looking for Suggestions: Large data files

dpatte · June 11, 2019, 2:21am

I am migrating a Windows standalone app (from C++) to Meteor, and I am looking for suggestions for dealing with large, mostly static flatfiles.

My original C/C++ app contains several large data flatfiles, some with up to 200,000 multifield records. They are parsed into data structures at startup, and the result is directly accessible to the app. The app rescans the resulting data up to 60 times per second, with different filter and merging criteria, to produce the results I display. The app is for astronomy simulation.

In Meteor, I’d rather not deliver the flatfiles unprocessed to the client and make the client convert the data into JSON objects on each startup. The source files seldom change, or change slowly, so I’d rather decode them into JSON objects as few times as possible. Most of the flatfile collections haven’t changed in 20 years, though some change about once per month, or once per day.

So, moving to Meteor, my first thought was to parse these flatfiles on the server at startup or when they change, putting the resulting data records in a Mongo collection. Clients could then subscribe to the resulting processed collection, then take advantage of Mongo’s sorting & filtering, and Meteor’s reactivity.

But I discovered how slow Minimongo is when subscribing to collections with so many records. It takes several minutes for a single subscription to complete!

Is there a better way of handling large changing data sets such as this, yet maintain client side mongo features and reactivity?

My app, by the way, was the first star charting app for Windows, written in 1995 for Windows 3.1. I’m hoping to release a 25th anniversary version of ‘MyStars!’ as an online app in 2020.

znewsham · June 11, 2019, 3:29am

The issue you’re facing is that each record has quite a lot of overhead when transferred via a subscription.

For the majority (the static part) of the data, I’d advise using a method call and batching results - I wrote a package that allows you to return iterative data from a single method call, but you can also just call a regular method repeatedly from the client. If you send a single response with 200,000 records, you might clog up the web socket for so long the connection dies. As the data loads, you can insert it into a client-only collection (e.g., with a null name).
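A minimal sketch of the "call a regular method repeatedly" idea, written as plain JavaScript so it runs outside Meteor. The names getBatch, loadAll, callMethod and insert are hypothetical: in a real app getBatch would be the body of a method registered with Meteor.methods, callMethod would be a promise-returning wrapper around Meteor.call, and insert would write into a client-only collection. The batch size of 1000 is an arbitrary assumption, not a recommendation from the thread.

```javascript
// Hypothetical server-side data; in a real app this is the parsed flatfile.
const RECORDS = Array.from({ length: 2500 }, (_, i) => ({ _id: String(i), mag: i % 7 }));

// Server side: return one slice of the records per call, plus the total count.
function getBatch(offset, limit) {
  return {
    rows: RECORDS.slice(offset, offset + limit),
    total: RECORDS.length,
  };
}

// Client side: keep requesting slices until every record has arrived.
// `callMethod(offset, limit)` must return a promise for { rows, total }.
// `insert(row)` stores one record (e.g. into a client-only collection).
async function loadAll(callMethod, insert, batchSize = 1000) {
  let offset = 0;
  let total = Infinity;
  while (offset < total) {
    const batch = await callMethod(offset, batchSize);
    for (const row of batch.rows) insert(row);
    total = batch.total;
    offset += batch.rows.length;
    if (batch.rows.length === 0) break; // guard against looping forever
  }
  return offset; // number of records received
}
```

Because each response carries only one slice, no single message holds all 200,000 records, which is the point of batching here.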

If you find that individual users are frequently requesting the data, you could look at persisting the data in something like grounddb in the browser itself.

For the “changeable” part of the data, if you’re caching the main body of data in the browser, you could add a lastUpdated parameter to your Meteor method. You’d also persist this value in the browser. Each time your app loads, you can ask the server for only the data that’s changed since the last time you synced.
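A rough sketch of the lastUpdated idea, again in plain JavaScript. The updatedAt field, the getChangesSince and sync functions, and the Map-based cache are all assumptions made for illustration; the thread does not say how records are stamped or where the browser-side cache lives. Deleted records are not handled here and would need separate treatment.

```javascript
// Hypothetical server-side records, each stamped with the time it last changed.
const serverRecords = [
  { _id: "a", name: "first", updatedAt: 100 },
  { _id: "b", name: "second", updatedAt: 200 },
  { _id: "c", name: "third", updatedAt: 300 },
];

// Server side: return only records changed after `lastUpdated`,
// together with the newest timestamp the client should remember.
function getChangesSince(lastUpdated) {
  const rows = serverRecords.filter((r) => r.updatedAt > lastUpdated);
  const newest = rows.reduce((max, r) => Math.max(max, r.updatedAt), lastUpdated);
  return { rows, lastUpdated: newest };
}

// Client side: merge the changes into a cache keyed by _id and
// store the new timestamp for the next sync.
function sync(cache, state) {
  const { rows, lastUpdated } = getChangesSince(state.lastUpdated);
  for (const row of rows) cache.set(row._id, row);
  state.lastUpdated = lastUpdated;
  return rows.length; // how many records were transferred
}
```

On a first load with lastUpdated set to 0 the client receives everything; later loads transfer only what changed in between.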

One last thing to consider: the data sent over the websocket is JSON text - not objects - so regardless of how you choose to deliver the data, the client WILL have to convert from text to JavaScript objects.

dpatte · June 11, 2019, 11:00am

Thanks for some feedback.

Last night I tried returning each collection (pubsub) as a single server-processed array of records (a single doc), and the client startup was almost instantaneous. Clearly the slowdown is the number of records being loaded into minimongo, not the total amount of data. I can then access the data from the array of records on the client side, but using javascript. We’ll see how fast that is.

znewsham · June 11, 2019, 12:53pm

I’m glad you’ve got it working better, but give client-side minimongo a try (a null-named collection). I think most of the overhead you’ve been seeing is caused by the pub/sub mechanism (tracking which connections have each document, and which fields they have). I don’t think minimongo is the problem.

dpatte · June 12, 2019, 12:15am

Well, I’ve run into an unexpected error, and may need to change my strategy again…

As mentioned, I am now converting a flatfile to objects then putting all the objects into a single array to write as a single record to the DB. This works for my smaller flatfiles, and makes pubsub nearly instant.

But I have one flatfile of ~1000 records, and using this process, the MyCollection.write(data: myArrayOfRecords) causes an error in the opLog:

BulkWriteError: write to oplog failed: BadValue: object to insert exceeds cappedMaxSize

Is there a maxsize for a doc, or for oplog docs?

znewsham · June 12, 2019, 2:33am

Yes. Mongo has a hard 16 MB document limit. I think serving them directly from the flat file via methods is a better bet.
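For anyone who still wants to store pre-processed arrays in the database, one possible workaround (not the approach recommended above) is to split the array across several documents so that each stays under the limit. The sketch below is an assumption-laden illustration: chunkByBytes and jsonBytes are hypothetical helpers, the size is estimated from the JSON text rather than the actual BSON encoding, and the default budget of half the limit is an arbitrary safety margin.

```javascript
const MAX_DOC_BYTES = 16 * 1024 * 1024; // the 16 MB document limit

// Rough size estimate: UTF-8 length of the JSON text.
// The real BSON size differs, hence the safety margin in the default budget.
function jsonBytes(value) {
  return Buffer.byteLength(JSON.stringify(value), "utf8");
}

// Split `records` into consecutive chunks whose estimated size
// stays within `budget` bytes (a single oversized record still gets its own chunk).
function chunkByBytes(records, budget = MAX_DOC_BYTES / 2) {
  const chunks = [];
  let current = [];
  let currentBytes = 0;
  for (const record of records) {
    const size = jsonBytes(record) + 1; // +1 for the separating comma
    if (current.length > 0 && currentBytes + size > budget) {
      chunks.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(record);
    currentBytes += size;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Each chunk could then be written as its own document and reassembled in order on the client.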

dpatte · June 12, 2019, 3:25am

I now have a working solution:

At server startup, I process a flatfile into objects on the server, and store them in a global array on the server.

At client startup, I use a method call to the server which simply returns the processed array to the client.

I just loaded a flatfile of 132,000 records on the server. When I ran the client the processed array then loaded on the client in about 1 second. That is reasonable.
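A minimal sketch of the parse-once-and-serve-from-memory pattern described in this post. The pipe-delimited layout, the field names (id, ra, dec, mag) and the sample values are made up for illustration, since the thread does not show the actual flatfile format; parseFlatfile, starCache and getStars are hypothetical names. In Meteor, getStars would be the body of a method and the file would be read from disk at startup.

```javascript
// Made-up flatfile contents; a real app would read the file from disk.
const FLATFILE = [
  "S1|10.5|-20.25|1.5",
  "S2|45.0|12.75|3.2",
  "S3|200.125|60.0|5.9",
].join("\n");

// Turn each non-empty line into an object with numeric fields.
function parseFlatfile(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => {
      const [id, ra, dec, mag] = line.split("|");
      return { id, ra: Number(ra), dec: Number(dec), mag: Number(mag) };
    });
}

// Parsed once when the server starts, then kept in memory.
const starCache = parseFlatfile(FLATFILE);

// What the client's startup method call would return: the cached array,
// with no per-request parsing and no database round trip.
function getStars() {
  return starCache;
}
```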

I also intend to add a ready flag to the db, which I can use to trigger a client reload using an autorun if the server data changes.
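One way the reload trigger could look, sketched without Meteor. It assumes the flag is a version value that changes whenever the server re-parses its data; createReloader, onVersion and fetchData are hypothetical names. In a Meteor app, onVersion would be called from inside an autorun that reads the flag document, and fetchData would be the method call that returns the processed array.

```javascript
// Build a small helper that refetches data only when the version flag changes.
function createReloader(fetchData) {
  let seenVersion = null;
  let data = null;
  return {
    // Call this every time the flag is observed; returns true if a reload happened.
    onVersion(version) {
      if (version !== seenVersion) {
        data = fetchData();
        seenVersion = version;
        return true;
      }
      return false;
    },
    getData() {
      return data;
    },
  };
}
```

Comparing against the last seen version keeps repeated autorun invocations from refetching a large array when nothing has changed.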

I can now also look at whether I want to stuff the results on the client into minimongo on the client for easier sorting and filtering, or whether I’ll simply work with the array of data using JS.

Thanks for your feedback.

doctorpangloss · June 12, 2019, 5:35am

The fastest way to deliver the data at this point is by emitting the data as an array declared in a JavaScript file and adding a <script async> element.
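A small sketch of what "emitting the data as an array declared in a JavaScript file" might look like. The function name toScriptSource, the global name STAR_DATA and the sample record are hypothetical. The generated text would be written to a static file on the server and loaded by a script element with the async attribute.

```javascript
// Produce the text of a .js file that assigns the records to a global.
// JSON.stringify output is valid JavaScript literal syntax, so it can be
// embedded directly as the right-hand side of the assignment.
function toScriptSource(records, globalName) {
  return `window.${globalName} = ${JSON.stringify(records)};\n`;
}

// Example: the file contents for a one-record data set.
const source = toScriptSource([{ id: "S1", mag: 1.5 }], "STAR_DATA");
```

The browser then parses the data as part of loading the script, and the app reads it from the global once the script has finished loading.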

Salketer · June 13, 2019, 2:37pm

Wouldn’t it be viable to do the sorting and filtering server-side?

Loading 132,000 records on the client seems pretty heavy to me. Even if you find the fastest way to transfer the files from the server to the client, the client will need a lot of power and memory to do the filtering itself.

Imagine if, when you visited Amazon, you received all 132,000 items in the store and your machine had to do all the heavy lifting.

Now it loads fast, probably because you are on a development setup and the data does not need to go through the internet?

Depending on the data types and what exactly needs to be aggregated you might want to use solutions made for just that, like Spark maybe?

dpatte · June 17, 2019, 2:33am

I’m trying to avoid going back to the server as much as possible, as this would cause delays in my simulations. Effectively, I don’t want to go back to the server at all after the first display is rendered. Loading a few large tables this way is similar to loading a few large images at startup - a few seconds.

What I have found effective though, is to deliver the data from the server as objects, not arrays, where possible (the table with 132,000 items is now one object with 132,000 attributes - each attribute being its own object). Clients are pretty fast at accessing such large objects.
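To illustrate the array-versus-object point, here is a sketch that turns an array of records into a single object keyed by record id. The id field, the sample values and the indexById helper are assumptions for illustration; the thread does not say which attribute is used as the key.

```javascript
// Build one object whose attributes are the records, keyed by `key`.
function indexById(records, key = "id") {
  const byId = {};
  for (const record of records) {
    byId[record[key]] = record;
  }
  return byId;
}

// Made-up sample records.
const starsArray = [
  { id: "S1", mag: 1.5 },
  { id: "S2", mag: 3.2 },
  { id: "S3", mag: 5.9 },
];

const starsById = indexById(starsArray);

// Direct lookup by key, instead of scanning the array with find().
const s2 = starsById["S2"];
```

With the keyed form, fetching one record is a property access rather than a scan of 132,000 array entries, while Object.values(starsById) still gives an array when a full pass is needed.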

Filtering is mostly a UI function, so it need not be as fast as the actual animation of the filtered data.