I am migrating a Windows standalone app (from C++) to Meteor, and I am looking for suggestions for dealing with large, mostly-static flatfiles.
My original C/C++ app contains several large data flatfiles, some with up to 200,000 multi-field records. They are parsed into data structures at startup, and the result is directly accessible to the app. The app rescans the resulting data up to 60 times/second, with different filter and merging criteria, to produce the results I display. The app is for astronomy simulation.
In Meteor, I’d rather not deliver the flatfiles unprocessed to the client and make the client convert the data into JSON objects on each startup. The source files seldom change, or change slowly, so I’d rather decode them into JSON objects as few times as possible. Most of the flatfile collections haven’t changed in 20 years, though some change about once per month, or once per day.
So, moving to Meteor, my first thought was to parse these flatfiles on the server at startup or when they change, putting the resulting data records in a Mongo collection. Clients could then subscribe to the processed collection and take advantage of Mongo’s sorting and filtering and Meteor’s reactivity.
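Roughly, the server-side approach I have in mind looks like this (the file path, field layout, and collection name are placeholders, not my real data):

```js
import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';
import fs from 'fs';

export const Stars = new Mongo.Collection('stars');

Meteor.startup(() => {
  // Only reparse when the collection is empty; a fuller version might
  // compare file mtimes or a checksum to detect the (rare) changes.
  if (Stars.find().count() > 0) return;

  const lines = fs.readFileSync('/data/stars.dat', 'utf8').split('\n');
  lines.forEach((line) => {
    if (!line.trim()) return;
    // Hypothetical fixed-width layout: catalogue id, RA, Dec, magnitude.
    Stars.insert({
      catId: line.slice(0, 6).trim(),
      ra: parseFloat(line.slice(6, 16)),
      dec: parseFloat(line.slice(16, 26)),
      mag: parseFloat(line.slice(26, 32)),
    });
  });
});

// Clients subscribe to the processed records rather than the raw flatfile.
Meteor.publish('stars', function () {
  return Stars.find();
});
```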
But I discovered how slow Minimongo is when subscribing to collections with so many records. It takes several minutes for a single subscription to complete!
Is there a better way of handling large, changing data sets like this while maintaining client-side Mongo features and reactivity?
My app, by the way, was the first star charting app for Windows, written in 1995 for Windows 3.1. I’m hoping to release a 25th anniversary version of ‘MyStars!’ as an online app in 2020.
The issue you’re facing is that each record has quite a lot of overhead when transferred via a subscription.
For the majority (the static part) of the data, I’d advise using a method call and batching results. I wrote a package that allows you to return iterative data from a single method call, but you can also just call a regular method repeatedly from the client. If you send a single response with 200,000 records, you might clog up the websocket for so long that the connection dies. As the data loads, you can insert it into a client-only collection (e.g., one with a null name).
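A rough sketch of the batching idea (the method name, batch size, and the `Stars` server collection are illustrative, not from your app):

```js
import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';
import { check } from 'meteor/check';

// server: return one page of the static data per call
Meteor.methods({
  'stars.getBatch'(offset, limit) {
    check(offset, Number);
    check(limit, Number);
    // Stars is the server collection holding the parsed flatfile records
    return Stars.find({}, { skip: offset, limit, sort: { catId: 1 } }).fetch();
  },
});

// client: a null-named (client-only) collection, never synced via pub/sub
export const ClientStars = new Mongo.Collection(null);

// Promise wrapper around the classic callback-style Meteor.call
function callMethod(name, ...args) {
  return new Promise((resolve, reject) =>
    Meteor.call(name, ...args, (err, res) => (err ? reject(err) : resolve(res)))
  );
}

export async function loadAllStars(batchSize = 5000) {
  for (let offset = 0; ; offset += batchSize) {
    const batch = await callMethod('stars.getBatch', offset, batchSize);
    batch.forEach((doc) => ClientStars.insert(doc));
    if (batch.length < batchSize) return; // last page reached
  }
}
```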
If you find that individual users are frequently requesting the data, you could look at persisting it in the browser itself with something like GroundDB.
For the “changeable” part of the data, if you’re caching the main body of data in the browser, you could add a lastUpdated parameter to your Meteor method. You’d also persist this value in the browser. Each time your app loads, you can ask the server for only the data that’s changed since the last time you synced.
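One possible shape for that, assuming the server stamps each record with an updatedAt field (the method name and localStorage key are made up for the example):

```js
// server
Meteor.methods({
  'stars.changedSince'(lastUpdated) {
    check(lastUpdated, Date);
    return Stars.find({ updatedAt: { $gt: lastUpdated } }).fetch();
  },
});

// client: persist the last sync time and only fetch newer records
const KEY = 'stars.lastSynced';
const last = new Date(Number(localStorage.getItem(KEY) || 0));

Meteor.call('stars.changedSince', last, (err, changed) => {
  if (err) return console.error(err);
  changed.forEach(({ _id, ...fields }) => {
    // merge changed records into the client-only collection
    ClientStars.upsert(_id, { $set: fields });
  });
  localStorage.setItem(KEY, String(Date.now()));
});
```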
I’m glad you’ve got it working better, but give client-side Minimongo a try (a null-named collection). I think most of the overhead you’ve been seeing is caused by the pub/sub mechanism (tracking which connections have each document, and which fields they have) - I don’t think Minimongo is the problem.
Well, I’ve run into an unexpected error, and may need to change my strategy again…
As mentioned, I am now converting a flatfile to objects, then putting all the objects into a single array and writing it as a single record to the DB. This works for my smaller flatfiles, and makes pub/sub nearly instant.
But I have one flatfile of ~1000 records, and using this process, the call MyCollection.insert({ data: myArrayOfRecords }) causes an error in the oplog:
BulkWriteError: write to oplog failed: BadValue: object to insert exceeds cappedMaxSize
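One workaround I may try is splitting the array across several documents so that no single insert exceeds the document size cap, something like this (the chunk size is a guess I’d tune for my record size):

```js
const CHUNK = 10000; // records per document (hypothetical)
for (let i = 0; i < myArrayOfRecords.length; i += CHUNK) {
  // Each document carries one slice of the array, keeping it under the cap.
  MyCollection.insert({
    chunk: i / CHUNK,
    data: myArrayOfRecords.slice(i, i + CHUNK),
  });
}
```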
Wouldn’t it be viable to do the sorting and filtering server-side?
Loading 132 thousand records on the client seems pretty heavy to me. Even if you find the fastest way to transfer the files from the server to the client, the client will need a lot of power/memory to do the filtering itself.
Imagine if, when you visited Amazon, you received all 132,000 items in the store and your machine had to do all the heavy lifting.
Right now it loads fast, probably because you are on a development setup and the data doesn’t have to travel over the internet.
Depending on the data types and what exactly needs to be aggregated you might want to use solutions made for just that, like Spark maybe?
I’m trying to avoid going back to the server as much as possible, as this would cause delays in my simulations. Effectively, I don’t want to go back to the server at all after the first display is rendered. Loading a few large tables this way is similar to loading a few large images at startup - a few seconds.
What I have found effective though, is to deliver the data from the server as objects, not arrays, where possible (the table with 132,000 items is now one object with 132,000 attributes - each attribute being its own object). Clients are pretty fast at accessing such large objects.
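Roughly, the conversion looks like this (the field names and lookup key are just examples):

```js
// Build one big object keyed by record id instead of an array.
const starsById = {};
starRecords.forEach((rec) => {
  starsById[rec.catId] = rec; // each attribute is one record object
});

// Client lookups and scans are then plain property access:
const sirius = starsById['HD48915'];
Object.values(starsById).forEach((star) => {
  // apply filter / merge criteria on each animation pass
});
```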
Filtering is mostly a UI function, so it need not be as fast as the actual animation of the filtered data.