Should I be concerned about tons of tiny MongoDB writes?

ffxsam · November 1, 2016, 8:18pm

I’ve just launched a service that includes a metrics system, which records a listener’s interactions with HTML audio files at a fine-grained level: when an audio file is played, stopped, skipped (seek), etc. So far, performance-wise, things are good, but of course I just launched.

I’m concerned that when people share their music publicly with thousands of people, and there are maybe (thinking realistically here) 500-1000 concurrent users hitting play & seek—will this be a problem? My app is on Galaxy and the MongoDB is hosted by mLab. Are there specific performance metrics I should look at that would indicate there are too many successive DB writes?

nlammertyn · November 1, 2016, 9:25pm

Why not cache them and periodically (every 15s or every 1m or so) send the data to be batch inserted/updated? Writing every event when it comes through is going to break things really quickly.

If they’re essentially analytics, I’d advise you to create a separate collection for that kind of data and not embed it in the audio file document (it could quickly grow to thousands of events, and you’d see slowdowns when working with the data). Ideally, I think you might use a bucket model for the data: https://docs.mongodb.com/ecosystem/use-cases/storing-comments/#hybrid-schema-design

Essentially, you store the events for an audio file as an array inside a document in an “audio events” collection.
The document has the id of the audio file, an integer bucket number (its own sequence count), and a count of the number of events in its array. Optionally, you could add a timestamp field for the first event and one for the last, which would make querying faster in the future.
You just keep appending data to that document’s event array until you reach a certain number, could be 100, could be 10000. At that point you insert a new document with the bucket number +1 and keep appending data there.

If you work with batches of events you can optimize it a lot. Quick example:

The client has cached 50 events during 1m on audio file xyz123. Your bucket size limit is 100. On the server you receive the array of events. You get the bucket document for that audio file with the highest bucket id and only ask for the event count field. It says the document has 65 events in it. Now you only need to do 2 db operations (in many cases you’d only need 1, but it’s always going to be better than to do 50 operations separately):

Update that last bucket document with the first 35 events in your array you got from the client.
Insert a new bucket document with an incremented bucket id and put the remaining events (15 of the 50 in this case) in its events array.
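To make the split above concrete, here is a minimal sketch of the planning step only, with no database calls. The names (`AudioEvent`, `BucketWrite`, `planBucketWrites`) and the field layout are assumptions for illustration, not part of any real schema or library.

```typescript
// Hypothetical shapes; field names are assumptions for illustration.
interface AudioEvent {
  type: "play" | "stop" | "seek";
  at: number; // epoch milliseconds
}

interface BucketWrite {
  bucketId: number;     // which bucket document to touch
  isNew: boolean;       // true = insert a new bucket, false = append to the last one
  events: AudioEvent[]; // events destined for that bucket
}

// Split a batch across the last (partially filled) bucket and as many new
// buckets as needed, so no bucket ends up with more than `bucketLimit` events.
function planBucketWrites(
  lastBucketId: number,
  lastBucketCount: number,
  incoming: AudioEvent[],
  bucketLimit: number
): BucketWrite[] {
  if (bucketLimit < 1) {
    throw new Error("bucketLimit must be at least 1");
  }
  const writes: BucketWrite[] = [];
  let offset = 0;

  // Fill whatever room is left in the existing bucket first.
  const room = Math.max(0, bucketLimit - lastBucketCount);
  if (room > 0 && incoming.length > 0) {
    const head = incoming.slice(0, room);
    writes.push({ bucketId: lastBucketId, isNew: false, events: head });
    offset = head.length;
  }

  // Everything that did not fit goes into freshly numbered buckets.
  let nextId = lastBucketId + 1;
  while (offset < incoming.length) {
    const chunk = incoming.slice(offset, offset + bucketLimit);
    writes.push({ bucketId: nextId, isNew: true, events: chunk });
    nextId += 1;
    offset += chunk.length;
  }
  return writes;
}

// The example from the post: 65 events already stored, 50 arriving, limit 100.
const batch: AudioEvent[] = Array.from({ length: 50 }, (_, i) => ({
  type: "seek" as const,
  at: 1478030000000 + i * 1000,
}));
const plan = planBucketWrites(3, 65, batch, 100);
console.log(plan.map((w) => [w.bucketId, w.isNew, w.events.length]));
// [ [ 3, false, 35 ], [ 4, true, 15 ] ]
```

Each planned write would then map to one database operation: an update that appends to the existing bucket, or an insert of a new bucket document.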

Of course you could also do the caching server-side instead of client-side. In that case you keep a global variable on the server that holds all pending events, and have a method that the client calls to send each event; the server appends it to the global variable. As before, you process that data periodically, every minute or so.

That’s how I’d approach this kind of tiny data write to make it more scalable and reduce the likelihood of performance problems at the database layer.

efrancis · November 1, 2016, 9:29pm

do you have a Bulletproof Meteor account? they have a great course on high-velocity time-series data like you describe, it’s worth a go. the big things that might help you are

keep that data in a separate DB. it could be another MongoDB database on the same instance, but put it in a different DB so Meteor isn’t processing its oplog at all
instead of having a data model that keeps each event in its own document, store them in large documents, for example one document per hour with sub-properties for each event, so you’re updating that one document each time instead of inserting a new doc for each event. updates are faster and require less disk space. you can optimize it further by updating a running sum of plays in a totalPlays field on each write, so the total is only computed when the data changes and you don’t waste CPU recalculating it later
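As a rough sketch of the hourly-document idea above, the function below only builds the filter and update documents for an upsert; it does not talk to a database. The field names (`audioId`, `hour`, `events`, `totalPlays`) and the helpers are assumptions, and events are pushed onto an array here as a simplification of "sub-properties". `$push`, `$inc`, and the `upsert` option are standard MongoDB update features.

```typescript
// Hypothetical event shape; names are assumptions for illustration.
interface PlayEvent {
  audioId: string;
  type: "play" | "stop" | "seek";
  at: Date;
}

const HOUR_MS = 60 * 60 * 1000;

// Truncate a timestamp to the start of its UTC hour.
function hourStart(at: Date): Date {
  return new Date(Math.floor(at.getTime() / HOUR_MS) * HOUR_MS);
}

// Build the pieces of an upsert that appends the event to the hourly document
// and keeps a running totalPlays counter up to date in the same write.
function buildHourlyUpsert(ev: PlayEvent) {
  return {
    filter: { audioId: ev.audioId, hour: hourStart(ev.at) },
    update: {
      $push: { events: { type: ev.type, at: ev.at } },
      // Only "play" events bump the counter; other events increment by 0.
      $inc: { totalPlays: ev.type === "play" ? 1 : 0 },
    },
    options: { upsert: true },
  };
}

const op = buildHourlyUpsert({
  audioId: "xyz123",
  type: "play",
  at: new Date("2016-11-01T20:18:30Z"),
});
console.log(op.filter.hour.toISOString()); // 2016-11-01T20:00:00.000Z
console.log(op.update.$inc.totalPlays);    // 1
```

With the Node driver, these three pieces would typically be passed to something like `collection.updateOne(filter, update, options)`; because of the upsert, the first event of each hour creates the document.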

personally I would probably consider storing it in an in-memory solution like Redis, and periodically dumping it to the db every minute or so

ffxsam · November 1, 2016, 9:32pm

Thanks for the replies! I’m going to go through them in detail when I have some time, but I quickly just want to add that I have no idea how long the client will stick around. In other words, they could show up, hit play, seek, and close their tab in less than 5 seconds. So I have to be able to store their activity immediately in a more permanent source. A global var on the server makes sense to me.

I have no experience with Redis and am not sure how that would tie into my current stack, and where that would sit in relation to MongoDB, and what advantages it offers.

I’ll reply at length to the above posts soon. Thanks again!

nlammertyn · November 1, 2016, 9:52pm

Redis would indeed be a possible addition to the aforementioned solution. Redis is a high-performance db that lives entirely in RAM, so essentially it’s a queryable key-value store that’s extremely fast. If it’s only events I think it’s a bit over the top (a global var should suffice in the beginning), but it could be a possible solution when scaling to very large numbers. You’d store all the events in Redis and have a separate microservice process the data and store it as efficiently as possible in MongoDB or something else.

ffxsam · November 1, 2016, 10:52pm

Ok, I’ll brain dump this back out to make sure I understand everything correctly:

User invokes an event of some kind (play, stop, seek)
Meteor method call sends details of this event to the server
Server side temporarily caches this information in a plain old JS array with other cached events
Once JS array reaches a certain length, OR after a time interval (60 seconds) batch insert the events into MongoDB
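The four steps above could be sketched roughly like this. `EventBuffer` and its method names are made up for illustration, and the flush callback is synchronous here to keep the example self-contained; in a real app it would be whatever does the batch insert.

```typescript
// Callback that receives a full batch; in practice this would do the batch insert.
type FlushFn<T> = (batch: T[]) => void;

// In-memory buffer that flushes when it reaches maxSize or when a timer fires.
class EventBuffer<T> {
  private pending: T[] = [];
  private timer: ReturnType<typeof setInterval> | null = null;

  constructor(
    private readonly maxSize: number,
    private readonly flushFn: FlushFn<T>
  ) {}

  // Called once per incoming event (e.g. from the method the client invokes).
  add(event: T): void {
    this.pending.push(event);
    if (this.pending.length >= this.maxSize) {
      this.flush();
    }
  }

  // Hand the current batch to flushFn and start a fresh array.
  flush(): void {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.flushFn(batch);
  }

  // Flush on a fixed interval, e.g. start(60000) for every 60 seconds.
  start(intervalMs: number): void {
    if (this.timer === null) {
      this.timer = setInterval(() => this.flush(), intervalMs);
    }
  }

  // Stop the timer and flush whatever is left (e.g. on shutdown).
  stop(): void {
    if (this.timer !== null) {
      clearInterval(this.timer);
      this.timer = null;
    }
    this.flush();
  }
}

// Usage: flush after 2 events or every 60 seconds, whichever comes first.
const flushed: string[][] = [];
const buffer = new EventBuffer<string>(2, (b) => { flushed.push(b); });
buffer.start(60000);
buffer.add("play");
buffer.add("seek"); // reaches maxSize, so this triggers a flush
buffer.add("stop");
buffer.stop();      // flushes the remaining event and clears the timer
console.log(flushed); // [ [ 'play', 'seek' ], [ 'stop' ] ]
```

One trade-off to keep in mind: anything still sitting in the array is lost if the server process dies before the next flush.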

Now, the bucket part I’m not fully grasping. I’m assuming it’s a performance move so that audioEvents doesn’t wind up being a collection with 2 million documents in it, yes? So you group the events into an array that’s stored in a single document (bucket), and get as close to the 16MB storage limit as possible before creating a new document/bucket inside audioEvents. Do I have that right?

vigorwebsolutions · November 1, 2016, 11:04pm

Just another quick thought: if you’re already planning on keeping a local (non-db) cache of pending documents, you can use the batch insert functionality of Mongo, either with the raw collection or a package like mikowals:batch-insert.

We started using this package to import client data, as importing a contact list of 5k+ was having a noticeable impact on performance. Went from ~10ms per insert to ~100ms per thousand inserts when batched.

ffxsam · November 1, 2016, 11:05pm

Yep, already aware of mikowals:batch-insert! I could also do it myself via rawCollection, but then it uses MongoDB native ObjectID for the _id field… which I guess isn’t a bad thing (especially since it has a timestamp built into it). mikowals:batch-insert converts the ID field to a Meteor type ID.
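On the timestamp built into a native ObjectID: the first 4 bytes (8 hex characters) are the creation time in seconds since the Unix epoch, so it can be recovered from the hex string alone. A small sketch, where `objectIdToDate` is just a made-up helper name (the driver's ObjectID type also exposes a `getTimestamp()` method for the same purpose):

```typescript
// Extract the creation time embedded in a 24-character hex ObjectID string.
function objectIdToDate(hexId: string): Date {
  if (!/^[0-9a-fA-F]{24}$/.test(hexId)) {
    throw new Error("expected a 24-character hex ObjectID string");
  }
  // The first 8 hex characters are a big-endian count of seconds since the epoch.
  const seconds = parseInt(hexId.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

console.log(objectIdToDate("507f1f77bcf86cd799439011").toISOString());
// 2012-10-17T21:13:27.000Z
```

The resolution is one second, so it is fine for rough "when was this inserted" questions but not a substitute for a per-event timestamp.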

nlammertyn · November 1, 2016, 11:30pm

Exactly, although it doesn’t have to do with any size limit. Creating documents with batches of data instead of millions of documents will increase querying speed a lot. The other extreme is putting all events from one audio file in a single document, but then you end up with a document that might exceed the 16MB document size limit and that will take too long to read if you only need a certain time range.
Querying is the reason why it’s a good idea to include the first and last event timestamp as well, makes it easy to get all the events in a time range.
IMO, for analytic purposes the limit on one ‘bucket’ is probably better set a bit higher, something like 1000 events or similar.
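To illustrate why the first and last event timestamps help, here is a small sketch of a time-range lookup over buckets. The field names (`firstAt`, `lastAt`) and helpers are assumptions; `$lte` and `$gte` are standard MongoDB query operators, and the in-memory `overlaps` check mirrors what that filter would match.

```typescript
// Hypothetical bucket metadata; timestamps are epoch milliseconds.
interface BucketMeta {
  audioId: string;
  bucketId: number;
  firstAt: number; // timestamp of the first event in the bucket
  lastAt: number;  // timestamp of the last event in the bucket
}

// Query filter matching every bucket whose time span overlaps [from, to].
function bucketRangeFilter(audioId: string, from: number, to: number) {
  return { audioId, firstAt: { $lte: to }, lastAt: { $gte: from } };
}

// Same overlap rule, applied in memory.
function overlaps(b: BucketMeta, from: number, to: number): boolean {
  return b.firstAt <= to && b.lastAt >= from;
}

const buckets: BucketMeta[] = [
  { audioId: "xyz123", bucketId: 1, firstAt: 1000, lastAt: 1999 },
  { audioId: "xyz123", bucketId: 2, firstAt: 2000, lastAt: 2999 },
  { audioId: "xyz123", bucketId: 3, firstAt: 3000, lastAt: 3999 },
];
const hits = buckets
  .filter((b) => overlaps(b, 2500, 3200))
  .map((b) => b.bucketId);
console.log(hits); // [ 2, 3 ]
```

Only the matching buckets need to be loaded; events inside the first and last matching bucket still have to be filtered individually, since a bucket can straddle the edge of the range.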

ffxsam · November 2, 2016, 8:49pm

Thanks again, Nick! I learned a few handy things from this exchange and will be implementing them ASAP.