How much storage does MongoDB allocate per document?

msavin · December 15, 2020, 9:48am

Hey folks, having trouble getting a clear answer for this even from the MongoDB support team.

Does anyone know if MongoDB allocates a specific amount of storage per document, regardless of how small it might be?

I’m thinking of storing some activity logs there for convenience, but if each one takes up a minimum of 4 KB, it can get expensive quickly!

robfallows · December 15, 2020, 10:14am

Chapter and verse is here: http://mongodb.github.io/node-mongodb-native/schema/chapter3/

When MongoDB stored the original document, it added a bit of empty space at the end of the document, referred to as padding. The reason for this padding is that MongoDB expects the document to grow in size over time. As long as this document growth stays inside the additional padding space, MongoDB does not need to move the document to a new, bigger space, thus avoiding the cost of copying bytes around in memory and on disk.

peterfkruger · December 15, 2020, 10:21am

I’m not sure if MongoDB is the right tool to persist log entries, especially if we’re talking about large amounts.

Apache Kafka is designed to process logs as streams, so as long as the logs are processed or reprocessed sequentially, that would be the weapon of choice. We use Kafka for distributed, scalable log processing, and are very happy with it.

If you don’t plan to reprocess the logs in the same order as the entries were initially created, but want to be able to explore, analyze, search or filter them, Elasticsearch comes to mind.

msavin · December 15, 2020, 10:29am

Agree on logging - just thinking more generally in terms of storing data on individual documents vs as sub-documents…

peterfkruger · December 15, 2020, 10:53am

I would create a collection per log type and a document for each entry, and accept the padding as a given. I would carefully consider the volume of entries though. MongoDB can take huge amounts of data, but that still isn’t its forte.

msavin · December 15, 2020, 12:44pm

IMO ignoring the padding will cost too much, especially in an application where keeping history is important.

You can easily store 100 entries per document, and then “shard” those documents using some simple upsert logic. You could be saving as much as 100x on the storage cost for those entries.

For an activity or notification log, you could easily shard it by day and then X entries. It would probably be easier on pub/sub as well.
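A minimal sketch of what that upsert logic could look like. The helper name, the bucket size constant, and the field names (userId, day, count, entries) are all made up for illustration; only the update operators and the upsert option are standard MongoDB.

```javascript
// Hypothetical names: BUCKET_SIZE, buildBucketUpsert, and the document fields
// (userId, day, count, entries) are illustrative, not from any library.
const BUCKET_SIZE = 100;

// Builds the arguments for an updateOne(filter, update, options) call.
function buildBucketUpsert(userId, entry, now = new Date()) {
  const day = now.toISOString().slice(0, 10); // UTC date, e.g. "2020-12-15"
  return {
    // Match a bucket for this user and day that still has room.
    filter: { userId, day, count: { $lt: BUCKET_SIZE } },
    update: {
      $push: { entries: entry }, // append the entry to the bucket
      $inc: { count: 1 },        // track how many entries the bucket holds
    },
    // If no bucket with room exists, the upsert creates a new one.
    options: { upsert: true },
  };
}

// Usage with the Node.js driver (not executed here; "activity" is a placeholder):
// const { filter, update, options } = buildBucketUpsert("u1", { type: "login" });
// await db.collection("activity").updateOne(filter, update, options);
```

Once a bucket reaches 100 entries the filter stops matching it, so the next write starts a fresh document for the same user and day.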

peterfkruger · December 15, 2020, 3:52pm

Speaking of the single log entry per document idea, you can’t know for sure whether padding will cost you too much. The documentation mentions “a bit of empty space at the end of the document”. The size of your log entries is also a factor. So ultimately, if your log entries are very small, and the “bit” of padding turns out to be not so small (which appears to be a potentially version-dependent implementation detail in MongoDB), then yes, padding can be unreasonably expensive. But if your log entries are rather large, and the “bit of empty space” is in fact really small (which we don’t know), the padding won’t bother you.

Sure, you can insert batches of log entries in the same document. It would reduce the loss caused by padding, but only as long as you manage to write each batch document in one go. Otherwise, i.e. in the case of upserts, you’ll unavoidably fragment that collection, which will exacerbate the padding problem rather than reduce it.

There are however potentially other complications with the batches of log entries per document if you try to do inserts only (no upserts):

Obviously you would have to cache the log entries somewhere until your batch is deemed complete.
If your app needs to be scalable, this log-batch cache needs to be accessible from any number of servers.
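
To make the first point concrete, here is a rough sketch of such a cache. createLogBatcher and its parameters are hypothetical names. It only buffers inside a single process, so it does not address the second point; for multiple servers the buffer would have to live in some shared store instead.

```javascript
// Hypothetical helper: createLogBatcher, batchSize and flush are illustrative.
// flush receives a full batch, e.g. to insert { entries: batch } as one document.
function createLogBatcher(batchSize, flush) {
  let pending = [];

  // Hand the buffered entries to flush and start a new empty buffer.
  function emit() {
    const batch = pending;
    pending = [];
    flush(batch);
  }

  return {
    // Buffer an entry; write the batch out once it is complete.
    add(entry) {
      pending.push(entry);
      if (pending.length >= batchSize) emit();
    },
    // Write out an incomplete batch, e.g. on shutdown.
    drain() {
      if (pending.length > 0) emit();
    },
  };
}
```

Anything still sitting in the buffer is lost if the process crashes before drain() runs, which is one more cost of the insert-only approach.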

I still think that MongoDB is not an ideal storage for logs. The longer I think about it, the worse it seems. I would not do it. Kafka / Elasticsearch / Splunk / or maybe a shared redis instance is what I would go for, if I were you.

msavin · December 15, 2020, 4:02pm

A lot of it depends on what size this padding is. If it’s a few kilobytes, it’s still “small”, but quickly adds up in gigabytes.

I believe Linux allocates 4kb per file by default… I don’t know if Mongo stores each document as a file, but if it does, it’s quite a lot as most documents are in the byte range.
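
To put rough numbers on it, a back-of-the-envelope sketch. The 4096-byte minimum is purely the hypothetical from above, not a documented MongoDB value, and the 200-byte average document size and the function name are made up as well.

```javascript
// Assumption: every document occupies at least minBytes on disk.
// The 4096-byte minimum is hypothetical, not a documented MongoDB value.
function estimateStorageBytes(docCount, avgDocBytes, minBytes) {
  return docCount * Math.max(avgDocBytes, minBytes);
}

const docs = 1e7; // ten million small documents of ~200 bytes each
const packed = estimateStorageBytes(docs, 200, 0);    // 2,000,000,000 bytes (2 GB)
const padded = estimateStorageBytes(docs, 200, 4096); // 40,960,000,000 bytes (~41 GB)
console.log(packed, padded, padded / packed);         // the ratio is 20.48
```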

As for the focus on logging, I wouldn’t get too hung up on that when it comes to this problem. If you build a feature like a notification system, it can quickly grow to be one of your largest collections.

Now for the conspiracy theory: MongoDB definitely has a business interest in using storage generously.

peterfkruger · December 15, 2020, 4:41pm

No, it doesn’t. Each collection is stored as two files, one for content, one for the index data.

macrozone · December 15, 2020, 4:50pm

Also keep in mind that WiredTiger does compression.

You could run a test: store a few thousand docs and check the size of the collection. I think you can see both the compressed and the uncompressed size.
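
A small sketch of how the result of such a test could be read. count, size, avgObjSize and storageSize are fields of the collStats command output; summarizeCollStats is a made-up helper name and "logs" is a placeholder collection name.

```javascript
// Pure helper: turns a collStats result into per-document figures.
// summarizeCollStats is a hypothetical name; the stats fields are from collStats.
function summarizeCollStats(stats) {
  return {
    documents: stats.count,
    avgDocBytes: stats.avgObjSize,   // average uncompressed document size
    uncompressedBytes: stats.size,   // total uncompressed data size
    onDiskBytes: stats.storageSize,  // space allocated on disk, after compression
    onDiskBytesPerDoc: stats.count > 0 ? stats.storageSize / stats.count : 0,
  };
}

// With the Node.js driver (not executed here):
// const stats = await db.command({ collStats: "logs" });
// console.log(summarizeCollStats(stats));
```

Comparing onDiskBytesPerDoc with avgDocBytes on a few thousand real entries should show directly whether there is any meaningful per-document overhead.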

eleventy · December 21, 2020, 4:09pm

According to this site, the WiredTiger engine no longer uses padding. So you should be fine storing log entries as single docs.

I do the same, using ostrio:logger, to put everything in Mongo. But I do make it a capped collection:

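// Server side only. numBytes is the maximum collection size in bytes, maxDocuments the maximum document count.
coll = new Meteor.Collection("myCollection");
coll._createCappedCollection(numBytes, maxDocuments);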