Meteor + MongoDB for large-scale data

I am going to build a very large application, and I need to decide on the architecture and which technologies to use.

  • The current size of the SQL database is 3.1 TB.
  • I will be receiving around 1000 records per minute.
  • Yearly detailed statistics reports will be generated on dashboards.
  • Around 10k concurrent users.

I personally specialize in these technologies:

  • Meteor
  • MongoDB
  • Node.js
  • PHP
  • MySQL

What do you recommend for this sort of project? Which technology should I choose, and what should the architecture look like?

Thanks in advance.

What will the usual access patterns be? For example:

  1. How long will you retain data?
  2. Will access generally be local in time - you mention annual reports; will it always (or often) be for the most recent year?
  3. Are there distinct sets of data, e.g., will some subset of users only have access to some subset of the data?
  4. How large will the individual records be and how many total records do you have currently?

The answers to these questions will inform the remainder of the conversation, but in general Mongo will probably serve you well, though you may need to look into a sharded cluster. Meteor works just fine on a sharded collection, but the standard oplog won’t; it’s best to use redis-oplog, or no oplog at all.

I’d say you almost certainly want the data you describe to be in either a separate cluster, or at minimum in a separate set of shards from your “app” data (e.g., users) as the access patterns will likely be massively different. It might even be worth separating out the access of the data to an entirely separate set of servers - that way your UI is always snappy even if users are hammering the data.


Thanks for the response!

Actually, the data is based on tenants!
Each tenant has its own data.
I have around 2 million records for each tenant.
The data that is assigned to a specific tenant cannot be accessed by any other tenant.
There are no distinct datasets.
The data can be accessed for any year that has been selected to generate reports.
Continuous rankings and variance calculations will be performed on those datasets.

There are no distinct datasets.

and

Each tenant has its own data.

seem to conflict? If each tenant has its own dataset, then you do have distinct datasets - meaning you never need to query across the entire collection of records? The reason this is important is that sharding works best in circumstances like this - particularly if your access patterns don’t exhibit time-based locality.

You can look into mongo’s sharding further, but essentially you’d probably want to shard on whatever your tenancyID is, then ensure you always use that in your query - data for different tenancies will be distributed at random across the replica sets. It’s quite a bit more complicated than this (with 2 million records per tenancy you’ll almost certainly want to shard on tenancyID + something), but this gives you a start for sure.
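A rough sketch of what that could look like in mongosh (the "appdb", "records", "tenantId" and "createdAt" names are just placeholders for whatever you actually use):

```js
// Enable sharding for the database, then shard the collection on a compound
// key so one tenant's documents are split further by time instead of all
// landing in a single chunk.
sh.enableSharding("appdb");

// The shard key needs a supporting index if the collection already has data.
db.getSiblingDB("appdb").records.createIndex({ tenantId: 1, createdAt: 1 });

sh.shardCollection("appdb.records", { tenantId: 1, createdAt: 1 });

// Always include the tenant in your queries so the router can target shards:
db.getSiblingDB("appdb").records.find({
  tenantId: "org-a",
  createdAt: { $gte: ISODate("2023-01-01"), $lt: ISODate("2024-01-01") },
});
```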

The data can be accessed for any year that has been selected to generate reports.

Will the user always select a specific year? Or can queries span multiple years - if multiple, will they always be adjacent? Depending on your answers to these questions, year or possibly date/time might make a good component to your shard key.

Atlas’ hosting gets pretty expensive as you move towards sharded clusters, but it’s fairly easy to self-host mongo in AWS (and presumably other cloud providers too).

Continuous rankings and variance calculations will be performed on those datasets.

How continuous? On every update, or hourly/daily? The former could get quite cumbersome. Depending on how many groupings of rankings you need (total, hourly, daily, etc.), it can make sense to store the additional data necessary so you can just apply a delta rather than re-querying all the data - e.g., instead of storing just the average, store the average + count. This way you can compute a weighted average using just those two values for the old set and the new set.
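To make the delta idea concrete, here’s a minimal sketch (plain Node, made-up data shapes): keeping count, sum and sum-of-squares per grouping is enough to update both the mean and the variance from each new batch without re-reading history.

```js
// Per-group running stats stored alongside the reports, e.g.
// { count, sum, sumSquares }. Merging a new batch only touches these numbers.
function applyBatch(stats, values) {
  const next = { ...stats };
  for (const v of values) {
    next.count += 1;
    next.sum += v;
    next.sumSquares += v * v;
  }
  return next;
}

function summarize({ count, sum, sumSquares }) {
  const mean = sum / count;
  // Population variance: E[x^2] - (E[x])^2
  const variance = sumSquares / count - mean * mean;
  return { mean, variance };
}

// Usage: load the stored stats, fold in the latest minute of records, save.
let stats = { count: 0, sum: 0, sumSquares: 0 };
stats = applyBatch(stats, [10, 12, 9, 11]);
console.log(summarize(stats)); // { mean: 10.5, variance: 1.25 }
```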

One last question: what’s the distribution of your updates? You mentioned around 1k per minute; will those be spread across all your tenancies, or will you have some “hot” tenancies?


Actually, tenant here means organization.
E.g.:
Organization A
Organization B
Organization C

Now the logic for all the organizations is exactly the same, but each has its own unique dataset.
They will each receive around 1k records per minute, as per the business logic.

For example:

  • We have weather data that is pinged every minute.
    Or
  • We have some stock exchange metric values received every minute.

So we inform every user of the relevant weather report, or of specific stocks and their variance; we sort them by ranking, etc., and send them to the user. A user can then dive deeper into the details…
For now, we will send that data to users on an hourly basis.

Please let me know if I can help you with more info.
Thanks

Mongo will perform well if you use insertMany, not updateMany, so bear that in mind. It’s best to do bulk inserts or removes; updates are slow as hell. If you have to use update, do it in a bulk operation while the server is idle, at like 1am or whenever the low-usage time is. Inserting 1000 a minute will be no problem - you can actually do several orders of magnitude more - but updating will cause high load and require a dedicated db server with around 32GB RAM and 8 to 12 cores to keep up.
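For what it’s worth, a minimal sketch of batched inserts with the Node MongoDB driver (the connection string, database and collection names are placeholders):

```js
const { MongoClient } = require("mongodb");

const client = new MongoClient("mongodb://localhost:27017");

async function flushBatch(docs) {
  // One round trip for the whole batch; ordered: false lets Mongo keep going
  // past an individual failure instead of aborting the rest of the batch.
  await client
    .db("appdb")
    .collection("records")
    .insertMany(docs, { ordered: false });
}

async function main() {
  await client.connect();
  // e.g. buffer the ~1000 records that arrive each minute, then flush together
  await flushBatch([
    { tenantId: "org-a", value: 42, createdAt: new Date() },
    { tenantId: "org-b", value: 17, createdAt: new Date() },
  ]);
  await client.close();
}

main().catch(console.error);
```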

It will be easy to build in Meteor; I recommend you start from a starter app. Using React is my preference, but each to their own. React means you can update in real time, which may be very nice for a stats app where data updates in the browser without a refresh. I wish you the best of luck with your project.


Seems like a good use case for existing tools like Kibana.


To get a good answer, it’s important to specify how many of your connections are reads and how many are writes. We used to have a tracking system and saved a lot of processing power by only allowing reads of delayed data.

If you need concurrent reads and writes to the same data with live updates, then use SQL if you have a strict data structure and NoSQL if you need more flexibility, I’d say.
You’d just need a server/cloud service strong enough to handle the load, and you’d definitely need a good database person for the fine-tuning.

Or better yet, consider some other databases or tools built specifically for tasks like these, for example…

I also had similar issues and persisted with Mongo, as benchmarking showed it slightly outperforming MySQL (I’ve been a big MySQL fan for over a decade but wanted to try something new). My app’s Mongo is currently handling a dataset of about 4.5 million records per hour because I use bulk operations and insertMany, and no longer use update or upsert. There was a significant performance difference. Bulk ops, insertMany and deleteMany are very fast when using an indexed field for the find in the query (e.g. deleteMany where date is older than x minutes).

The bulk command is also very easy to write in an iterative method, as it’s just JSON; that part was kind of cool to see.
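Roughly what that looks like with the Node driver (the names and the 60-minute cutoff are just examples, and it assumes an index on createdAt):

```js
// `db` is a connected Db handle from the Node MongoDB driver.
// Assumes an index created once with:
//   db.collection("records").createIndex({ createdAt: 1 })

// Build the bulk operation iteratively - it's just an array of plain objects.
function buildInsertOps(rows) {
  const ops = [];
  for (const row of rows) {
    ops.push({ insertOne: { document: { ...row, createdAt: new Date() } } });
  }
  return ops;
}

async function ingestAndPrune(db, rows) {
  const records = db.collection("records");

  // One bulkWrite instead of N separate inserts.
  await records.bulkWrite(buildInsertOps(rows), { ordered: false });

  // Fast because the filter hits the createdAt index: drop anything older
  // than 60 minutes.
  const cutoff = new Date(Date.now() - 60 * 60 * 1000);
  await records.deleteMany({ createdAt: { $lt: cutoff } });
}
```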

So what would you recommend if I need to update multiple records?

When I need to update multiple records, I use a bulk of 100 entries doing single .update({data}) calls with {upsert: true} set. That does spike CPU while it processes, but it’s OK if you have an 8 or 16 core machine, as it’s only momentary while processing. You can also space out these bulks and allow one to finish before starting the next; it seems able to handle this. For a higher load you’ll need to implement a queue with Kafka or beanstalkd so that it handles updates sequentially, one by one, instead of clobbering the mongod process.
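A sketch of what those spaced-out upsert batches could look like with the Node driver’s bulkWrite (the batch size and document shape are arbitrary):

```js
// `records` is a connected driver collection; `updates` is an array like
// [{ key: { tenantId, metricId }, data: { value, updatedAt } }, ...].
async function upsertInBatches(records, updates, batchSize = 100) {
  for (let i = 0; i < updates.length; i += batchSize) {
    const batch = updates.slice(i, i + batchSize).map(({ key, data }) => ({
      updateOne: {
        filter: key, // should hit an index so each upsert stays cheap
        update: { $set: data },
        upsert: true,
      },
    }));

    // Awaiting each batch lets one finish before the next starts, instead of
    // firing everything at the mongod process at once.
    await records.bulkWrite(batch, { ordered: false });
  }
}
```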