There are no distinct datasets.
and
Each tenant has its own data.
seem to conflict? If each tenant has it’s own dataset, then you do have distinct datasets - meaning you never need to query across the entire collection of records? The reason this is important is that sharding works best in circumstances like this - particularly if your access patterns dont exhibit timing based locality.
You can look into mongo’s sharding further, but essentially you’d probably want to shard on whatever your tenancyID is, then ensure you always use that in your query - data for different tenancies will be distributed at random across the replica sets - its quite a bit more complicated than this (with 2 million records per tenancy you’ll almost certainly want to shard on tenancyID + something) but this gives you a start for sure.
The data can be access for any year that has been selected to generate reports.
Will the user always select a specific year? Or can queries span multiple years - if multiple, will they always be adjacent? Depending on your answers to these questions, year or possibly date/time might make a good component to your shard key.
Atlas’ hosting gets pretty expensive as you move towards sharded clusters, but it’s fairly easy to self-host mongo in AWS (and presumably other cloud providers too).
Continuous rankings and variance calculations will be performed on that datasets
How continuous? On every update, or hourly/daily? The former could get quite cumbersome - depending on how many groupings of rankings you need (total, hourly, daily, etc) - it can make sense to store additional data necessary so you can just apply a delta, rather than re-querying all the data, e.g., instead of storing just the average, store the average + count. This way you can compute a weighted average using just those two values for the old set, and the new set.
One last question is what’s the distribution of your updates? You mentioned around 1k per minute, will those be across all your tenancies? Or will you have some “hot” tenancies?