Best way to model time series data for something like YouTube views?


#1

Hi, this is my first time designing time series data, such as play counts for videos.

For example, this data should allow a user to query for the number of plays per day and view it in a 30-day/month chart.

My question is: what is the optimal resolution I should save this data at? One doc per play event? One doc per day? One doc per month?

I think at the scale of YouTube, storing one doc per play would be way too much. So I currently have something like this:

{
    videoId: ObjectId(),
    month: "201508",
    days: {
        // counts for each day of August
        1: 0,
        2: 0,
        3: 0,
        // ...
        31: 0
    }
}

This makes it easy to query the data for one month and put it into a monthly chart, but what about querying a time range that spans two months, say 08/19 to 09/16?

Does anyone have any experience doing something like this before? Would love to have your input, thank you!


#2

It depends heavily on your data and on your scale. Since both can change, just keep track of the exact event. Don’t prematurely compress your data, because it could bite you down the road if you need higher resolution. If you are 100% sure that you won’t need a resolution finer than a day (e.g. likes per hour), then by all means compress.

It seems like you aren’t 100% sure, so just be safe and don’t compress your data.

In other words: instead of prematurely defining your bucket size (month, day, year, hour, etc.), you could keep the exact time (e.g. 1441573466851) at which each event occurs. This way, you can use any resolution you want down the road and are not constrained, and you can perform any calculation you want on the stored timestamps. For example, finding the number of likes between 1441573466851 and 1441580000000 becomes a trivial calculation, versus dealing with some day/month pattern that adds extra complications.
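
To make the raw-event idea concrete, here is a minimal sketch in plain JavaScript. It uses an in-memory array as a stand-in for a collection, and the names `events`, `ts`, `countBetween` and `bucketCounts` are made up for illustration, not taken from any library. It shows that a range count is a simple filter, and that the bucket size can be picked after the fact.

```javascript
// Hypothetical in-memory stand-in for a collection of raw play/like events.
// Each event keeps the exact time in milliseconds since the Unix epoch.
const events = [
  { videoId: "v1", ts: 1441573466851 },
  { videoId: "v1", ts: 1441575000000 },
  { videoId: "v1", ts: 1441580000001 },
  { videoId: "v2", ts: 1441574000000 },
];

// Count events for one video with start <= ts <= end.
function countBetween(events, videoId, start, end) {
  return events.filter(
    (e) => e.videoId === videoId && e.ts >= start && e.ts <= end
  ).length;
}

// Group the same raw events into buckets of any size, chosen at query time.
function bucketCounts(events, videoId, bucketMs) {
  const counts = new Map();
  for (const e of events) {
    if (e.videoId !== videoId) continue;
    const bucketStart = Math.floor(e.ts / bucketMs) * bucketMs;
    counts.set(bucketStart, (counts.get(bucketStart) || 0) + 1);
  }
  return counts;
}

const inRange = countBetween(events, "v1", 1441573466851, 1441580000000);
const hourly = bucketCounts(events, "v1", 60 * 60 * 1000);
```

With real data, the filter would become a range condition on an indexed timestamp field rather than a scan over an array.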

At scale, this would be overkill. But you can always compress your data when you start to achieve success. Once it’s been compressed, you lose the higher resolution information.

When in doubt - don’t throw away data.


#3

If I am 100% sure I won’t need any resolution finer than daily (this is only for the number of plays; likes and everything else are stored separately), should I do it in monthly buckets or one doc per video per day? Thanks for your help.


#4

For something like 08/19 to 09/16, why can’t you just query both 201508 and 201509, using something like $in? Sure, you end up with more data than you needed, but not much more. Or you could force the user to pick from certain intervals, which isn’t necessarily bad UX depending on your user base, as long as you provide a couple of different scale options.
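
A rough sketch of the two-bucket approach, in plain JavaScript with in-memory documents standing in for query results. The helper names `monthKeys` and `playsInRange` and the `"v1"` id are placeholders, and dates are treated as UTC for simplicity. The `filter` object is only the shape you would pass to a find(); nothing here talks to a database.

```javascript
// Month keys ("YYYYMM") covered by an inclusive date range, using UTC dates.
function monthKeys(start, end) {
  const keys = [];
  let y = start.getUTCFullYear();
  let m = start.getUTCMonth(); // 0-based month
  const endY = end.getUTCFullYear();
  const endM = end.getUTCMonth();
  while (y < endY || (y === endY && m <= endM)) {
    keys.push(`${y}${String(m + 1).padStart(2, "0")}`);
    m += 1;
    if (m === 12) {
      m = 0;
      y += 1;
    }
  }
  return keys;
}

// Sum the per-day counters that fall inside the range and ignore the rest.
function playsInRange(docs, start, end) {
  let total = 0;
  for (const doc of docs) {
    const year = Number(doc.month.slice(0, 4));
    const month = Number(doc.month.slice(4, 6)) - 1;
    for (const [day, count] of Object.entries(doc.days)) {
      const t = Date.UTC(year, month, Number(day));
      if (t >= start.getTime() && t <= end.getTime()) total += count;
    }
  }
  return total;
}

const start = new Date(Date.UTC(2015, 7, 19)); // 2015-08-19
const end = new Date(Date.UTC(2015, 8, 16)); // 2015-09-16

// Filter in the shape a find() would accept; the videoId is a placeholder.
const filter = { videoId: "v1", month: { $in: monthKeys(start, end) } };

// Stand-in for the two monthly documents such a query would return.
const docs = [
  { videoId: "v1", month: "201508", days: { 18: 5, 19: 7, 31: 2 } },
  { videoId: "v1", month: "201509", days: { 1: 4, 16: 1, 17: 9 } },
];
const total = playsInRange(docs, start, end);
```

The extra days fetched (08/01 to 08/18 and 09/17 to 09/30) are simply skipped on the client side.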

But if your minimum time period is a day, why not do it by day?
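
For comparison, a sketch of the one-doc-per-video-per-day layout, again in plain JavaScript over an in-memory array. The field names `day` and `plays` and the function `dailySeries` are assumptions for illustration. Storing the day as a fixed-width "YYYYMMDD" string means an arbitrary range is just a string comparison, with no month boundaries to handle.

```javascript
// Hypothetical one-document-per-video-per-day layout.
// "day" is a fixed-width "YYYYMMDD" string, so string order equals date order.
const dailyDocs = [
  { videoId: "v1", day: "20150917", plays: 9 },
  { videoId: "v1", day: "20150818", plays: 5 },
  { videoId: "v1", day: "20150916", plays: 1 },
  { videoId: "v1", day: "20150819", plays: 7 },
];

// In a database this would be a range filter such as
// { videoId, day: { $gte: from, $lte: to } } sorted by day.
function dailySeries(docs, videoId, from, to) {
  return docs
    .filter((d) => d.videoId === videoId && d.day >= from && d.day <= to)
    .sort((a, b) => (a.day < b.day ? -1 : a.day > b.day ? 1 : 0))
    .map((d) => ({ day: d.day, plays: d.plays }));
}

const series = dailySeries(dailyDocs, "v1", "20150819", "20150916");
```

The trade-off is document count: up to 31 small docs per video per month instead of one.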


#5

It sounds as if you do not yet know what you need from the product point of view, so you want to do everything, just in case you need it later. :wink: Having said that so rudely, let me fess up and say that this is precisely what all of us wrestle with every day.

My advice for cases like this would be to do both: use granular data to answer questions about the last X days, and then roll that up into monthly summaries. Knowing what happened in the past couple of hours is likely to be a legitimate requirement, but beyond 30 days, who really cares what happened on a particular day between 2 and 3 AM? Sure, today we may discover that something exceptional happened way back when, but those are usually administrative edge cases and not something anyone cares about with regard to the videos they are watching. HTH. :wink:
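
A minimal sketch of the "do both" idea in plain JavaScript, assuming raw events are kept for a fixed window (30 days here) and anything older is folded into per-video monthly totals. The function `rollUp`, the `keepDays` parameter and the `"videoId:YYYYMM"` key format are all made up for illustration, and months are computed in UTC.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Split raw events into "recent" ones (kept as-is) and monthly totals
// for everything older than the retention window.
function rollUp(events, now, keepDays) {
  const cutoff = now - keepDays * DAY_MS;
  const recent = [];
  const monthly = {}; // e.g. { "v1:201507": 2 }
  for (const e of events) {
    if (e.ts >= cutoff) {
      recent.push(e);
      continue;
    }
    const d = new Date(e.ts);
    const month = `${d.getUTCFullYear()}${String(d.getUTCMonth() + 1).padStart(2, "0")}`;
    const key = `${e.videoId}:${month}`;
    monthly[key] = (monthly[key] || 0) + 1;
  }
  return { recent, monthly };
}

const now = Date.UTC(2015, 8, 6); // 2015-09-06
const rawEvents = [
  { videoId: "v1", ts: Date.UTC(2015, 6, 10, 2, 30) }, // older than 30 days
  { videoId: "v1", ts: Date.UTC(2015, 6, 28, 14, 0) }, // older than 30 days
  { videoId: "v1", ts: Date.UTC(2015, 8, 5, 23, 0) }, // inside the window
];
const result = rollUp(rawEvents, now, 30);
```

In practice this would run as a periodic job, and the old raw events would only be deleted after the summaries have been written. Once they are gone, the per-day and per-hour detail for that period is gone too.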