Is it okay to use this unusual method for pagination?

bluray · July 28, 2019, 6:15pm

So, I need to get the latest notifications, and I’ve been considering to store them like this:

{
    "_id" : "...",
    "userId" : "...",
    "date" : 1564333565,
    "type" : "FOLLOW",
    "notifierId" : "...",
    "index" : 4
}

With ‘index’ being just a counter.
Then I would do:
Notifications.rawCollection().createIndex({ index: 1 });

(I’m not indexing by ‘date’ because they can overlap)

The problem is that if I try to get the latest two, I get indexes 0 and 1.
db.getCollection(‘Notifications’).find({‘index’:{$lt:4}}).limit(2).explain()

That uses IXSCAN, which is good. But if I try to count from the last like this:

db.getCollection(‘Notifications’).find({‘index’:{$lt:4}}).limit(2).sort( { $natural:-1 } ).explain()

I get the correct results, but it uses a COLLSCAN.

So I thought, “instead of creating indexes and doing tricky commands, why don’t I just create blocks with ids?”

Example of NotificationBlocks:

{
	"_id" : "...",
        "userId" : "...",
	// holds 30 ids max
	"notificationIds":['rosJydD9woxo9tT7P', 'DuLQZBFmKwdfzdXGJ', 'ztjXH6TNYqGvQgh3G', ...],
	"prevBlock":'D43TZ3ih2h8PDDGFr',
	"nextBlock":'CNrAyc87yAzoHPps3'
}

That way, I just check the user profile to get the id of the latest block, then I can get the ids of the latest 30 notifications. If I need more, I check if there’s a previous block.

vigorwebsolutions · July 28, 2019, 9:38pm

I think in general if you are looking at creating a new way of solving an old problem, you’re probably going to run into the same scalability, reliability, or ease of use issues that have resulted in the tried and true methods that currently exist.

In the case of pagination, I think most would prescribe either using a cursor, or using a skip/offset.

It looks like you’re adding complexity to the cursor-based approach by creating these “blocks” of notifications?

I would strongly suggest either using the date information that is inherently stored in the _id or by using a createdAt timestamp and indexing that. While there may be “overlap”, you’re down to an index at the millisecond level, and if you fall back to a secondary sort by the _id property, then you’ll have consistent returns.

Whether you use a cursor based approach or offset depends more on your pagination implementation, I think. I tend to couple cursors with lazy load implementations and skip/offset with more traditional table pagination, but that’s just my preference.

bluray · July 28, 2019, 10:36pm

Thanks for your reply!

I’ve been avoiding skip/limit because I read that they don’t scale.
Eg, if you do “skip:60, limit:30”, the server builds a result with a billion documents (or as many documents as the ‘find’ pattern returns), and then does the skip/limit based on that result.

If you need to do a ‘find’ in the opposite direction (eg with notifications or messages), searching backwards is tricky, and does a collscan.

Using blocks, you instantly now the ids of the documents you need.

vigorwebsolutions · July 28, 2019, 11:21pm

I’m always a bit skeptical of advice that follow the pattern “don’t use X because of Y”. Many times, misuse of the tool can lead people to believe that it’s not performant, is prone to errors, etc., when in actuality, they just aren’t using it correctly.

I would agree that if you have a query that results in a billion documents, you’ll have an issue with response times, scaling, etc.

But fortunately for us, db tools like mongo are smart, and process things sequentially. So, even if you have a billion notification documents, if the first thing you do is query by the userId, you’re already down to the number of documents that match that user (lets say a few thousand to be safe).

Even a skip/limit based approach on a few thousand documents should be relatively quick.

You could also use a cursor-based approach, coupled with the userId filter that would be especially performant if you have good indexes on userId and something like createdAt.

mrrk · August 2, 2019, 10:25am

So with this query here:

db.getCollection(‘Notifications’).find({‘index’:{$lt:4}}).limit(2).sort( { $natural:-1 } ).explain()

The reason it’s using COLLSCAN is because that’s what $natural does. It uses a COLLSCAN to return documents in their ‘natural order’, which isn’t necessarily the order they were inserted in, if they’ve been updated since.

If you want to return items sorted by your index property, you need to sort on that. And if you’ve added an index to it, then it won’t need to COLLSCAN to do it.

db.getCollection(‘Notifications’).find({‘index’:{$lt:4}}).limit(2).sort( { index: -1 } ).explain()

uses IXSCAN, returns your last two.

The idea that skip/limit don’t scale isn’t true, either. As long as you’re indexing carefully, you don’t need to iterate the entire collection to pull results.