Sitemap on large collection

Hey folks, hoping someone can point me toward best practices here. I’ve done a bit of poking through the forum’s history, but none of the existing threads seem to address my particular problem: generating sitemaps dynamically for a large collection (500,000 documents).

I’m using this package to generate my sitemaps. It works well, though they do suggest switching to static sitemaps for larger collections.

My strategy is to chunk the sitemaps into sets of 1,000 to keep each database query small. I know sitemaps can be much larger, but pulling 10,000 documents in one request is pretty hefty.

Problem: with this chunking strategy, once skip = 200,000 the mongodb query appears to “examine” 201,000 keys to arrive at the batch I need, which leads to query times approaching 1000ms.

Example code below:

Index: Collection._ensureIndex({'edited':1,slug:1}) ← working as intended

// `let` so each callback captures its own chunk index - with `var`, every
// callback would see i === 500 by the time a request actually comes in.
for (let i = 0; i < 500; i++) {
    sitemaps.add('/sitemap-' + i + '.xml', function(req) {
        var urls = []
        var items = Collection.find({ edited: true }, {
            sort: { slug: 1 },
            limit: 1000,
            skip: i * 1000,
            fields: { slug: 1, date: 1 },
        }).fetch()
        _.each(items, function(item) {
            urls.push({
                page: 'https://www.example.com/item/' + item.slug,
                lastmod: item.date,
                priority: 0.6
            })
        })
        return urls
    })
}

Any tips on how to make this more efficient? Should I switch from dynamic on-request results to something static/cached?

Can you post the result of an explain call where your skip is 200k?

You may need to provide a hint to mongo as to which index to use. It doesn’t love boolean fields: if the majority of your documents are true, it won’t know that the index is useful.

A couple of other things: if you include date in your index and exclude _id in your projection, you can use a covered query, which is much faster.
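As a rough sketch (assuming you extend the index with date; the _id: 0 exclusion is what lets the projection be satisfied from the index keys alone, and i is the chunk index from your loop):

Collection._ensureIndex({ 'edited': 1, slug: 1, date: 1 })

// Everything the query returns now lives in the index, so mongo can skip the FETCH stage.
var items = Collection.find({ edited: true }, {
  sort: { slug: 1 },
  limit: 1000,
  skip: i * 1000,
  fields: { slug: 1, date: 1, _id: 0 },
}).fetch()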

You should also absolutely use a cache: either generate the files lazily on first request and retain them for a set period, or detect that a specific batch has changed and regenerate it then (e.g., with an observe).
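For the lazy option, a minimal sketch (the cache shape, TTL, and buildUrls callback are illustrative, not from any package):

// Build a chunk's URL list on first request, then reuse it until the TTL expires.
var sitemapCache = {}
var CACHE_TTL_MS = 24 * 60 * 60 * 1000

function getCachedUrls(chunk, buildUrls) {
  var hit = sitemapCache[chunk]
  if (hit && Date.now() - hit.builtAt < CACHE_TTL_MS) return hit.urls
  var urls = buildUrls()
  sitemapCache[chunk] = { urls: urls, builtAt: Date.now() }
  return urls
}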

Sure, here are the juicy bits from an explain('executionStats') on Atlas. Seems like the index is working fine. A covered query means it’s just returning results from the index and not the documents themselves, is that right?

{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "test.collection",
                "indexFilterSet" : false,
                "parsedQuery" : {
                        "$and" : [
                                {
                                        "edited" : {
                                                "$eq" : true
                                        }
                                },
                        ]
                },
                "winningPlan" : {
                        "stage" : "PROJECTION_SIMPLE",
                        "transformBy" : {
                                "limit" : 1000,
                                "skip" : 200000
                        },
                        "inputStage" : {
                                "stage" : "FETCH",
                                "inputStage" : {
                                        "stage" : "IXSCAN",
                                        "keyPattern" : {
                                                "edited" : 1,
                                                "slug" : 1
                                        },
                                        "indexName" : "edited_1_slug_1",
                                        "isMultiKey" : false,
                                        "multiKeyPaths" : {
                                                "edited" : [ ],
                                                "slug" : [ ]
                                        },
                                        "isUnique" : false,
                                        "isSparse" : false,
                                        "isPartial" : false,
                                        "indexVersion" : 2,
                                        "direction" : "forward",
                                        "indexBounds" : {
                                                "edited" : [
                                                        "[true, true]"
                                                ],
                                                "slug" : [
                                                        "[MinKey, MaxKey]"
                                                ]
                                        }
                                }
                        }
                },
        },
        "executionStats" : {
                "executionSuccess" : true,
                "nReturned" : 266418,
                "executionTimeMillis" : 1145,
                "totalKeysExamined" : 266419,
                "totalDocsExamined" : 266418,
                "executionStages" : {
                        "stage" : "PROJECTION_SIMPLE",
                        "nReturned" : 266418,
                        "executionTimeMillisEstimate" : 279,
                        "works" : 266420,
                        "advanced" : 266418,
                        "needTime" : 1,
                        "needYield" : 0,
                        "saveState" : 2082,
                        "restoreState" : 2082,
                        "isEOF" : 1,
                        "transformBy" : {
                                "limit" : 1000,
                                "skip" : 200000
                        },
                        "inputStage" : {
                                "stage" : "FETCH",
                                "nReturned" : 266418,
                                "executionTimeMillisEstimate" : 215,
                                "works" : 266420,
                                "advanced" : 266418,
                                "needTime" : 1,
                                "needYield" : 0,
                                "saveState" : 2082,
                                "restoreState" : 2082,
                                "isEOF" : 1,
                                "docsExamined" : 266418,
                                "alreadyHasObj" : 0,
                                "inputStage" : {
                                        "stage" : "IXSCAN",
                                        "nReturned" : 266418,
                                        "executionTimeMillisEstimate" : 65,
                                        "works" : 266420,
                                        "advanced" : 266418,
                                        "needTime" : 1,
                                        "needYield" : 0,
                                        "saveState" : 2082,
                                        "restoreState" : 2082,
                                        "isEOF" : 1,
                                        "keyPattern" : {
                                                "edited" : 1,
                                                "slug" : 1
                                        },
                                        "indexName" : "edited_1_slug_1",
                                        "isMultiKey" : false,
                                        "multiKeyPaths" : {
                                                "edited" : [ ],
                                                "slug" : [ ]
                                        },
                                        "isUnique" : false,
                                        "isSparse" : false,
                                        "isPartial" : false,
                                        "indexVersion" : 2,
                                        "direction" : "forward",
                                        "indexBounds" : {
                                                "edited" : [
                                                        "[true, true]"
                                                ],
                                                "slug" : [
                                                        "[MinKey, MaxKey]"
                                                ]
                                        },
                                        "keysExamined" : 266419,
                                        "seeks" : 2,
                                        "dupsTested" : 0,
                                        "dupsDropped" : 0
                                }
                        }
                }
        },
        "serverInfo" : {
             // redacted
        },
        "ok" : 1,
        "$clusterTime" : {
             // redacted
        },
        "operationTime" : Timestamp(1617648048, 1)
}

It does look like it’s using the index. I have had problems before where mongo refused to use the right index and explain reported that it was using it when it wasn’t. Did you give a sort to this query you’re explaining?

Yeah, a covered query will skip the FETCH stage.
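Once it’s covered, the shape to look for in explain is roughly this (a sketch of the relevant fields, not real output):

        "winningPlan" : {
                "stage" : "PROJECTION_COVERED",
                "inputStage" : {
                        "stage" : "IXSCAN",
                        ...
                }
        }
        ...
        "totalDocsExamined" : 0

i.e. no FETCH stage at all.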

We created our own sitemap generator that runs as a microservice in Node.js. However, having just read what meteor-sitemaps can do, I see no mention of the size limits imposed on sitemap.xml – at least by Google:

All formats limit a single sitemap to 50MB (uncompressed) and 50,000 URLs.

…so make sure this package handles that limit, otherwise your sitemap of 500k entries will not work with Google. (See Build and submit a sitemap.)

As already hinted, you should probably generate the sitemap once in a while and cache it until you generate a new one. Doing it on the fly doesn’t sound like a good idea, especially not if it needs to be that big. We cache the sitemap file(s) in redis, where the size limit of each entry is 512MB. Delivery is directly via nginx from redis, so our Meteor application is not even affected.
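For illustration only (the client and the key naming are assumptions, not our exact code), the write side can be as small as:

import { createClient } from 'redis';

const redis = createClient();

// Store each generated sitemap under a predictable key that nginx can look up,
// e.g. "sitemap:/sitemap-42.xml" -> the raw XML string.
export async function cacheSitemap(path, xml) {
  if (!redis.isOpen) await redis.connect();
  await redis.set(`sitemap:${path}`, xml);
}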

Beyond the scope of the mere generation of the sitemap, also keep in mind that Google won’t index that many pages of your app, unless you’re writing the next Facebook or cnn.com or whatever else at that magnitude of importance.

Unfortunately I’m not indexing the date. Womp womp. Cool to learn about this type of query, though. Didn’t know it existed.

They actually get split into separate sitemaps of 1,000 entries each, so nowhere near Google’s limit.

Good point. I can see Google has indexed about 150,000 of these URLs (which ain’t all of them).

If you only ever care about edited you can easily modify your index:

Collection._ensureIndex(
  { 'edited':1, slug:1, date: 1 },
  { partialFilterExpression: { edited: true } }
)

This ensures the index is small and very useful - of course, this only really makes sense if this is the only place you use this index (or you always include { edited: true }).

Regarding explain - ensure that you’re always explaining the query exactly as you would run it. It looks like you’re missing the sort on your explain - you should have a sortPattern: ....

I mention using a hint because mongo does sometimes lie in explain, and even if you have an index that by all logic should be perfect for your query, mongo won’t always use it if you have an unusual distribution of data in your dataset - e.g., roughly half of your entries matching { edited: true }. For example, in my application I query over un-processed emails: without a hint mongo crashes with OOM; with a hint it finishes fast, with totalKeysExamined: 13370 and totalDocsExamined: 100 (which is the page length). The fact that your explain shows a FETCH stage examining every matching document tells me that mongo isn’t able to resolve your query by the index alone.
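As a sketch, one reliable way to pass a hint from Meteor is to go through rawCollection (this assumes the extended { edited: 1, slug: 1, date: 1 } index suggested above; i is the chunk index):

// Same query as before, but on the raw driver cursor so we can force the index.
const items = await Collection.rawCollection()
  .find({ edited: true }, { projection: { slug: 1, date: 1, _id: 0 } })
  .sort({ slug: 1 })
  .hint({ edited: 1, slug: 1, date: 1 })
  .skip(i * 1000)
  .limit(1000)
  .toArray();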

Why not use a mongodb cursor and let mongodb handle the “chunks” by itself? The ideal batch size for mongodb is, I think, around 100 docs per query, but instead of worrying about the batch size, let the cursor handle it for you.

Have an external counter that tells your code to move on to the next sitemap when it reaches your per-sitemap limit, e.g. 1,000 (start a new sitemap and reset the counter to zero). That limit no longer affects your query, so you could even raise it to the actual limit of 10,000.

I had a similar problem some time ago. It wasn’t related to sitemaps at all, but rather to large exports – hundreds of megabytes and more.

First of all, @znewsham’s comments about indexing are the ones you should focus on first - not only in this case but, IMO, always. (Actually, I recommend running explain for every non-trivial query.) One more thing here: if your query won’t be covered by the index (for any reason), remember that an index doesn’t have to include all of its partialFilterExpression dependencies:

// This index will work for your query, but won't cover it.
Collection._ensureIndex({ slug:1, date: 1 }, { partialFilterExpression: { edited: true } })

Secondly, I agree with @rjdavid. Instead of rolling your own offset-based pagination, just create a cursor and go through it. In my exporting case, I’ve opened an S3 stream and streamed all of the data: cursor → mapping function → S3 stream. It offloaded the pressure from the server almost completely, as the mapping logic was fairly easy (and there’s basically none in your case). Then, having all of the sitemaps on S3, you can easily serve them straight away, without holding all of them in memory at once.

Totally – I’ve definitely confirmed the indexing is all good. Going to try with the PROJECTION_COVERED method as well as partialFilterExpression.

I’m not entirely sure how to do this. Would that basically be var items = Collection.find({edited:true},{fields:{...}}) without the fetch() at the end, followed by items.map((item) => {...})? And then you upload the results to S3 and somehow serve that when someone hits the sitemap? To be honest I don’t even fully understand the merits of using a cursor instead of fetch!

In our case, we’ve produced a single file – it was easier. To actually stream it, we’ve used actual Node.js streams, as MongoDB cursors are streams. Basically, it looked like this:

import AWS from 'aws-sdk';
import { promisify } from 'util';

export const s3 = new AWS.S3();
export const uploadPromise = promisify(s3.upload.bind(s3)); // bind, so `upload` keeps its `this` when promisified

// Later...
// Note: `rawCollection` is needed as Meteor cursor wrappers are NOT streams.
const cursor = Collection.rawCollection().find({ ... }, { ... });
const transform = getStreamTransformer();
const { Location: fileUrl } = await uploadPromise({
  // Other S3 options...
  Body: cursor.pipe(transform)
});

Where getStreamTransformer returns a Transform stream (we used ParallelTransform from the pipeline-pipe package). In your case, it’d be far easier to aggregate chunks in memory:

let urls = [];
const sitemaps = [];
Collection.find({ ... }, { ... }).forEach(document => {
  urls.push(getSitemapEntry(document));
  if (urls.length === URLS_PER_SITEMAP) {
    sitemaps.push(getSitemapFromURLsAndStoreItInS3(urls));
    urls = [];
  }
});

// Flush the final, partially filled batch - without this, the last
// few entries (fewer than URLS_PER_SITEMAP) would never be written.
if (urls.length > 0) {
  sitemaps.push(getSitemapFromURLsAndStoreItInS3(urls));
}

const sitemapsURLs = await Promise.all(sitemaps);

And when it comes to serving it, I believe you could redirect all of your sitemap requests directly to a public S3 bucket.
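For example, in Meteor something along these lines could do it (the bucket URL is a placeholder):

import { WebApp } from 'meteor/webapp';

// Redirect /sitemaps/* requests straight to the corresponding object in a public bucket.
WebApp.connectHandlers.use('/sitemaps', (req, res) => {
  res.writeHead(302, { Location: `https://your-bucket.s3.amazonaws.com${req.url}` });
  res.end();
});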

We are using this npm package to stream our sitemaps to S3: sitemap - npm

Are you sure it will benefit you to index all these pages? Won’t it burn through all your crawl budget and dilute your SEO? We also have a very large number of pages, and since the last Google core update, we deindexed nearly all of them to index only the high-SEO-value ones.