Strategies to create a very large sitemap

I am building a sitemap for a website with close to 500k pages, and I am trying to figure out the proper way to generate it dynamically so that it updates frequently.

I’ve checked out meteor-sitemaps, but that doesn’t seem to work because that many pages will cause memory problems. I want to use sitemap.js and create a new sitemap every week with a cron job, but I am unsure how to store the result at a URL. How would I go about saving it to the ./public folder in my project? Or is there a different approach I need to take?

Any help or ideas would be great!

I wrote some code to do the same thing for my site, which has a few tens of thousands of pages. Basically, when one of the apps that handles data starts up, a function runs and writes a new sitemap for each of the companies that I have data for.

You can upload to S3, but it may be better to upload the sitemap to your actual site, at least temporarily, so you can point Google directly at it from the console, where you specify the URL (which, unfortunately, can only be on your site’s domain). A sitemap with 500k loc entries would be way too big a file for me to include in each client permanently.

You could also serve your huge sitemap from nginx and check the request headers for bots (google around for this; there are quite a few posts about it). That way your Meteor app doesn’t have to respond to them, and you don’t serve your huge sitemap to your users.

Here is a function I wrote that composes a single sitemap entry (it uses moment.js). It sets lastMod to the day before. It is untested so far, so maybe write a few tests for it, but it’s a decent start. Hope it helps.

/**
 * Composes a single sitemap entry.
 *
 * @param {string} pre - part of the url that comes before the variable for each loc
 * @param {string} variable - probably data from the database; what is forEach'ed through
 * @param {string} post - part of the url that comes after the variable
 * Example: 'https://somehost.com' + pre + '/' + variable + post
 * @returns a complete, single sitemap entry, or false if variable is not purely alphabetical
 * Excludes the sitemap header and the ending: fs.appendFileSync(siteMap, "</urlset>");
 */
const moment = require('moment'); // assumes the moment package is installed

function composeSiteMapEntry(pre, variable, post) {
  // Skip anything that is not purely alphabetical; url encoding is handled later
  if (/[^a-z]/i.test(variable)) return false;
  const host = 'https://stockbase.com'; // no trailing slash; pre and post supply their own slashes
  const loc = '<loc>' + host + pre + '/' + variable + post + '</loc>';
  const lastMod = '<lastmod>' + moment().subtract(1, 'days').format('YYYY-MM-DD') + '</lastmod>';
  const changeFreq = '<changefreq>daily</changefreq>';
  // Element order follows the sitemaps.org examples: loc, lastmod, changefreq
  return '<url>' + loc + lastMod + changeFreq + '</url>\n';
}

I do this before:

const fs = require('fs');

fs.writeFileSync(siteMapFile, ''); // truncates the file, erasing any previous sitemap
const header = '<?xml version="1.0" encoding="UTF-8"?>\n<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"' +
               ' xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"' +
               ' xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">\n';
fs.appendFileSync(siteMapFile, header);
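
For the 500k-page case in the question, the header, entries and closing tag can be put together roughly as follows. This is only a sketch under assumptions: the sitemap protocol caps a single file at 50,000 URLs, so the list is split into several files plus a sitemap index. The helper names (`escapeXml`, `writeSitemaps`, `writeSitemapIndex`), the file names, and the idea that the files are served from the site root are all hypothetical, and the header is cut down to the bare namespace.

```javascript
const fs = require('fs');
const path = require('path');

const XML_DECL = '<?xml version="1.0" encoding="UTF-8"?>\n';
const NS = 'http://www.sitemaps.org/schemas/sitemap/0.9';

// Escape the five characters that are special in XML text and attributes.
function escapeXml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Hypothetical helper: splits `urls` into files of at most `maxPerFile`
// entries, writes each one completely (header, entries, closing tag),
// and returns the paths of the files it wrote.
function writeSitemaps(dir, urls, maxPerFile = 50000) {
  const files = [];
  for (let start = 0; start < urls.length; start += maxPerFile) {
    const body = urls
      .slice(start, start + maxPerFile)
      .map((url) => `<url><loc>${escapeXml(url)}</loc></url>\n`)
      .join('');
    const file = path.join(dir, `sitemap-${files.length + 1}.xml`);
    // writeFileSync replaces any previous version of the file.
    fs.writeFileSync(file, `${XML_DECL}<urlset xmlns="${NS}">\n${body}</urlset>\n`);
    files.push(file);
  }
  return files;
}

// Hypothetical helper: writes a sitemap index pointing at the files above,
// assuming they are served from the root of `host`.
function writeSitemapIndex(dir, host, files) {
  const body = files
    .map((file) => `<sitemap><loc>${escapeXml(`${host}/${path.basename(file)}`)}</loc></sitemap>\n`)
    .join('');
  const indexFile = path.join(dir, 'sitemap-index.xml');
  fs.writeFileSync(indexFile, `${XML_DECL}<sitemapindex xmlns="${NS}">\n${body}</sitemapindex>\n`);
  return indexFile;
}
```

A weekly job could call `writeSitemaps(dir, allUrls)` and then `writeSitemapIndex(dir, host, files)`, so that only the index URL needs to be submitted to the search console. Building each file's body in memory keeps the sketch short; with 50,000 entries per file that should stay small, but a write stream would also work.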

sitemap.js has a handy cache for that. Just pass cacheTime: 1000 * 60 * 60 * 24 * 7 as an option and you’re good to go with a weekly cached version of your sitemap.
