Optimizing subscriptions

The main screen of my app has infinite scroll. I fetch the majority of each document's properties with a Meteor method, but I subscribe to the reactive properties of the documents the method returns.

I accomplish this roughly as follows:

// ===== CLIENT =======
let docs = [];
let page = 0;
let subHandles = [];

function loadMore() { // loads 10 more documents
  Meteor.call('getMore', { page }, (err, newDocs = []) => {
    const newDocIds = newDocs.map(({ _id }) => _id);
    docs = docs.concat(newDocs);
    subHandles.push(Meteor.subscribe('reactiveDataOnDocs', newDocIds));
    page += 1;
  });
}

// ===== SERVER =======
Meteor.publish('reactiveDataOnDocs', (docIds) => {
  return Documents.find({ _id: { $in: docIds } }, { fields: { reactiveProp: 1 } });
});

Not all users see the same feed of documents. For example, user1 may fetch documents A, B, C, D, E, and user2 may fetch A, C, D, F, G.

Questions:

  1. Does it add more overhead to subscribe to each document individually? (e.g. do ten subscriptions for ten different documents use more memory than one subscription for those same ten documents?)
  2. If user1 subscribes to documents [A, B, C, D, E], and user2 subscribes to [A, C, D, F, G], are A, C, D in memory once or twice?

I apologize if I’m not explaining this well; I’ll clarify as needed.

Unless I’m missing something it seems like it would be better to ditch the method call and just do a regular pub/sub that accepts a number argument. Increase that number when the user scrolls, and use that number parameter for a limit on how many documents you pull in.
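
A minimal sketch of that suggestion, assuming a publication name and sort field of my own choosing (`'feed'`, `createdAt`), with the Meteor-specific parts shown as comments:

```javascript
// Shared helper: build the cursor options for a given limit.
function feedOptions(limit) {
  return { sort: { createdAt: -1 }, limit };
}

// Server (Meteor):
// Meteor.publish('feed', function (limit) {
//   return Documents.find({}, feedOptions(limit));
// });

// Client: grow the limit as the user scrolls; re-subscribing with a larger
// limit incrementally publishes only the additional documents.
// let limit = 10;
// Meteor.subscribe('feed', limit);
// onScrollNearBottom(() => { limit += 10; Meteor.subscribe('feed', limit); });
```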

Not readily an option for this. I do a lot of data crunching and joins, and I don’t want the entirety of every document that I know doesn’t change stuck in server-side memory.

Any thoughts on the questions from the OP?

First, some background on how publications work under the hood:

When you create an “observe” (aka a function which watches the result of a particular database query), it will store the entire result set of that query in memory. Observes are multiplexed, so if you watch the exact same query twice it will only store the result set in memory once. If there are multiple observes which watch different queries, the observes are not multiplexed, so the result set for each observe is stored separately (even if the observes share a published document).

Publications that return a cursor create an observe that watches that cursor.

Separately, a copy of all the data published to each client is stored on the server. Let’s walk through a few examples.

For each of the following examples I assume that if multiple users subscribe to a publication P1, each user is supplying the same arguments to P1 and so P1 is watching the same cursor for both users. I’m also assuming that nothing is published to the client unless I explicitly mention it.

  • Suppose you have a single user and have a publication P1 that publishes A to the client. Then A will be stored in memory twice (once for the observe and once to mirror the user’s data).

  • Suppose you have two users and a publication P1 subscribed to by both users that publishes A. Then A will be stored in memory three times (once for the observe and once per user to mirror their data).

  • Suppose you have a single user and two publications P1, which publishes [A, B], and P2, which publishes [A, C]. Then A will be stored in memory 3 times (once for each publication and once to mirror the user’s data).

  • Suppose you have the previous example but with two users. Then A will be stored in memory 4 times (once per publication and once per user).
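
The accounting in these four examples reduces to a simple rule (my own formalization of the explanation above, not anything Meteor exposes): a document's copy count is the number of distinct, non-multiplexed observes containing it, plus the number of users it's published to.

```javascript
// Toy model of the memory accounting described above (not a Meteor API):
// copies = distinct observes containing the doc + users receiving the doc.
function copiesInMemory(observesContainingDoc, usersReceivingDoc) {
  return observesContainingDoc + usersReceivingDoc;
}

// The four examples above, in order:
console.log(copiesInMemory(1, 1)); // one user, one publication      -> 2
console.log(copiesInMemory(1, 2)); // two users, multiplexed observe -> 3
console.log(copiesInMemory(2, 1)); // one user, two publications     -> 3
console.log(copiesInMemory(2, 2)); // two users, two publications    -> 4
```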

A couple more random facts:

  • Having more non-multiplexed observes does have some overhead, because it makes oplog tailing (or polling if you aren’t using the oplog) more expensive. This is somewhat less true if each publication is by _id.
  • Creating and destroying subscriptions is expensive if the result set is large.

So to directly answer your questions:

  1. A bunch of single document publications will be more expensive than a publication including all of those docs, unless the single document publications are by _id.
  2. With the two publications you mentioned ([A, B, C, D, E] to user #1 and [A, C, D, F, G] to user #2), A, C, and D will each be stored in memory 4 times, assuming those documents hadn’t already been published to the users via other publications. If you used individual document publications, then A, C, and D would each be stored in memory 3 times.

I think the easiest solution to this problem is just to create a publication that takes a page parameter and publishes the documents in chunks. So if you did Meteor.subscribe('myPub', 0) it would publish the first 10 docs (or however big you want to make the chunk), and then Meteor.subscribe('myPub', 1) would publish docs 11-20 etc.

You can do this using a combination of skip and limit in the cursor you return from the publication. Also make sure to add a sort to the cursor or the publication won’t be oplog tailable (VERY BAD).
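
A sketch of that paged publication, reusing the collection and field from the original post (the publication name and page size are assumptions of mine):

```javascript
const PAGE_SIZE = 10;

// Cursor options for a given zero-based page: a sort (per the advice above,
// needed for oplog tailing with a limit), then skip past earlier pages.
function pageOptions(page, pageSize = PAGE_SIZE) {
  return {
    sort: { _id: 1 },
    skip: page * pageSize,
    limit: pageSize,
    fields: { reactiveProp: 1 },
  };
}

// Server (Meteor):
// Meteor.publish('reactiveDataByPage', function (page) {
//   check(page, Number);
//   return Documents.find({}, pageOptions(page));
// });

// Client:
// Meteor.subscribe('reactiveDataByPage', 0); // docs 1-10
// Meteor.subscribe('reactiveDataByPage', 1); // docs 11-20
```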


I could hug you. This really helped my understanding.

My feed has several different modes. I tried to keep my example simple, but to fully illustrate my predicament I should explain that my app is a social app for tracking and ranking pinball scores. The score feed has many different options:

  • Featured (scores flagged to be featured)
  • Global Bests (only scores that are that user’s personal best on given pinball machine)
  • Social Bests (only scores that are a user I’m following’s personal best on given machine)
  • All Scores (all public scores)
  • Social Scores (only scores from users I’m following)
  • My Scores (a feed on my scores)
  • Location Feed (a feed of scores tagged with a particular location)
  • Event Feed (a feed of scores tagged with a particular event)

ALSO, it’s worth noting that each of these scores is ranked (the rank is calculated when the score is retrieved). I use caching to keep the response time low; I can retrieve and rank 10 scores in <30ms on average. Scores are ranked:

  • globally against all other bests for that machine
  • socially against all other bests from people you follow for that machine
  • location/event against all other bests from that location/event for that machine

My decision right now is to keep doing them in tens like I’ve been doing.

Thanks again for your fantastic explanation


@veered Thanks for the enlightening post. Can you elaborate on why you’d need to add a sort to the cursor?

Queries with limit but no sort aren’t oplog tailable. Also, I took another look at the oplog docs and it appears that any queries using skip aren’t oplog tailable.

Here is a link to the (rather difficult to find) documentation on oplog tailing in Meteor: https://github.com/meteor/docs/blob/version-NEXT/long-form/oplog-observe-driver.md. I also wrote this package to help figure out which publications aren’t oplog tailable.

The reason why queries with a limit must always have a sort is a bit difficult to explain.

The oplog is a stream of individual document updates. If a single operation modifies 100 documents in the database, this will create 100 entries in the oplog. For a query to be oplog tailable, it must be possible to efficiently tell whether or not the result set of a query has changed by looking only at these individual updates, one by one.

Suppose a query has a limit of 10 and no specified sort. The query will return the first 10 matching documents sorted by the natural ordering. The natural ordering is an implementation detail of Mongo, and completely opaque to Meteor. So from Meteor’s perspective the query will just return 10 random documents.

Now suppose we insert a new document matching the query. When Meteor sees this update it has no way of knowing whether or not Mongo would consider this document to be in the first 10 documents according to the natural ordering. So it has to re-poll the whole query.

If the query does have a sort then Meteor can be much smarter. It can look at the individual update coming through the oplog and see if the corresponding document belongs in the first 10 documents. If it doesn’t, then no need to re-poll.

However, I think that Meteor should still do oplog tailing even when there is a limit without a sort. When there is a limit, having a sort is definitely more efficient than not having a sort, but oplog tailing would still be better than polling even when there is no sort.

The only reason I can think of why Meteor may require the sort in the presence of a limit is if the natural ordering isn’t assumed to be stable. If it could change at any time I can see why polling would be necessary (since the result set could change at any time).

Anyway, if you have a limit and don’t care about the sort order then just sort by _id. The _id field has a built-in index so it’s free.


When using Mongo, you should have a single document that corresponds to each page of data you want to show. Every performant paging solution eventually reduces to this.

Save the “pages” for common filters and queries. Otherwise, filter on the pages and communicate an approximate page count to the end user. Google does this; after all, people rarely look past the first page.

You make a subscription that takes a page number as an argument, which you then translate into the .skip parameter (code example below). Turn this into your subscription and it works nicely with Meteor and React’s reactivity. I use the Intersection Observer API to trigger the page change and a state variable for the current page number.

function printStudents(pageNumber, nPerPage) {
  print( "Page: " + pageNumber );
  db.students.find()
             .sort( { _id: 1 } )
             .skip( pageNumber > 0 ? ( ( pageNumber - 1 ) * nPerPage ) : 0 )
             .limit( nPerPage )
             .forEach( student => {
               print( student.name );
             } );
}
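
Turned into a Meteor publication, the mongo-shell example above might look roughly like this (the publication and collection names are assumptions; note the earlier post in this thread pointing out that skip-based queries are not oplog tailable):

```javascript
// Same skip computation as the mongo-shell example above.
function skipFor(pageNumber, nPerPage) {
  return pageNumber > 0 ? (pageNumber - 1) * nPerPage : 0;
}

// Server (Meteor):
// Meteor.publish('studentsPage', function (pageNumber, nPerPage) {
//   return Students.find({}, {
//     sort: { _id: 1 },
//     skip: skipFor(pageNumber, nPerPage),
//     limit: nPerPage,
//   });
// });

// Client (React), with the page number held in state:
// const [page, setPage] = useState(1);
// Meteor.subscribe('studentsPage', page, 10);
// ...an Intersection Observer callback calls setPage(page + 1).
```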