Best way to do a large publication?

theeagle · September 15, 2016, 2:41pm

I have documents that look like this:

{
    "_id" : "cgR25FAxb3tbYShjN",
    "owner" : "6mQKnNnwkQYSaaAMr",
    "username" : "admin",
    "filename" : "5sQqwbDFBcZZ.png",
    "title" : "Title",
    "points" : 2,
    "loc" : {
        "type" : "Point",
        "coordinates" : [ 
            -119.981134343, 
            20.635934343
        ]
    },
    "createdAt" : ISODate("2016-09-15T10:27:20.956Z")
}

I want to search for the closest 100 documents, then sort by either points or createdAt. I want these documents to be returned to the client in an ‘infinite scroll’ style. I can get those closest 100 documents by this query:

Photos.find({loc: {$near: { $geometry: {type: "Point", coordinates: [lng, lat]}}}}, {limit: 100});

The Problem:

If I do the standard ‘infinite scroll’ that Meteor suggests, it’s going to be troublesome to sort on the client. It will sort the first 20 (or however many you request at a time) documents by points, but then the second 20 will be either sorted in its own order, or documents will pop in at places the user doesn’t expect.

I don’t think that passing all 100 documents to the client at once is a good idea either.

Is it possible to make a ‘custom’ publication? For example, create an array of 100 documents from the query above, then return 20 items of the array at a time. If I did it this way does that mean I would be storing 100 documents in memory for each user? That doesn’t seem optimal either.

Thoughts? Thanks!!

robfallows · September 15, 2016, 2:46pm

The MongoDB docs say that using $near without a separate sort returns the documents in distance order. That may be sufficient for a jank-free infinite scroll.

theeagle · September 15, 2016, 3:25pm

That’s correct, but that would display the documents in order of distance. I want to display them in order of points. Even if I sort by points in the Mongo query, the order would change every time you load more documents.
For example: I originally pull 20 documents that are sorted by points. Then when I click Load More, the limit changes to 40. If a document in that second 20 had the highest number of points, it would be put at the top while the user is farther down the screen. So the user would never see that one, and the documents would be shifting all the time.

robfallows · September 15, 2016, 3:31pm

But surely the complete cursor will be sorted by points, not each 20 document chunk (I’m assuming points is a field in the document, so can be used for the sort).

theeagle · September 15, 2016, 5:25pm

Yes, points is a field.

I’m publishing like so:

Meteor.publish("photos", function (lat, lng, limit) {
	return Photos.find({loc: {$near: { $geometry: {type: "Point", coordinates: [lng, lat]}}}}, {limit: limit});
});

So when a page first loads, the limit variable is set to 20. Once the ‘Load More’ button is clicked, it increases the limit variable by 20, making it 40.

How would Mongo know to do the sorting by 100 documents when it’s only getting passed 20, 40, 60, etc as limits? Therefore, each time the limit increases, it changes the sort order because there is a varying amount of points in the new documents.

robfallows · September 15, 2016, 6:24pm

Because the sort is done before the limit.
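A minimal sketch of why this keeps the list stable, using a plain array in place of the collection (the sample data and the `topByPoints` helper are made up for illustration): when the whole set is sorted first and the limit is applied afterwards, raising the limit only appends items to the end of what was already shown.

```javascript
// Plain-array stand-in for a collection; `points` mirrors the field in the thread.
const photos = [
  { _id: "a", points: 2 },
  { _id: "b", points: 9 },
  { _id: "c", points: 5 },
  { _id: "d", points: 7 },
  { _id: "e", points: 1 },
];

// Hypothetical helper: sort the whole set first, then keep the first `limit` items.
function topByPoints(docs, limit) {
  return [...docs].sort((x, y) => y.points - x.points).slice(0, limit);
}

const firstPage = topByPoints(photos, 2).map((d) => d._id);
const grownPage = topByPoints(photos, 4).map((d) => d._id);

console.log(firstPage); // [ 'b', 'd' ]
console.log(grownPage); // [ 'b', 'd', 'c', 'a' ]
```

The first two entries are the same in both results, so nothing already on screen moves when the limit grows.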

theeagle · September 15, 2016, 6:36pm

So you are saying if I have a collection with 10,000 documents in it and I query it like this:

Photos.find({loc: {$near: { $geometry: {type: "Point", coordinates: [lng, lat]}}}}, {sort: {points: -1}, limit: limit});

It will sort all 10,000 documents then return the first 20 or 40, etc based on the points?

I guess I misunderstood the order. I thought Mongo sorts after the limit.

theeagle · September 15, 2016, 7:13pm

Maybe I’m misunderstanding, but it seems like using a sort on $near basically overrides the documents being the closest ones.

sort() re-orders the matching documents, effectively overriding the sort operation already performed by $near.

https://docs.mongodb.com/manual/reference/operator/query/near/

As the docs state, it’s basically like using $geoWithin.

robfallows · September 16, 2016, 8:59am

Well, yes, but isn’t that unavoidable?

Unless you want a compound sort (points/near)?
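If the goal is still the original one (the closest 100 documents, then ordered by points), one option is to do it in two steps: pick the nearest N first, then sort only that fixed set. The sketch below shows the idea on plain objects. The haversine distance and the `nearestThenByPoints` helper are assumptions for illustration, not something from the Meteor or MongoDB APIs, and the approach does hold the N documents in memory while it runs.

```javascript
// Great-circle distance in metres between two [lng, lat] pairs (haversine formula).
function haversineMeters([lng1, lat1], [lng2, lat2]) {
  const R = 6371000; // mean Earth radius in metres
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLng = toRad(lng2 - lng1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLng / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(a));
}

// Hypothetical helper: keep the `nearest` closest docs, then order that fixed set by points.
function nearestThenByPoints(docs, center, nearest) {
  return docs
    .map((doc) => ({ ...doc, distance: haversineMeters(center, doc.loc.coordinates) }))
    .sort((a, b) => a.distance - b.distance)
    .slice(0, nearest)
    .sort((a, b) => b.points - a.points);
}

const docs = [
  { _id: "far", points: 99, loc: { type: "Point", coordinates: [10, 10] } },
  { _id: "near1", points: 1, loc: { type: "Point", coordinates: [0.01, 0] } },
  { _id: "near2", points: 5, loc: { type: "Point", coordinates: [0.02, 0] } },
];

const ranked = nearestThenByPoints(docs, [0, 0], 2).map((d) => d._id);
console.log(ranked); // [ 'near2', 'near1' ]
```

The high-scoring but distant document is excluded because the distance cut happens before the points sort.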

streemo · September 19, 2016, 8:30am

TLDR: Avoid limit and skip whenever possible. Use $minDistance together with $maxDistance to paginate geoqueries, if the page size can be flexible. Never use sort with $near; you’ll waste CPU. Sort on the client whenever possible, and refrain from sorting on the server unless you are trying to implement some sort of pagination.

//client
import { GeoMath } from "meteor/streemo:geomath";
import { Geolocation } from "meteor/mdg:geolocation";
import { Mongo } from "meteor/mongo";

const coords = Geolocation.latLng();
Meteor.subscribe('photos', coords, 0, 1); //minDist is 0, page is 1.

const TransformedPhotos = new Mongo.Collection(null);

Photos.find({},{fields:{loc:1}}).observeChanges({
  added: function(id, fields){
    const dist = GeoMath.distance(coords, GeoMath.toCanonicalCoords(fields.loc))
    fields.distanceFromMe = dist;
    TransformedPhotos.insert({_id: id, ...fields})
  },
  changed: function(id, fields){
    TransformedPhotos.update(id, {$set:fields})
  },
  removed: function(id){
    TransformedPhotos.remove(id);
  }
})

//later
TransformedPhotos.find({},{sort:{distanceFromMe:1}})

Make use of $near's $minDistance specifier for pagination.

//client
const page = 2;
const lastDoc = TransformedPhotos.findOne({},{sort:{distanceFromMe:-1}});
const minDistFromMe = lastDoc.distanceFromMe; // already computed in the observer above
Meteor.subscribe('photos', coords, minDistFromMe, page)

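//server
Meteor.publish('photos', function(coords, minDist, page){
  const lim = page*20
  // $near belongs in the selector, under the geo-indexed field
  return Photos.find({
    loc: {
      $near:{
        $geometry:{type:"Point", coordinates:[coords.lng,coords.lat]},
        $minDistance: minDist
      }
    }
  }, {limit:lim, skip:lim-20}) // limit and skip go together in one options object
})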

Avoid limit and skip whenever possible. Make use of $minDistance together with $maxDistance to achieve pagination for geoqueries. The above code assumes you need exact page count pagination. Instead, why not provide the user with photo-distance regions?

Click to see photos within 10 miles
Click to see photos beyond 10 miles, closer than 50 miles.
Click to see photos beyond 50 miles, closer than 100 miles.
…

Then, you can avoid the limit/skip mess. You will probably still need to sort on the client, since your $near sort on the server has no effect on your queries on the client side … all it does is help your server get relevant documents to the client.
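As a rough sketch of the region idea, each band can be turned into a selector with $minDistance and $maxDistance. For GeoJSON points on a 2dsphere index these distances are in meters, so the miles are converted here. The `bandSelector` helper and the `loc` field name (taken from the sample document) are assumptions for illustration.

```javascript
const METERS_PER_MILE = 1609.344;

// Hypothetical helper: build a selector for photos between two distances (in miles).
// Pass `null` as maxMiles for an open-ended outer band.
function bandSelector(lng, lat, minMiles, maxMiles) {
  const near = {
    $geometry: { type: "Point", coordinates: [lng, lat] },
    $minDistance: minMiles * METERS_PER_MILE,
  };
  if (maxMiles !== null) {
    near.$maxDistance = maxMiles * METERS_PER_MILE;
  }
  return { loc: { $near: near } };
}

// "Beyond 10 miles, closer than 50 miles"
const selector = bandSelector(-119.98, 20.63, 10, 50);
console.log(JSON.stringify(selector, null, 2));
```

The resulting object could then be passed as the first argument to Photos.find inside the publication.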

theeagle · September 19, 2016, 7:04pm

Thanks for all that info!

I think you are right about the regions. I’m thinking I’m going to adjust my query to utilize maxDistance and minDistance. I’m going to try a few things and see what I like best.

theeagle · September 21, 2016, 3:37am

Why do you suggest not using limit? What if I have 100 documents between minDistance: 10 and maxDistance: 50? I don’t want to publish all 100 documents at once. I should use limit to get them in smaller sets, like 20 at a time. Is this not ok for this query?

streemo · September 22, 2016, 3:47am

Limit is fine, but apparently the implementation for $skip (which you will have to use in conjunction with limit) is a bit shoddy.

Another thing you could do is keep a cache of your document surface density, σ(x,y), then calculate as a function of spatial coordinate the values for $minDistance and $maxDistance from a point (x,y) required in order to return 20 documents. For example, if your user is at (x_0,y_0) and σ(x_0,y_0) is 80 documents / km^2, then you might want to let the user search in units of 1/2 km. Of course, you cannot find the value of σ(x,y) everywhere; you’ll have to bin it in some minimum units.
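To make the arithmetic concrete, here is a small sketch that assumes the density is uniform around the user, so that the expected count in a disc of radius r is σ·π·r². The `radiusForCount` helper is hypothetical.

```javascript
// Radius (km) of a disc expected to hold `targetCount` docs at a uniform
// density of `density` docs per km^2: targetCount = density * pi * r^2.
function radiusForCount(density, targetCount) {
  if (density <= 0) return Infinity; // no documents nearby, so no finite radius
  return Math.sqrt(targetCount / (Math.PI * density));
}

const r = radiusForCount(80, 20);
console.log(r.toFixed(3)); // "0.282"
```

At 80 documents / km^2, about 20 documents fit in 0.25 km^2, which is a disc of roughly 0.28 km radius or, equivalently, a square with 1/2 km sides.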

One other option is to apply this algorithm: First query for all documents between minDistance = 0, maxDistance = r, and if that number is less than 20, fetch the next interval: minDistance = 0, maxDistance = f(r), where f(r) is at least monotonically increasing (feel free to choose your own function). Keep doing it until you have at least 20 documents, then record the value of r. When you want to fetch your next ~20 documents, assuming the surface density varies greatly, set minDistance to r, then apply the same algorithm. If you believe that the surface density is constant, you may just immediately set minDistance to r, and then maxDistance to sqrt(2)*r.
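A rough sketch of that loop, with the counting step injected as a function so it runs without a database. The names `findRing`, `countInRing`, `grow` and the `maxTries` cap are all assumptions for illustration; in a real app the count would come from a server-side query.

```javascript
// Hypothetical helper: grow maxDistance until the ring [minDistance, maxDistance)
// holds at least `target` docs, or until `maxTries` attempts have been made.
// `countInRing(min, max)` is supplied by the caller; `grow` is the increasing function.
function findRing(countInRing, minDistance, firstRadius, target, grow, maxTries = 20) {
  let maxDistance = firstRadius;
  for (let i = 0; i < maxTries; i++) {
    if (countInRing(minDistance, maxDistance) >= target) break;
    maxDistance = grow(maxDistance);
  }
  return { minDistance, maxDistance };
}

// Toy data: document distances from the user, in metres (100, 200, ..., 6000).
const distances = Array.from({ length: 60 }, (_, i) => (i + 1) * 100);
const count = (min, max) => distances.filter((d) => d >= min && d < max).length;
const double = (radius) => radius * 2;

// First page: start at 500 m and double until at least 20 docs are covered.
const first = findRing(count, 0, 500, 20, double);
// Next page: start where the previous ring ended.
const second = findRing(count, first.maxDistance, double(first.maxDistance), 20, double);

console.log(first);  // { minDistance: 0, maxDistance: 4000 }
console.log(second); // { minDistance: 4000, maxDistance: 8000 }
```

Doubling is only one choice of growth function. The cap on attempts keeps the loop from running forever when fewer than 20 documents exist at any distance.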