Merge box deep merge

znewsham · August 31, 2019, 3:58pm

We’ve been playing around with the ddp-server package the last week and have made some changes to the way mergeBox works in some cases.

TL; DR - what are the performance costs of making the mergebox optionally “deep merge” as described below, and are there any edge cases:

A little background

In many of our collections we have a schema similar in structure to this:

{
   ...
   location: {
      state: String,
      zip: String,
      city: String
     ...
   },
   ...
}

In our site we allow users to choose which columns appear in a table, and allow them to export data from those tables (and choose different fields). In the case that the user has location.state in their table, and location.zip (or any other location property) in their export, the new property does not get sent to the client, as merge box believes that the client has the entire location object. There are many other more common scenarios where this occurs.

Until now, we’ve handled this by simply changing the subscription of these nested fields to use the top level accessor - and accepting the small overhead that comes with sending the entire object when it is not required.

We recently introduced the concept of “flex data” to our application, which allows different clients to specify arbitrary fields they want to track - to ensure no namespace collisions, we put all this data under the flex namespace, e.g., flex.myCustomField, flex.anotherCustomField, etc. (We could probably have made this flex_myCustomField, but we didn’t and aren’t keen on a migration at this point). These flex objects can be extremely large in some cases (e.g., where the entity came from a CSV import with 100 columns of data associated with it).

Our implementation

In a publication that you want to enable deep merge for, you say this.deepMerge = true - you can of course have a condition, that checks whether the publication will send data to the client that might need a deep merge.

The main change to ddp-server is in the changeField function, which I’ve provided below:

  changeField: function (subscriptionHandle, key, value,
                         changeCollector, isAdd, deepMerge) {
    var self = this;
    // Publish API ignores _id if present in fields
    if (key === "_id")
      return;

    // Don't share state with the data passed in by the user.
    value = EJSON.clone(value);

    if (!_.has(self.dataByKey, key)) {
      const precedenceListItem = {subscriptionHandle: subscriptionHandle,
                              value: value};
      if (deepMerge) {
        precedenceListItem.deepMerge = true;
      }
      self.dataByKey[key] = [precedenceListItem];
      changeCollector[key] = value;
      return;
    }
    else if (deepMerge && !EJSON.equals(value, self.dataByKey[key][0].value)) {
      const mergedValue = _.deepExtend(self.dataByKey[key][0].value, value);
      const precedenceList = self.dataByKey[key];
      for (let i = 0; i < precedenceList.length; i++) {
        precedenceList[i].value = mergedValue;
      }
      precedenceList.push({subscriptionHandle: subscriptionHandle,
                              value: mergedValue, deepMerge: true});
      changeCollector[key] = mergedValue;
      return;
    }
    var precedenceList = self.dataByKey[key];
    var elt;
    if (!isAdd) {
      elt = _.find(precedenceList, function (precedence) {
        return precedence.subscriptionHandle === subscriptionHandle;
      });
    }

    if (elt) {
      if ((elt === precedenceList[0] || elt.deepMerge) && !EJSON.equals(value, elt.value)) {
        // this subscription is changing the value of this field OR it's a deep merged field
        changeCollector[key] = value;
      }
      elt.value = value;
    } else if (!deepMerge) {
      // this subscription is newly caring about this field
      precedenceList.push({subscriptionHandle: subscriptionHandle, value: value});
    }

  }

the main change is the else if (deepMerge ...

From a functional perspective it seems to work, however we’re wondering if there are any possible problems with this approach that we’ve missed, or even if others have solved this problem differently (beyond a schema redesign).

One improvement we’re already considering is to allow not just true/false - but a list of keys for which a deep merge should be used.

znewsham · September 3, 2019, 2:21pm

giving this a bump, just in case everyone missed it over the weekend

ramez · September 3, 2019, 11:36pm

@znewsham
I see the value in deep merge, I had explored before (mentioned it somewhere in this forum). Well done implementing it. It should be shared publicly

What we have done instead is that we control our pubs manually with observers within the pub and manually sending the changed / added / removed event

We also use redisoplog (which does store the data but aggregately, I.e. not per user) which triggers the observer events

We went further and stopped using mergebox altogether in favor of letting the client do the merging, by sending ddp changed / added / removed. Here is why: we found we had little overlap in collection data being sent down. So the overhead of storing data in the mergebox and then computing changes was anecdotally too high for us with little benefit. This way we reduced server load and memory per user (thus serving more users), we consume a little more bandwidth (it’s compressed by Nginx so minor impact) and offload the merging to the client’s cpu (which is free for us)

Just wanted to share another alternative to potentially solving your problem

znewsham · September 4, 2019, 2:47am

Interesting solution - I guess from the mergebox side you just track which documentIds each client has (in order to send changed/removed events), you then just send all changed events for a document, but added/removed functions in the same way?

The same might actually work nicely for us, we have several hundred teams that use our platform, so users often care about many of the same fields in a document, enough that the overhead of sending all change events might be preferable.

ramez · September 4, 2019, 3:38am

Right.

There are two ways you can do this

Easy way:
meteor add peerlibrary:control-mergebox
And call this.added, this.changed and this.removed (check meteor pub/sub docs for an example)

Hard way - but less overhead:
In your pubs

this._session.sendAdded('content', id, doc)
this._session.sendChanged('content', id, doc)
this._session.sendRemoved('content', id)

You will likely get an error client-side when you send added for an existing doc, removed or added for inexisting docs. You can get inspired by the package to overload the client-side collection.js in mongo package to add the following (taken straight from the package above) at the top of update(msg) method:

        // If a document is being added, but already exists, just change it.
        if ((msg.msg === 'added') && doc) {
          msg.msg = 'changed';
        // If a document is being removed, but it is already removed, do not do anything.
        }
        else if ((msg.msg === 'removed') && !doc) {
          return;
        // If a document is being changed, but it does not yet exist, just add it.
        }
        else if ((msg.msg === 'changed') && !doc) {
          msg.msg = 'added';

          // We do not want to pass on fields marked for clearing.
          for (let field in msg.fields) {
            const value = msg.fields[field];
            if (value === undefined) {
              delete msg.fields[field];
            }
          }
        }