Client Minimongo falling out of sync with server DB

Hello,

We are having major problems with our internal Meteor app, which has about 300 clients connected at all times. For the last few weeks we have been pulling our hair out trying to resolve a bug in the client-server sync.

What happens is that, seemingly at random, clients fall out of sync with the server for some collections. It seems like all (or at least most) of the clients fall out of sync at the same time.
The way it manifests is that some documents in the collection seem to “freeze” at an older state and no longer respond to collection updates. With Meteor DevTools we can confirm that the update message is being sent to the client, but Minimongo never registers the update. What’s more, the document seems to be fully disconnected from the subscription: if we stop the subscription altogether, all the documents in the collection drop and get removed except the ones that are causing the problem.

It does seem like this may be (?) related to nested object updates, since the problem mostly arises when we set a nested object onto the document. We’ve read on GitHub that there was a problem with this that should be solved in Meteor 2.8, so we updated our production app to Meteor 2.8-rc.0, but the problem persists.

Notably, we have two main servers: server A serves the majority of the clients and does most of the logic, while server B hosts fewer clients but also runs the MongoDB instance that both servers connect to. It seems like the clients on server A are mostly affected, although we are not fully sure how absolute that is. The load on both servers is well within limits for both CPU and memory.

What we have tried:

  • Updated to Meteor 2.8-rc.0 (the problem was first noted on Meteor 2.2 with MongoDB 2.6; we upgraded first to Meteor 2.7 and MongoDB 6, and when that failed to resolve the issue we upgraded to Meteor 2.8).
  • Refreshed subscriptions regularly (on an interval)
  • Gone over the publish and subscription logic with a fine-tooth comb, going so far as to disable most of the security logic to make sure updates are not being blocked at any stage.
  • Drastically limited the fields being published to reduce overhead.
  • Restarted the clients and servers multiple times.

For reference, the collection in question typically has 1-10 documents per client, each a few kB, except for our admin overview, which subscribes to hundreds of documents totalling a few MB.

Luckily we control all the client hardware, so our current workaround is hard-refreshing all clients through our device management system on a three-minute interval. That is obviously a terrible long-term solution, but this app is critical to our business, so it’s preferable to having unreliable reactivity.

Any advice, suggestions or ideas are most appreciated because to be honest we are at a loss here.

Edit: I should have mentioned this: we are unable to reproduce this problem while running the server on our development machines, even when connecting Meteor to the same production MongoDB instance that our production servers use.

Can you share the publication code and the schema of the documents?

The nested object sounds a little suspicious; as far as I know, Meteor has never dealt well with publications using a nested projection, and still doesn’t in 2.8.

The schema is pretty complex and we have a lot of different fields. For simplification, it’s a collection of orders with fields like order_id, customer_id, status, seller_id, history, merchant, delivery_address, etc., i.e. information about the order itself plus the delivery information. The field we’re setting is the driver field: when we assign an order to a driver, we set a driver object on the order with the schema driver: {_id: from the Drivers collection, telephone: number, name: string}. Maybe we should try refactoring this into three separate flat fields ({driver__id, driver_telephone, driver_name}) to avoid the nesting?
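
Roughly, the two shapes would look like this (a simplified sketch with illustrative variable names, not our actual server code):

// Current shape: assign a driver by setting one nested object on the order.
// (simplified sketch, not our actual server code)
const driver = Drivers.findOne(driverId);
Orders.update(orderId, {
    $set: {
        driver: { _id: driver._id, telephone: driver.telephone, name: driver.name }
    }
});

// Possible refactor: three flat fields instead of the nested object.
Orders.update(orderId, {
    $set: {
        driver__id: driver._id,
        driver_telephone: driver.telephone,
        driver_name: driver.name
    }
});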

Server publication:

Meteor.publish('orders', function (seller, status, filters) {
    if (this.userId) {
        let user = Meteor.users.findOne(this.userId);
        if (!authorizeSeller(this.userId, seller)) {
            // return []; // authorization intentionally disabled while debugging the sync issue
        }
        let statusOrders = {
            origin_db: seller.origin_db,
            $or: [
                {
                    seller_id: seller.entity_id,
                    merchant: user.merchant,
                },
                {
                    seller_publication_id: seller.entity_id
                }
            ],
            status: status
        };
        // Admins get everything except history and customer_email;
        // other users are limited to the seller-facing fields.
        let fields = {
            fields: {
                history: 0,
                customer_email: 0
            }
        };
        if (!Roles.userIsInRole(this.userId, 'admin')) {
            fields = {fields: Orders.seller_private_fields};
        }
        return Orders.find(statusOrders, fields);
    }
});

Client subscription:

// Inside a Tracker.autorun context which re-runs based on which user (both the Meteor accounts
// user and the selected seller) is active; in practice that means it only runs on startup.
orderHandle = Meteor.subscribe('orders', SessionAmplify.get('seller'), status, SessionAmplify.get('orderfilters'), function () {
    Session.set('orders_loaded', true);
});

The publication/subscription itself seems to work: it’s how we load the orders in the first place, and they are initially loaded correctly before abruptly ceasing to register any further updates.

Also, and I should have mentioned this in the original post, we are unable to reproduce this problem while running the server on our development machines, even when connecting Meteor to the same production MongoDB instance that our production servers use.

Creating an index may help a little. In this case I think the index would be { origin_db: 1, status: 1, seller_id: 1, merchant: 1, seller_publication_id: 1 }.
It won’t be very effective, though, because you’re using the $or operator.

And if you don’t need these orders to be reactive (your clients seeing the changes immediately), then you should use a method to load the data instead.
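
Roughly something like this (just a sketch; the method name and the exact selector are only examples):

// Server: load the orders through a plain method instead of a publication (no reactivity).
Meteor.methods({
    'orders.fetch'(seller, status) {
        // authorization checks would go here
        return Orders.find(
            { seller_id: seller.entity_id, status: status },
            { fields: { history: 0, customer_email: 0 } }
        ).fetch();
    }
});

// Client: call it whenever you want fresh data (e.g. on an interval or on user action).
Meteor.call('orders.fetch', seller, status, function (error, orders) {
    if (!error) {
        // store the plain array wherever the UI reads it from
    }
});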

Thanks for the suggestion. We already have separate indexes on all of those fields and more (created_at, updated_at and others), but we don’t have a compound index. Would it seriously affect performance? Keep in mind that our MongoDB is effectively a cache for our main MySQL DB, so it never contains more than maybe a thousand documents in the collection at any given moment.

Either way the performance of the application is fine, especially on the server side.

And the reactivity is absolutely essential. This is our real-time management system that keeps everything in sync for all parties and keeps the customer up to date in real time. Among the problems this is causing: drivers showing up to pick up orders when the supplier was never informed that they should start preparing, because the reactivity broke, and so on.

Separate indexes won’t help; MongoDB will generally use a single index for an individual query. If you’re sure that your server has no problems, then you may want to check the network and the client app.

How sure are you about this? I’m not familiar with SessionAmplify, but I’d bet that SessionAmplify.get('orderfilters') is reactive.

I think the key part of the problem description is that, per DevTools, the update message demonstrably reaches the client but Minimongo never applies it.

If you’re sure about that, it would indicate there is nothing wrong with the server, DB or network. In that case I’d look for client-side errors (do you have Bugsnag, Kadira or similar enabled for client-side errors?)

Alright, that’s good to know. I added some compound indexes for the most common publications and lookups.
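
For reference, this is roughly what we added at server startup (a sketch; we create the index through rawCollection so it goes straight to the driver):

// Server startup: create the suggested compound index if it doesn't exist yet.
Meteor.startup(async function () {
    await Orders.rawCollection().createIndex({
        origin_db: 1,
        status: 1,
        seller_id: 1,
        merchant: 1,
        seller_publication_id: 1
    });
});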

At this stage I’m not sure about anything anymore. The reason I believe the server performance is fine is that, according to Monti APM, pub/sub response times and method response times are pretty reasonable, and CPU and memory usage rarely, if ever, goes high either.

Oh, yes, of course you’re right. SessionAmplify is a browser localStorage extension of Meteor’s Session that persists between browser sessions. But those orderfilters rarely change except at startup, since we largely moved to client-side filtering through DOM manipulation for dramatically improved performance (those filters are mostly temporary in nature). They can still change in some cases, but we’re talking maybe once or twice per day per client at most, typically zero times.
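
If that reactive get does turn out to matter, we could always read the filters non-reactively inside the autorun, something like this (a sketch of the idea, assuming the autorun wrapper described above):

// Sketch: read orderfilters once, without a reactive dependency, so a change
// to it cannot tear down and re-create the subscription.
Tracker.autorun(function () {
    const seller = SessionAmplify.get('seller'); // reactive on purpose
    const filters = Tracker.nonreactive(function () {
        return SessionAmplify.get('orderfilters'); // read non-reactively
    });
    orderHandle = Meteor.subscribe('orders', seller, status, filters, function () {
        Session.set('orders_loaded', true);
    });
});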

Yes, we are sure about this, or at least as sure as we trust Meteor DevTools Evolved. We can clearly see the update arrive in the DDP tab.

[screenshot: DDP update message showing the changed driver field]

Here is a picture for reference. If we now look at the Minimongo orders collection, those changes are not represented. And if we stop the subscription entirely, we immediately see all the documents in the collection dropped except this one, which stays frozen in its pre-change state.

We get no client errors that seem related but we do sometimes get a server error that looks like it could be relevant:

The Mongo server and the Meteor query disagree on how many documents match your query. Maybe it is hitting a Mongo edge case? The query is: {"$or":[{"seller_id":88935,"merchant":923},{"seller_publication_id":88935}],"status":"complete"}

But I have no idea what that’s about, and the timing does not seem to fit all that well.

Ah, yes, I have seen that before and it is gnarly. I gather that you don’t use redis-oplog? If I recall correctly, the last time I saw this error (some years ago) it was because a Mongo selector had an undefined value: Mongo filters out undefined values (as does EJSON, of course), but the Meteor server keeps them in. So what happens is that Mongo returns x documents while Meteor looks at them and only sees y documents (often 0).

When we had this bug there was nothing to do but restart the server; it made everything crash. Even method calls wouldn’t work anymore, because if a method does an update it doesn’t “fully return” until Meteor observes all those writes (a simplification, but broadly correct).

This can be quite hard to track down, since the error, I believe, comes from the observer rather than the find itself. But you can find it by checking all your calls to find for undefined keys.
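
To make it concrete, something like this is enough to trigger it (illustrative only, using the field names from your publication):

// Illustrative: suppose the user document has no merchant field at all.
const user = Meteor.users.findOne(this.userId); // user.merchant === undefined
const selector = {
    $or: [
        { seller_id: seller.entity_id, merchant: user.merchant }, // merchant: undefined!
        { seller_publication_id: seller.entity_id }
    ],
    status: status
};
// Mongo (with ignoreUndefined) drops the undefined key and happily matches documents,
// while Meteor's observer keeps it and matches nothing, so the two sides disagree.
return Orders.find(selector);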


Oh man, I can’t believe it, thank you so much. We deployed a hotfix immediately, just setting all possibly-undefined values in publications to sensible defaults, and as of right now we cannot replicate the issue anymore. We still have to go over this and solve the root problem of why those values were undefined in the first place, but at least for now the system works again.
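
The hotfix is essentially just this kind of guard at the top of each publication (a simplified sketch; the concrete defaults are placeholders):

// Simplified sketch of the hotfix: never let an undefined value reach the selector.
Meteor.publish('orders', function (seller, status, filters) {
    if (!this.userId || !seller) return this.ready();

    const user = Meteor.users.findOne(this.userId);
    const merchant = (user && user.merchant !== undefined) ? user.merchant : null;
    const safeStatus = (status !== undefined) ? status : null;

    // ...build the selector exactly as before, but with merchant and safeStatus...
});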

This error is really weird: a single malformed Mongo query turns the system into an unreliable mess in completely different places, for users who share none of the published documents with the one that caused the original problem. Either way, this is a huge relief and a weight off my back. Thanks again.


4 years ago they made this questionable decision… The problem can be “solved” by patching the mongo_driver to set ignoreUndefined: false, or by moving to redis-oplog (the problem is with the observe, not the actual write operation itself, IIRC).

That thread is a very interesting read. We weren’t aware of this behavior, and it seems like we’re not the first to trip over it. I guess we’ll be a lot more careful with our publication parameters from now on.

We actually do use Redis in-house, but not for this app because the DB is so small: only a few thousand documents totalling a few MB of commonly accessed data, so DB performance is not an issue at all right now.

Looks like I have this same problem (METEOR@2.10.0).
Where is the mongo_driver file located on the file system? What is the best way to patch this to change ignoreUndefined?
Thanks,
+Eric

You can find it in ./.meteor/local/build/programs/server/packages/mongo.js. We ended up going a different route because we didn’t want to deal with the overhead of maintaining a custom build. Instead, we created a custom publish function which sanitizes the selector: our publish function takes a callback that returns a Collection.publish cursor instead of a Collection.find cursor. A sample publication looks something like this:

publish("orders", function(merchant) {
    return Orders.publish({userId: this.userId, merchant: merchant});
}, ['order-view']);
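
The wrapper itself is roughly along these lines (a simplified sketch; stripUndefined, the role-list handling and Collection.publish are our own app-specific helpers):

// Simplified sketch of our wrapper; app-specific details (roles, nested/$or selectors) omitted.
// stripUndefined drops any key whose value is undefined before the selector
// ever reaches Mongo or the Meteor observer.
function stripUndefined(selector) {
    const clean = {};
    Object.keys(selector).forEach(function (key) {
        if (selector[key] !== undefined) {
            clean[key] = selector[key];
        }
    });
    return clean;
}

// Collection.publish: same as find(), but with a sanitized selector.
Orders.publish = function (selector, options) {
    return Orders.find(stripUndefined(selector), options);
};

// publish(): thin wrapper around Meteor.publish that expects a callback returning
// a Collection.publish cursor; the extra array argument is our role list.
function publish(name, handler, roles) {
    Meteor.publish(name, function (...args) {
        if (!this.userId) return this.ready();
        // role checks against `roles` would go here in our real code
        return handler.apply(this, args);
    });
}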

I don’t know if this is the best solution but it has worked well enough so far.