Strange phenomenon: Subsequent methods or subs won't succeed

jpperrin · February 5, 2021, 7:52pm

This happened to me several years ago, so I can’t remember what the exact issue was, and Meteor could have changed since then. But I started using this.ready() in all my publications to ensure that they are returned.

Not 100% if all this info is still relevant, since this is from 2015, but this post explains things: This.ready in a publication? - #10 by manuel

waldgeist · February 5, 2021, 8:26pm

@peterfkruger was so nice to have a look at the system with me, in a longer web session

What we noticed, and what’s pretty strange: The production server and the staging server behave completely differently, using (almost) the same code base.

On the staging system (which is running on a t2.small instance), a typical pub trace looks like this:

which is pretty normal.

Whereas on the prod server (running on a t2.medium), it looks like this:

My explanation for the multiple subscriptions: This is a map where we panned across, which causes parameter updates and thus re-subscriptions for another map viewport.

The reason why one of the sub takes longer than the others is that this was on a low zoom level so there’s more data on the map. But still, we’re talking about a hundred vs. 5-6k elements, this won’t explain why it takes 1000x longer.

Plus, we did exactly the same user interactions on both the prod system and the staging system, so we are wondering why on the staging there is almost no wait time, whereas on the prod there is (same data)? There was no significant user load on both of the systems at that time.

The reason why “Not using oplog” shows up here is that I am using a $nearSphere query for this sub. As I just recently learned, Meteor falls back to non-oplog in this case. But that’s also the same for the staging system. I even stopped and restarted the EC2 instance now, but the problem still persists. I’m pretty clueless why this happens.

(Another very strange thing: If I start the app in Chrome right after starting the browser, the initial map load works fine. But as soon as I reload the page or pan to another map location, the sub starts to stall again. On Firefox and Safari, however, the problem occurs even on initial page load. @peterfkruger accessed the same systems on his PC hundreds of miles away, with the exact same results. So it’s not a network issue or something like that.)

What puzzles me the most: How can the same app under (almost) identical conditions (slightly larger EC2 instance, larger MongoDB cluster on the prod system) behave so differently?

waldgeist · February 5, 2021, 8:31pm

I am curious why this is even necessary. Is it because the DDP connection waits until the first ready() is being called? I mean, pubs don’t really “end”, unless you unsubscribe.

jpperrin · February 5, 2021, 8:37pm

From Publish and subscribe | Meteor API Docs

If a publish function does not return a cursor or array of cursors, it is assumed to be using the low-level added / changed / removed interface, and it must also call ready once the initial record set is complete .

waldgeist · February 5, 2021, 8:41pm

This should only be necessary for pubs using the low-level API. But maybe I missed something?

jpperrin · February 5, 2021, 8:51pm

Yeah, you’re correct. But if you don’t return a cursor or an array of cursors from the pub, then it assumes your using the low-level API.

So it’s just a best practice that you ensure that all pubs are returning either a cursor, an array of cursors, or this.ready() to avoid any potential issues.

Taken from http://docs.meteor.com/#/full/meteor_publish

// server: sometimes publish a query, sometimes publish nothing
Meteor.publish("secretData", function () {
  if (this.userId === 'superuser') {
    return SecretData.find();
  } else {
    // Declare that no data is being published. If you leave this line
    // out, Meteor will never consider the subscription ready because
    // it thinks you're using the added/changed/removed interface where
    // you have to explicitly call this.ready().
    return [];
  }
});

As explained in this comment

this.ready() is more straight forward than an empty array

afrokick · February 5, 2021, 9:30pm

@waldgeist did you try to debug client side sub with meteor dev extension for Chrome? It’ll tell you about ddp messages, local collection size. Maybe you receive a huge initial batch of documents?

Where are you locate db for staging and prod? At the same DC with app?

Do you use any proxy like nginx?

waldgeist · February 5, 2021, 9:38pm

Yes, I did. On the staging system, the data comes in steadily, while on the prod system it comes in chunks, with long delays inbetween.

Yes, that’s correct (if 6k small documents is huge), but this is exactly the same on both systems. Also, if I zoom into the map, the number of documents gets reduced drastically, but the problem still appears if I pan around (only a handful of docs will be changed in this case). On the staging system, the data comes in almost instantly (if panning around the map), while on the prod system it takes ages.

Both at Mongo DB Atlas. Staging uses the free tier, while prod uses a paid tier. Otherwise, they are identical.

No, because it’s two different providers (AWS vs. Mongo DB), but that’s also the same in both cases.

EDIT: It actually does run in the same datacenter at AWS. Forgot that I deployed the ATLAS cluster there.

Yes, on both systems. I am deploying with mup.

rjdavid · February 6, 2021, 2:48am

Any chance of an index issue? Were you able to rule out that this is not a db issue?

waldgeist · February 6, 2021, 6:49am

Indexes are the same, since the SimpleSchema definitions are the same. Also the stats on Atlas looked pretty much the same. In Monti APM, I couldn’t find a way to see the database query time for pubs. But for methods, they are neglectable, as you can see in the first images above.

The problem seems to be caused by the immense “wait times” in the pubs.

rjdavid · February 6, 2021, 6:53am

I do not know this. Is there a way to automate the db index using SimpleSchema? We have to manually do this on a per query basis

waldgeist · February 6, 2021, 8:01am

It’s actually part of Collection2, which extends SimpleSchema (both by aldeed):

rjdavid · February 6, 2021, 8:39am

Thanks. Learning a number of things from this thread

waldgeist · February 6, 2021, 9:57am

I unblocked my pubs now, but this resulted in a noticeable performance degradation.

For a test-case, I zoomed into my map, waited until it has been rendered, and then zoomed out again. Without the unblock, the rendering of the larger view-port takes about 2.3s, but with the unblock, it takes 2.8s, sometimes even over 3s.

I also deployed this version to the prod system. But nothing changed. The problem still persists, the app is barely usable.

CORRECTION: It’s in fact a bit better now on the prod system. The initial wait times are gone:

But still it takes much, much longer than on the staging system until the actual content is being rendered (this is what made me believe the sub did not return). The same operation, which takes 2s on the staging system, takes 12s - 16s on the prod system. Same code base. And it doesn’t really explain why the staging system did not show these problems in the first place, even without unblocks.

The same query on the staging system looks like this:

That’s a factor 10 in oplog wait time (17.1s vs. 1.5s), which explains the huge delay. For the same amount of data. The staging database is a 1:1 copy of the prod database, copied with 3T Studio.

waldgeist · February 6, 2021, 12:36pm

Ok. I temporarily connected my staging system to the Mongo DB cluster of the prod system. And voilà, it showed the same problems. Seems as if the Atlas cluster is the actual culprit. Now I have to find out how to restart that thing.

waldgeist · February 6, 2021, 1:31pm

Update: Moved the database to a new cluster now, and everything is back to normal. This wasn’t really related to the original question, but at least the server is up and running again.

peterfkruger · February 6, 2021, 1:46pm

Yay That’s quite something, we’ve been chasing a ghost!

waldgeist · February 6, 2021, 2:19pm

Yeah, thanks again for your support!

PS: I still don’t understand why it made a difference when I started Chrome from scratch, if the database was the culprit. But well…

waldgeist · February 12, 2021, 9:33am

Hi @filipenevola I just wanted to report back that adding this.unblock() to a publication has a serious side-effect I was not aware of. If you do that and change the sub parameters, the pub will send all documents again, even if they already had been part of the scope of the original sub. You can read more about that in this thread:

I have been chasing this problem for days now, and finally remembered that I had added this.unblock() to all pubs. And as soon as I removed this, the pub behaved as I expected.

For a geospatial query like in my case, this side-effect is fatal, because thousands of results might be removed and re-added, which causes significant delays. It also explains why my pub got noticably slower after adding this.unblock.

Is there any documentation anywhere that explains this pub behavior in more detail? Even the DDP docs don’t describe this. I even tried to analyze the DDP code, but this magic seems to be implemented in the server-side Minimongo oplog merge mechanism, which I learned about just recently.

The question is: How can I now work-around the original problem described in this thread?

afrokick · February 12, 2021, 10:38am

If your subs transfer a lot of documents and it is high load, you should replace it with methods and custom cache layer(over minimongo or another state management like mobx/redux).

Then you can add debounce/throttling to send requests as you need.

Also, you can reuse cached docs from cache and don’t receive it again(if your geo-points immutable for some time).

We use it to get users info like name, birthday etc., avatars.

Another case - chat application. We load channels and messages at startup, then subscribe to redis events to update data about channels/messages.

Yeap, it is need more works to done, but works more predictable and performance than raw pub/sub.

PS Pub/sub works great at some cases. But for high load it should be ignored.