Reactivity problems after migrating to meteor 3.0.3

Hello helpful meteor experts :wave:

We are having trouble tracking down some reactivity problems after updating to Meteor 3.0.3. We had already done all the async migrations on the last stable build, which still runs Meteor 2.10.0. The 3.0.3 build only updated Meteor, all the packages and the Node version. Building worked without problems, and the build passed all tests on our staging environment and even ran for a couple of days on production without us noticing any problems.

Then we realised that some Meteor methods sometimes no longer work correctly, but we cannot reproduce it reliably. It only happens on production, not on any other environment (staging, develop, …), which run the same stack and build, just with lower server resources and less traffic. Even on production there are stretches of up to several hours where everything works, followed by hours where we can reproduce the problem.

The problematic methods run through without error, and the connection receives the result DDP message as well as the added and changed messages for changes to subscribed documents, but the final updated message is missing, which results in the method callback not being run. We are still using the old Meteor.call with callbacks. These are async, server-only methods, so no stubs are involved, which, if we understand the migration docs correctly, should still work fine.

We are making heavy use of subscriptions (which is probably unnecessary) on a lot of collections, but we are only having issues in one area of the webapp (a lower-traffic area, even), with a couple of methods related to one subscription. Some methods that make changes to the same subscribed data always work completely fine, while others are missing the updated message. App and Mongo server load seem fine, and even doubling resources did not help. Rolling back to the 2.10.0 build removed the problem.

I know that it is probably hard to help without a way to reproduce the problem, but maybe someone has some advice on how to debug in this situation without using production as a testing environment. If there is some under-the-hood DDP documentation, or instructions on how to add logging to Meteor internals, that would be helpful as well.
From what I gathered from code comments on GitHub, the updated message is sent to signify that all updates to the subscribed data changed by the method have been sent out. However, the added and changed messages are coming in (see screenshots below), but the method is still stuck not sending the updated message. I did not really understand the details of how the system works, though. Any ideas about what might be going wrong there are appreciated.
I added simplified example code of the subscription, a working method and another one that has this problem, in case that helps.
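One way to narrow this down without touching server code could be to watch the incoming DDP messages on an affected client and flag method ids that received a result but never an updated. Below is a minimal sketch, assuming the message shapes from the DDP specification (`result` carries an `id`, `updated` carries a `methods` array). The `UpdatedWatcher` class and its method names are hypothetical, and the hookup shown in the trailing comment relies on `Meteor.connection._stream`, which is an internal, undocumented API that may differ between Meteor versions.

```typescript
// Subset of the DDP message fields this sketch cares about.
interface DdpMessage {
  msg: string;
  id?: string;
  methods?: string[];
}

// Hypothetical helper: remembers which method ids got a `result`
// but have not yet been confirmed by an `updated` message.
class UpdatedWatcher {
  // method id -> timestamp (ms) at which the `result` arrived
  private awaitingUpdated = new Map<string, number>();
  // ids whose `updated` arrived before their `result` (DDP allows either order)
  private updatedFirst = new Set<string>();

  // Feed every raw incoming DDP frame (a JSON string) into this method.
  handle(raw: string, now: number = Date.now()): void {
    const message = JSON.parse(raw) as DdpMessage;
    if (message.msg === "result" && message.id !== undefined) {
      // Only start waiting if `updated` has not already been seen.
      if (!this.updatedFirst.delete(message.id)) {
        this.awaitingUpdated.set(message.id, now);
      }
    } else if (message.msg === "updated" && message.methods) {
      for (const id of message.methods) {
        if (!this.awaitingUpdated.delete(id)) {
          this.updatedFirst.add(id);
        }
      }
    }
  }

  // Method ids that have waited longer than `maxWaitMs` for `updated`.
  stuck(maxWaitMs: number, now: number = Date.now()): string[] {
    return Array.from(this.awaitingUpdated.entries())
      .filter(([, receivedAt]) => now - receivedAt > maxWaitMs)
      .map(([id]) => id);
  }
}

// Possible hookup in a Meteor client (assumption, internal API):
// const watcher = new UpdatedWatcher();
// Meteor.connection._stream.on("message", (raw: string) => watcher.handle(raw));
// setInterval(() => console.warn("no updated for", watcher.stuck(5000)), 5000);
```

Logging the stuck ids together with a timestamp would make it possible to line them up with server-side metrics for the same period.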

Unless we get a good hint, we will have to stay on meteor 2, change our method calls to async and get rid of all the unnecessary subscriptions and then try again.

Thanks for sticking with me and any potential advice :pray:

Screenshot: Missing updated message

Screenshot: Working updated message

Simplified example code

The “docs.createCopy” method has the problem
The “docs.removeCopy” method always works

Meteor.publish("docs", function(docId) {
  <checks and validations>
  return Docs.find({ $or: [{ _id: docId }, { referenceList: docId }] });
});

Meteor.methods({
  "docs.createCopy":async function (
    docId: string,
  ): Promise<Doc> {
    <checks and validations>
    const doc = await Docs.findOneAsync({newDocRef: docId});
    const newDoc = {referenceList: doc.referenceList ? [...doc.referenceList, docId] : [docId], ...moreProp}
    const newDocId = await Docs.insertAsync(newDoc); // insertAsync resolves to the new _id
    await Docs.updateAsync({ _id: docId }, { $set: { newDocRef: newDocId } });
    return { ...newDoc, _id: newDocId }
  },
  "docs.removeCopy": async function (docId: string): Promise<void> {
    <checks and validations>
    const referencingDoc = await Docs.findOneAsync({newDocRef: docId});
    await Docs.updateAsync(
      { _id: referencingDoc._id },
      { $unset: { newDocRef: 1 } },
    );
    await Docs.removeAsync({_id: docId });
  },
})
4 Likes

I have no production deployment yet but am starting at 3.0.4 - rapid prototyping mission critical system migrations. I will take your case to heart and try to reproduce at mid-scale while I have the old system still running in parallel to compare and see if anything gets missed. ( Mirroring redundantly across system eras. )

Great near-repro there. Can we hear more about your surroundings to stay apples-and-apples in comparison?

I have Vue.js for the client, and also run a persistent headless client cluster so I can probably spot this too even if you have another situation on the overall.

Glad to have a scare to prevent happening to me.

2 Likes

Sure, most interesting is probably, anything closely related to meteor core logic:

Mongo 6.0.18
  • hosted on atlas
  • 3 app nodes, 2 analytics nodes (on production)
  • other envs are on the same version
  • I checked all the monitoring metrics there and there is nothing out of the ordinary. Our cluster needs to scale to the next tier under load, which was not possible when this issue occurred (because the next tier was overloaded - it was just disabled in atlas - wtf), but the issues persisted even under the next higher tier.
Meteor versions

accounts-base@3.0.2
accounts-oauth@1.4.5
accounts-password@3.0.2
allow-deny@2.0.0
autoupdate@2.0.0
babel-compiler@7.11.0
babel-runtime@1.5.2
base64@1.0.13
binary-heap@1.0.12
blaze-tools@2.0.0
boilerplate-generator@2.0.0
caching-compiler@2.0.0
caching-html-compiler@2.0.0
callback-hook@1.6.0
check@1.4.2
core-runtime@1.0.0
ddp@1.4.2
ddp-client@3.0.1
ddp-common@1.4.4
ddp-rate-limiter@1.2.2
ddp-server@3.0.1
diff-sequence@1.1.3
dynamic-import@0.7.4
ecmascript@0.16.9
ecmascript-runtime@0.8.2
ecmascript-runtime-client@0.12.2
ecmascript-runtime-server@0.11.1
ejson@1.1.4
email@3.1.0
es5-shim@4.8.1
facts-base@1.0.2
fetch@0.1.5
fourseven:scss@4.17.0-rc.0
geojson-utils@1.0.12
hot-code-push@1.0.5
html-tools@2.0.0
htmljs@2.0.1
http@1.4.4
id-map@1.2.0
inter-process-messaging@0.1.2
launch-screen@2.0.1
localstorage@1.2.1
logging@1.3.5
meteor@2.0.1
meteor-base@1.5.2
meteortesting:browser-tests@1.7.0
meteortesting:mocha@3.2.0
meteortesting:mocha-core@8.2.0
minifier-css@2.0.0
minifier-js@3.0.0
minimongo@2.0.1
mobile-experience@1.1.2
mobile-status-bar@1.1.1
modern-browsers@0.1.11
modules@0.20.1
modules-runtime@0.13.2
mongo@2.0.2
mongo-decimal@0.1.4
mongo-dev-server@1.1.1
mongo-id@1.0.9
npm-mongo@4.17.4
oauth@3.0.0
oauth2@1.3.3
ordered-dict@1.2.0
percolate:migrations@2.0.0
promise@1.0.0
random@1.2.2
rate-limit@1.1.2
react-fast-refresh@0.2.9
react-meteor-data@3.0.2
reactive-var@1.0.13
reload@1.3.2
retry@1.1.1
routepolicy@1.1.2
service-configuration@1.3.5
sha@1.0.10
shell-server@0.6.0
socket-stream-client@0.5.3
spacebars-compiler@2.0.0
standard-minifier-css@1.9.3
standard-minifier-js@3.0.0
static-html@1.3.3
templating-tools@2.0.0
tracker@1.3.4
typescript@5.4.3
underscore@1.6.4
universe:i18n@3.0.1
url@1.3.3
webapp@2.0.2
webapp-hashing@1.1.2
zodern:types@1.0.13

Full build script

Changes:

  • Node 20.17.0
  • debian:bookworm-slim (had to upgrade from debian:buster-slim)
  • --legacy-peer-deps is necessary now
FROM debian:bookworm-slim AS buildenv
# install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
ca-certificates \
curl \
git \
python3 \
build-essential \
&& rm -rf /var/lib/apt/lists/*

# Create build user (security: don't use the root user)
RUN groupadd -f build && \
useradd -l -g build -m -d /build build
USER build

# Install meteor
ADD .meteor/release /build
RUN curl "https://install.meteor.com/?release=$(grep -Po '(?<=METEOR@)[\d\.]+' /build/release)" | sh
ENV PATH="${PATH}:/build/.meteor"

# Copy npm package info while granting ownership rights to the build user
ADD --chown=build:build package.json package-lock.json /build/app/
WORKDIR /build/app

# Install npm build dependencies
RUN meteor npm install --legacy-peer-deps
ADD --chown=build:build . .
RUN METEOR_DISABLE_OPTIMISTIC_CACHING=1 meteor build /tmp --headless --server-only

# Update this on every change to .meteor/release
# The recommended node version is specified in the README of the compiled app.tar.gz
# Multiple FROM commands are executed on separate layers (may exchange data).
# The last layer is used on container startup
FROM node:20.17.0
ENV PORT=4000
EXPOSE 4000

# This is the command executed by default on container startup
CMD ["node", "main.js"]

COPY --from=buildenv /tmp/app.tar.gz /
RUN tar -zxf app.tar.gz && \
rm app.tar.gz

# install runtime dependencies
WORKDIR /bundle/programs/server
RUN npm install --legacy-peer-deps --production

WORKDIR /bundle
USER node
  • React 18.3.1
  • Default build tool

Happy to provide more details if there is anything else of interest?

@kfritsch, is there any chance that you are stress-loading this instance? Or this might be happening when the system is lacking resources?

There was a pending issue posted by the core team related to performance wherein under stress, Meteor 3 missed some updates compared to the same app running in Meteor 2.

2 Likes

I doubt that the app servers were stressed at that time, at least not when it comes to cpu or memory (The spike to 100% was after we were back on meteor 2). Is there another metric I should check there?

App server cpu and mem

When I look more carefully at the Mongo metrics, then yes, the Mongo primary node might be stressed quite a bit. The Mongo memory usage is at max and stays maxed out even when the tier is increased. There is supposedly no swap usage, but I think the CPU says differently. IOwait and softIRQ are rather high, and there is a constant disk queue as well. The short M40 period (24th, ~10am-2pm) looks okay CPU-wise, but memory is still maxed out. Problems were still noticeable during that time. Are there any other important metrics I should keep an eye on here?

Mongo metrics




So I guess reducing subscription usage and writes where possible and optimizing oplog tailing might actually help. And we should probably stay on M40 until we can decrease the load.

Took me some time to fully grasp what those graphs are showing. Thanks a lot for the hint @rjdavid :pray:

@kfritsch, check this topic

Another metric to watch for is event loop latency, as you might not need the CPU at full load for Node.js to experience performance issues.

Once the event loop starts to lag, things slow to a crawl or get out of sync. It seems to happen in high-load scenarios due to garbage collection, a huge number of callbacks, promises, I/O, or many other potential reasons that can block the event loop.
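For anyone who wants to add this to their own tracing, here is a minimal sketch using Node's built-in `perf_hooks.monitorEventLoopDelay`, which reports delays in nanoseconds. The helper names (`toStats`, `startEventLoopMonitor`), the 20 ms resolution, and the reporting callback are assumptions chosen for illustration, not something Meteor provides.

```typescript
import { monitorEventLoopDelay } from "node:perf_hooks";

const NS_PER_MS = 1e6; // the histogram reports nanoseconds

interface DelayStats {
  meanMs: number;
  p99Ms: number;
  maxMs: number;
}

// Converts the histogram values we care about into milliseconds.
function toStats(histogram: {
  mean: number;
  max: number;
  percentile(p: number): number;
}): DelayStats {
  return {
    meanMs: histogram.mean / NS_PER_MS,
    p99Ms: histogram.percentile(99) / NS_PER_MS,
    maxMs: histogram.max / NS_PER_MS,
  };
}

// Hypothetical helper: samples event loop delay and reports it every
// `intervalMs`. Returns a function that stops the monitoring again.
function startEventLoopMonitor(
  intervalMs: number,
  report: (stats: DelayStats) => void,
): () => void {
  const histogram = monitorEventLoopDelay({ resolution: 20 });
  histogram.enable();
  const timer = setInterval(() => {
    report(toStats(histogram));
    histogram.reset(); // start a fresh window for the next report
  }, intervalMs);
  return () => {
    clearInterval(timer);
    histogram.disable();
  };
}

// Example usage on the server:
// const stop = startEventLoopMonitor(10_000, (s) => console.log("event loop delay", s));
```

A p99 or max that climbs during the problematic hours while CPU stays moderate would point at event loop lag rather than raw resource exhaustion.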

That, perhaps combined with Async Local Storage/Async Hooks, which Meteor is now heavily reliant on, might be a bad combination in some extreme cases.

We will look into it to understand what might be happening and will keep you posted.

2 Likes

We were not aware of the whole node event loop latency topic. We will have to add that to our tracing, thanks for pointing that out. We will also follow the meteor 3 performance ticket, but I think there is actually quite a bit we can do on our side to reduce our resource usage. Will give an update when we manage to upgrade to meteor 3 without issues :crossed_fingers:

1 Like

Alongside the specific question, wondering if any Meteor ninjas or historians can recall if there are other measures being taken in the wild, with or without performance refactoring done first. Not expecting to keep old/bad/lesser practices under the hood, but treating those like a simulation of future load to spread:

Curious if the/a standard remedy in that situation is to offload to background processing wherever possible, with a message queue? This seems like it might help while the innards are being refactored for performance, but then stay valuable after the technical debt is reduced for Meteor in future optimizations.

Not sure if anyone has BullMQ or similar implemented with Meteor for cases with I/O especially, which can be delegated out of the process; rather than a callback with I/O happening in the same process, when there can just be waiting without I/O for a reply on the message queue. Is there known advice on practices like this, rather than optimization/performance, going toward proactive decoupling and decentrality?

I come from a parallel/asynchronous/concurrent/distributed background, and I have been eyeing the Meteor design through that lens. In this case it seems like that might be popping through.

Is something like this train of thought how processing is delegated and load mitigated around here? What do you use for job management in your production app?

This looks similar to what @radekmie is going for with DDP Router:

I am not aware if anyone is pursuing this on their own.

I’ve been researching real-time and performance for at least a few weeks, and as I understand it better, I think we need to revamp the architecture a bit, from researching turning key code into Rust/WASM, worker threads, or even a separate queue for real-time changes.

Right now, all document updates are processed eagerly, which results in things like “oplog flooding”. If we had a queue somewhere, I think it would be fine for it to delay execution by a few seconds if there is a peak of updates. Perhaps even having some kind of back-tracking or analysis for dropping repeated operations, away from the main Node.js process, perhaps even written in Rust for speed.
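To make the "dropping repeated operations" idea concrete, here is a small sketch of coalescing queued change operations per document, so that a burst of updates to the same document collapses into one. This is not how Meteor's oplog handling works today; the `ChangedOp` shape and `coalesceChanges` function are hypothetical, and the sketch only covers field changes, not inserts or removals.

```typescript
// Hypothetical shape of a queued "document changed" operation.
interface ChangedOp {
  id: string;
  fields: Record<string, unknown>;
}

// Merges all queued changes for the same document into a single operation.
// Later field values win; documents keep the order of their first appearance.
function coalesceChanges(ops: ChangedOp[]): ChangedOp[] {
  const merged = new Map<string, ChangedOp>();
  for (const op of ops) {
    const previous = merged.get(op.id);
    if (previous) {
      Object.assign(previous.fields, op.fields);
    } else {
      // Copy the fields so the caller's objects are not mutated.
      merged.set(op.id, { id: op.id, fields: { ...op.fields } });
    }
  }
  return Array.from(merged.values());
}
```

Note that merging like this changes the relative order of updates across different documents, so it would only be acceptable where per-document consistency is what matters.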

Like DDP Router but at the app level, away from limitations from Node.js/Async Hooks etc.

I am having all sorts of crazy ideas; perhaps one of them is valid. However, I want to refactor the mongo package to TypeScript as much as possible and try the safest bets first before trying some of the crazier ones.

The next logical step in our roadmap in improving real-time seems to be DDP batching, grouping as many added/updated/removed calls together as possible and prevent spamming so many microtasks.
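As a rough illustration of what batching means here (not the planned Meteor implementation, whose design is not described in this thread), the sketch below buffers outgoing messages and hands them to a send function in groups. The `MessageBatcher` class, its size limit, and the explicit `flush` call are all assumptions; a real version would also flush on a timer or at the end of a tick.

```typescript
// Hypothetical batcher: collects messages and sends them in groups
// instead of one send call per message.
class MessageBatcher<T> {
  private buffer: T[] = [];
  private readonly maxSize: number;
  private readonly send: (batch: T[]) => void;

  constructor(maxSize: number, send: (batch: T[]) => void) {
    this.maxSize = maxSize;
    this.send = send;
  }

  // Queues a message and flushes automatically once the batch is full.
  push(message: T): void {
    this.buffer.push(message);
    if (this.buffer.length >= this.maxSize) {
      this.flush();
    }
  }

  // Sends whatever is buffered; does nothing if the buffer is empty.
  flush(): void {
    if (this.buffer.length === 0) {
      return;
    }
    const batch = this.buffer;
    this.buffer = [];
    this.send(batch);
  }
}
```

With a batch size of 2, pushing three messages and then calling `flush()` results in two send calls: one with the first two messages and one with the third.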

Then we perhaps go to Change Streams, and after that we will need to stop playing safe and go to some of the crazier ideas.

I would love to chat/meet about any of this if anyone is up to it.

4 Likes

Thanks very much for the detailed comments @leonardoventurini. Glad I am going down a known path but one less traveled, which might be an edge of Meteor … I would fan to flame your crazy side.

I am encouraged that you are brave and open minded, it seems. And I see but dismiss burn-in from the current way it is; you are immersed in how it is now, but your mind sounds of breaking out of the jailed shell of how we experience Meteor today, I believe. I meet you there and apply to be your shoulder devil.

Going to take time to consider what you said at more depth as it pertains to the current way and circle back to discuss further, toward Meteor improvement versus just my implementation. Before thinking from Meteor itself then, I will think from “the situation in the wild” as it is, with or without Meteor … and share direction a bit in trade to your bravery within Meteor today. Based on your comments I can feel where the edge of the issue is now.

I am coming in fresh to Meteor but with a lot of history. Especially after working with the Actor model for many years. As well as writing microservices architectures ( using message queues ) and dealing with ZeroMQ at great depth. I deployed Actors in pretty much every type of inter/trans/action you can imagine.

Eventually it got to be too cookie-cutter, and limited to a certain approach when it no longer fit. I saw distribution of load is actually a clue we are doing design wrong, at the level of fundamentals. I think of how we overload IMAP and SMTP, how SMS / MMS and SIP are still based on Telegraph and PSTN, etc. Anyone looking at 10 DLC in that spirit will notice we cannot make enough acronyms to cover our shame, and we are running from being pretty much horribly ugly at a design level. That cannot be smoothed out at the edges but speaks to a rotten core. All those foundational systems were improvised, especially TCP/IP, and often commissioned by military, i.e. US DoD kicking off TCP/IP versus Recursive Internetworking for example, which dramatically impacts design, so much so RINA the infrastructure would completely obsolete this conversation right now. A lot of what we are dealing with is TCP/IP itself, and Meteor sits smack dab in the middle of that design warzone.

We use “message queues” today but they are still in the same lineage as smoke signals, which are the first real example of “synchronous telecommunication” before Telegraph, Telephone, etc. We are in that mental space still.

Now our improvised designs are obese and vestigial, and everyone deals with the issue in a silo. Everyone teaches design philosophies which are pretty much pure opinion versus lift and reuse natural structures. Anyway so I left that Actor model mental space, because it is running home from its own vasectomy and calling that performance, even though it is trustworthy and easy to supervise, and did massive-level scaling well. Rather than trivial things it is deployed in life-threatening situations, like energy distribution, covering regional electrical fail-over with no room for error on deadlocks or bottlenecks since thousands of people on the sensitive side might die pretty much right away, even just if a hiccup happened. Bear with me I am heading back to Meteor today now.

Reason to leave Actor behind is because it does not really do "Distributed Applications" which is what we are all working toward with multi-modal systems ( mobile, desktop, browser; multi-instance, multi-tenant, multi-location, etc. ) … it is a hack. A very scholarly, proud, section of duct tape reinforced with chewing gum. And this is the pressure point Meteor is sitting on the red button of, and so… I see it very simply, and would contribute this nudge:

There is more than server and client … there is also agent ( in my terminology, beyond Actor which is still too vague and abstract ) … and from there, I almost want to bet >70% moves out of the server application.

That approach though is an entirely different design pattern, similar to MVC in usefulness, going beyond the transient “full-stack” front/back-end hellscape, but it is easier to think of it as coming backward from the future, rather than remedying the dumpster fire we call CS/IT/AI/*aaS, with a partridge in a pear tree.

In the background of arriving to Meteor I have been coming in with a design style which forces the issue. So if you are at the edge of the rabbit hole… count me a friend and I would love to destroy your sense of orthodoxy :slight_smile: Right now I see Meteor as if we traveled from the Sun to Earth except for the last meter. That last meter is a head change, and getting out of the “smoke signal” era, then backporting to there, rather than being the heirs of quills and horses carrying slips of paper as determining factors in the rise and fall of ecosystemic evolution.


In the meantime I will keep thinking this way and get to a point of sharing code once I have the style above ported to Meteor shortly, which I would not necessarily share publicly out the gate. Thinking about how to broach that since it is right on the vein of the field, and goes back to before the 60s. Mostly I want to unleash it and not sign up to bikeshed, so I hug you for your bravery versus feel a need to speak in school. I would love to meet at the mind on totally annihilating all our core beliefs and expectations on what “excellent performance” means, to leap 1 meter which is all the difference in the world. I can thread-off somewhere else to add fuel to the fire of design bravery if preferred.

Thanks again and I am very glad this thread poked through to the most salient contemplations, in my opinion.

1 Like

Moved out of the way of this thread ( too late ) when I almost came back just to park my bike in the shed less slovenly, and caveat “smoke signal” comment. Expanded to the actual point here:

1 Like

We are heavily using Bull in our app, i.e., our instances running Bull have more resources than the traditional meteor servers, which clients directly connect to.

Our generic guide is to prevent any single client connection from affecting another client connection. Therefore, anything that requires disk access, heavy db processing, or access to non-local apis must be passed to Bull.

Ideally, all those processes are passive (the client does not need to wait for the result/output). For those identified as active, we first figure out if we can make them passive (mostly UX redesign).

Next, we have notifications built into most of our apps. Can we notify the user once the process has been completed, e.g., “Your marketing video is now available.”

Lastly, if we have no choice but to make it active, we use Meteor’s reactivity to inform the client that the output is ready.

Our servers do not wait for the Bull output.

1 Like

Vital distinction @rjdavid:

I am digging deep into BullMQ now, after wading in yesterday, moving over a large design. I will move over to the client/server/agent thread here as that progresses.

When you say “passed to Bull” then … do you have different applications which are not native Meteor applications so to speak? How many repositories is your overall implementation of Meteor if including “not client” and “not server” Bull focused instances? And how many of those do you have?

I run along the Crystal language lineage of compiled high-level OOP, and high-performance Ruby interpreters ( rubinius / jRuby before 3 ) which more or less worked to drain every iota of performance out of the container efficiently, so when going into node.js based systems I feel like I am flying a spaceship a city block. Trying to get over that feeling and be able to see the lightest possible and yet most versatile and resilient implementation. For example, down the line, I would see replacing the “agent” with Crystal if I can really get to language agnostic queues.

Pardon me for expanding the topic on this thread further but referring to another thread for reply, if you don’t mind @rjdavid. It is refreshing to hear your firm-crease take on things. Feels hardened. I anticipate focusing on this in Meteor in some form, even if it is code-back versus code-in. I feel like there is a fork in the road here on a design level, based on what really happens in the wild. That is why I ask “how many repositories do you really have” or “how many build processes” or “how many deployment processes” do you have, or is all of your “Meteor app” really one “Meteor app”

Our main Bull app also runs under Meteor, but as a headless app. We have non-Meteor lambdas and workers, but they are still invoked from the main Bull app.

We have a project comprised of 6 Meteor apps with shared code like collections, methods, helpers, etc, and shared DB (MongoDB and Redis). Therefore, we can do reactivity with changes coming from one app with the subscription affecting another app. One of those apps is the Bull app, which we further divide into 3 server groups according to the queue: HeavyCPU jobs, critical jobs, and the rest. Technically, these 3 groups run the same Meteor app but process different queues. I did not count but I will not be surprised if we have more than 100 queues.

What a monster @rjdavid :clap:

Nothing motivates like knowing the bad ass next to you in the space, and where the highest bar is today before it moves forward, and then again, and then again. I love that architectures like this are next door at all times, in Meteor land. Thank you for sharing broad strokes.

So many questions, but no rush; as each one comes up I would keep checking with you and others like you. I look at it as the HOA discussion, so to speak, as we each build and grow side by side here. You and others have a lot of history, but in your example, you did not go soft in all that time.


For now, I immediately wonder:

  1. How many Worker objects do you run per process/container as a rule of thumb? I understand load and work-type is different, but starting from “not doing anything just waiting” as in the general scope of this post… what is a safe limit on how many Worker objects per process/container, with some reference point on resources also?
  2. In my example in the wider thread talking about having flagship instances out there, and I even would Open Source the brand and company itself to be able to transparently show end-to-end at all points of participation… so I wonder about things like going from the iconic $5/m droplet from DigitalOcean out to the high-resource and mixed-resource specialized virtual machines… or even going from physical hardware specifications backward to how much Meteor we can deploy to that, since I push to house, office, and vehicle-based machines.
  3. How do changes to spreading load look usually, and how hard is it to re-flow the Worker instances in practice? I imagine not that hard since BullMQ is dropping into Redis and even if nothing was pulling jobs for a moment, they persist and get processed once the new Agent strategy is live.

This is starting to feel like a content focus to really unearth the awesome side, versus focus too hard on the startup side. As a principal ( lead/solo stakeholder with maximum executive leadership and no oversight ) it is far more motivating to think from the standpoint of demonstrating freedom, rather than selling the side-hustle community on getting out of their W2. I see a leap here between RYO MVP mindsets and the Visible-From-Space mindset, where these systems don’t just scale out, they also go deep, and one day will touch every aspect of the individual digital life. With the flexibility and power we have though, there is no actual need for a centralized/other provider, other than for regulated/third-party aspects that are abstracted out by default.

Anyway, pardon my HELL YEAH and thanks a lot for the peek into the excellence next door.

When we started, the biggest issue was lambda vs jobs, with the concern of wasting instances if no job was running. We immediately realized that the savings from not running a small instance continuously are negligible compared to the complexity of not running a single Bull app. And when we changed our mindset about using a job queue, we realized we had a lot of things that we could move to it.

We have a config file that dictates which groups process which queues. If we notice that a job is eating too much CPU, we move it to the heavycpu group with correspondingly bigger instances. Scaling is just the same as for other servers. We track the average running time of each queue, and when there are long-running jobs, the normal solution is “can we divide this further into smaller jobs?”
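A config like the one described could look roughly like the sketch below, where each server group looks up which queues it should process at startup. The queue names, the `queueGroups` map, and the helper functions are invented for illustration; only the three groups (heavy CPU, critical, and the rest) come from the description above.

```typescript
type Group = "heavycpu" | "critical" | "default";

// Hypothetical config: any queue not listed here falls into "default".
const queueGroups: Record<string, Group> = {
  "video.render": "heavycpu",
  "payments.capture": "critical",
};

// Looks up the group responsible for a queue.
function groupFor(
  queueName: string,
  config: Record<string, Group> = queueGroups,
): Group {
  return config[queueName] ?? "default";
}

// Returns the queues that a server in `group` should process.
function queuesFor(
  group: Group,
  allQueues: string[],
  config: Record<string, Group> = queueGroups,
): string[] {
  return allQueues.filter((name) => groupFor(name, config) === group);
}

// Example: a server started for the heavy CPU group would only register
// processors for the queues returned by
// queuesFor("heavycpu", ["video.render", "email.send", "payments.capture"]).
```

Moving a job to bigger instances then becomes a one-line config change rather than a code change.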

The initial tendency for developers is to start doing all in one job e.g. querying all records and processing all records one by one. Then one day, we realized that those records are now hundreds of thousands so better that we create a smaller job to process per record.

Hey @kfritsch I added this test for the updated message and seems to be working properly, please review it in case I am missing something: