Meteor performance issues with collection observeChanges

luistensai · May 16, 2018, 2:11pm

Hi everyone!
I’m working with a meteor app that’s having some performance issues from time to time, and we’re not able to determine the cause.
I’ve found a way to run kadira in-house so we can take a better look at it, and found that sometimes there are really high times on collections observeChanges (I believe this is done by the publish/subscribe mechanism).
This is an example:

We do have an index on “profile.storeId” and “profile.accountId”; we ran the query using explain and it took almost no time to fetch the users.

Is there any way I can troubleshoot this kind of problem? Any advice is much appreciated. Thanks!

diaconutheodor · May 16, 2018, 4:05pm

@luistensai how big is that cursor? I don’t see any limits in there. How many documents are returned, and how big are they? You need to understand that observeChanges() fetches the data in full. Look here for scaling reactivity: https://github.com/cult-of-coders/redis-oplog

luistensai · May 16, 2018, 5:13pm

Hi @diaconutheodor
There are only 15 documents returned by that query. I checked in mongo and the cursor size is 1043 (I think that should be just about a kB). All 15 documents together should weigh less than 70k. We’re not trying to fetch a lot of users at the same time.
I’ve seen some issues with collection fetches too, e.g. trying to retrieve a user by email takes around 3 seconds.

You can see there’s a limit, and there’s an index on “emails.address” in the users collection by default. There’s actually no user with that e-mail, so it returned 0 results.
Thanks for your reply, I will investigate redis-oplog, but our current workload isn’t really that heavy; I really don’t understand why some operations take so long.

veered · May 16, 2018, 6:59pm

Could you try taking a profile with https://atmospherejs.com/qualia/profile ?

satya · May 16, 2018, 8:13pm

If you are using observe / observeChanges, note that there’s a memory leak issue there. My advice: do not use it in production! We’ve completely migrated away from using observeChanges, for example by replacing it with a method instead.

luistensai · May 16, 2018, 9:20pm

Hi @satya!
We’re not using observe / observeChanges afaik; are they just part of the normal publish/subscribe mechanism? If that’s the case, then we have a problem, as we have lots of subscriptions going on in our apps.
Thanks for your reply!

hluz · May 16, 2018, 10:38pm

Do the performance issues you are having correlate with bulk updates on the db (not necessarily on the same collection or the same process)?

diaconutheodor · May 17, 2018, 4:54am

@luistensai you’re looking in the wrong place. It’s not Meteor to blame, it’s the database. Make sure you have proper indexes set up, and make sure the database is on the same network as the app so it doesn’t spend too much time on the wire. Taking 3s to find a user by email is ridiculously slow.

satya · May 17, 2018, 6:30am

No the normal publish/subscribe doesn’t have the issue.

However, I strongly suggest moving from publish/subscribe to methods any time you don’t need reactivity, and I’m pretty sure most of the time you don’t need it. Just fetching data over methods is so much faster. In our product (https://www.postspeaker.com) we’ve replaced almost all publish/subscribe, except where we really need reactivity. It was a dramatic speed increase.

Also make sure to always limit your fields for every db query, and implement proper indexes.
Last, make sure your database is in the same data center or region as your app.
And finally check if you can move (publication) query logic to the mongo aggregation pipeline.

If you follow these patterns your app will be fast.
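
As a rough sketch of the "method instead of publication" idea with limited fields: the helper below assumes a Meteor-style collection whose `find(selector, options)` returns a cursor with `fetch()`. The names `fetchStoreUsers` and `makeStubCollection`, the field `profile.name`, and the default limit are hypothetical; the stub exists only so the snippet runs outside Meteor. In a real app the function body would sit inside a `Meteor.methods` handler.

```javascript
// Hypothetical body of a method that replaces a publication.
// `users` is assumed to expose Meteor's find(selector, options).fetch() shape.
function fetchStoreUsers(users, storeId, limit = 50) {
  const options = {
    // Return only the fields the client actually renders (field names are assumptions).
    fields: { 'profile.name': 1, 'profile.storeId': 1 },
    limit,
  };
  return users.find({ 'profile.storeId': storeId }, options).fetch();
}

// In-memory stand-in that records each query, for illustration only.
function makeStubCollection(docs) {
  const calls = [];
  return {
    calls,
    find(selector, options) {
      calls.push({ selector, options });
      return { fetch: () => docs.slice(0, options.limit) };
    },
  };
}

const stubUsers = makeStubCollection([{ _id: 'u1' }, { _id: 'u2' }, { _id: 'u3' }]);
const firstTwo = fetchStoreUsers(stubUsers, 'store-1', 2);
```

The point of the shape is that every query carries an explicit field list and limit, so nothing is observed and only the requested slice crosses the wire.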

luistensai · May 17, 2018, 12:37pm

@hluz we don’t currently have bulk updates on our db. Most of the updates are single document updates.

@diaconutheodor that’s a possibility, but finding the object actually takes no time: the find call is above and it took 0ms, while fetching the document or observing the collection is really slow. Our db is hosted by an external provider, ObjectRocket. They provide a direct wire to our datacenter; we have other applications running on Express that haven’t had latency problems so far, so I’d think the problem is in Meteor’s pub/sub mechanism or we’re doing something wrong.

@satya thanks for the advice, I think we need to move away from publish/subscribe. Looks like it’s an overhead right now. We liked the idea of reactivity, but it sounds like it’s giving us more problems than solutions right now.

As a side note, I should say that we notice the cpu going to 100% whenever we start seeing a rise in users logged into our app concurrently, so this might be related to the cpu spikes caused by Fibers and subscriptions.

diaconutheodor · May 17, 2018, 1:13pm

@luistensai just add redis-oplog and experiment with it, you don’t need to fine-tune anything, but I would definitely try it to see the outcome.

satya · May 18, 2018, 8:46am

There are other ways to solve reactivity. For example, you can poll a method, but you could also publish a count of the collection, and if the count increases, you fetch the actual data via a method. Or let the user click to fetch the new data (like Twitter does).
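
A minimal sketch of the "watch the count, fetch via a method" variant, under stated assumptions: `makeCountWatcher`, `callMethod` and `onData` are hypothetical names, and `callMethod` stands in for something like a wrapped `Meteor.call` that takes a Node-style `(error, result)` callback. The count itself could come from a small count publication or from polling.

```javascript
// Hypothetical helper: refetch through a method only when the count grows.
function makeCountWatcher(callMethod, onData) {
  let lastCount = 0;
  // Call the returned function whenever a new count arrives.
  return function onCount(count) {
    const grew = count > lastCount;
    lastCount = count;
    if (grew) {
      callMethod((error, rows) => {
        // Errors are ignored here for brevity; real code should report them.
        if (!error) onData(rows);
      });
    }
    return grew;
  };
}

// Stand-ins for the method call and the UI update, for illustration only.
let methodCalls = 0;
let latestRows = [];
const watch = makeCountWatcher(
  (callback) => {
    methodCalls += 1;
    callback(null, ['row-' + methodCalls]);
  },
  (rows) => {
    latestRows = rows;
  }
);

watch(3); // count grew from 0 to 3, so the method is called
watch(3); // unchanged, no call
watch(5); // grew again, second call
```

Only a single number is kept reactive here; the documents themselves travel over the method call.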

But in general Meteor’s pub-sub is an interesting idea that works for proofs of concept / MVPs, but in my experience it is not scalable at all for real production apps.

veered · May 18, 2018, 4:34pm

I don’t think there’s anything fundamentally wrong with pub/sub, it’s just that the diagnostic tools don’t make it super easy to figure out what’s going on.

smeijer · May 19, 2018, 8:12am

So the question is: does the query get slow because there is 100% CPU usage, or does the CPU rise to 100% due to this observer?

These issues are more common lately:

github.com/meteor/meteor

[1.6.1] 100% CPU on Ubuntu server, caused by client?

opened 08:09PM - 04 Apr 18 UTC

KoenLav

confirmed Project:Server Performance

We are experiencing consistent 100% CPU usage on one of our Meteor deployments. …This issue seems to have appeared out of nowhere (not after a of a new version).

What we already tried: create a new server with the same specs and deploy to that server. When we switch over our DNS to the new server at first all is well; but (we think) when a particular client connects the Node process starts using 100% CPU.

We're deploying Meteor to a Ubuntu host using MUP (which instantiates a Docker container consisting of the meteord base image and our bundle). The image has NodeJS 8.9.4 and NPM 5.6.0.

Any pointers as to how to pinpoint this issue would be greatly appreciated!

We believe this is the most interesting portion of the V8 profiler we ran on the logs (https://paste2.org/zPsHbDya):

```
ticks parent name
2420054 92.5% /opt/nodejs/bin/node
219101 9.1% /opt/nodejs/bin/node
165262 75.4% Builtin: ArrayMap
57108 34.6% LazyCompile: *changeField /bundle/bundle/programs/server/packages/ddp-server.js:287:25
57067 99.9% LazyCompile: *added /bundle/bundle/programs/server/packages/mongo.js:3650:23
57042 100.0% LazyCompile: *<anonymous> packages/meteor.js:1231:19
48280 29.2% LazyCompile: *EJSON.clone.v /bundle/bundle/programs/server/packages/ejson.js:646:15
47783 99.0% Function: ~Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:697:26
47783 100.0% Builtin: ArrayForEach
23827 14.4% LazyCompile: *Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:320:28
23616 99.1% LazyCompile: *added /bundle/bundle/programs/server/packages/mongo.js:3650:23
23616 100.0% LazyCompile: *<anonymous> packages/meteor.js:1231:19
22206 13.4% LazyCompile: *Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:697:26
22205 100.0% Builtin: ArrayForEach
15641 70.4% LazyCompile: *_sendAdds /bundle/bundle/programs/server/packages/mongo.js:1913:23
3405 15.3% LazyCompile: *EJSON.clone.v /bundle/bundle/programs/server/packages/ejson.js:646:15
2442 11.0% LazyCompile: *<anonymous> /bundle/bundle/programs/server/packages/mongo.js:1782:34
6566 4.0% Builtin: ArrayForEach
4420 67.3% LazyCompile: *_sendAdds /bundle/bundle/programs/server/packages/mongo.js:1913:23
4420 100.0% Function: ~<anonymous> /bundle/bundle/programs/server/packages/mongo.js:1782:34
2082 31.7% LazyCompile: *<anonymous> /bundle/bundle/programs/server/packages/mongo.js:1782:34
2082 100.0% LazyCompile: *<anonymous> packages/meteor.js:1231:19
3827 2.3% RegExp: ([!#\\$%&'\\*\\+\\-\\.\\^_`\\|~0-9a-z]+)(?:=(?:([!#\\$%&'\\*\\+\\-\\.\\^_`\\|~0-9a-z]+)|"((?:\\\\[\\x00-\\x7f]|[^\\x00-\\x08\\x0a-\\x1f\\x7f"])*)"))? {9}
3812 99.6% Function: ~Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:697:26
3812 100.0% Builtin: ArrayForEach
25601 11.7% LazyCompile: *ObserveHandle.stop /bundle/bundle/programs/server/packages/mongo.js:1955:41
21756 85.0% Function: ~<anonymous> /bundle/bundle/programs/server/packages/mongo.js:3663:25
21756 100.0% LazyCompile: *<anonymous> packages/meteor.js:1231:19
19776 90.9% Function: ~<anonymous> /bundle/bundle/programs/server/packages/ddp-server.js:1298:32
1978 9.1% LazyCompile: *<anonymous> /bundle/bundle/programs/server/packages/ddp-server.js:1298:32
2805 11.0% LazyCompile: *baseUniq /bundle/bundle/programs/server/npm/node_modules/lodash.union/index.js:742:18
2805 100.0% LazyCompile: *<anonymous> packages/meteor.js:1231:19
2805 100.0% Function: ~<anonymous> /bundle/bundle/programs/server/packages/ddp-server.js:1298:32
1040 4.1% LazyCompile: *<anonymous> /bundle/bundle/programs/server/packages/mongo.js:3663:25
1040 100.0% LazyCompile: *<anonymous> packages/meteor.js:1231:19
535 51.4% Function: ~<anonymous> /bundle/bundle/programs/server/packages/ddp-server.js:1298:32
505 48.6% LazyCompile: *<anonymous> /bundle/bundle/programs/server/packages/ddp-server.js:1298:32
2944 1.3% LazyCompile: *v.map.value /bundle/bundle/programs/server/packages/ejson.js:678:18
2607 88.6% Function: ~Socket._writeGeneric net.js:708:42
2607 100.0% LazyCompile: *Socket._write net.js:785:35
1686 64.7% LazyCompile: *ondata internal/streams/legacy.js:14:18
495 19.0% LazyCompile: *Connection.write /bundle/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/connection/connection.js:527:38
419 16.1% LazyCompile: *clearBuffer _stream_writable.js:469:21
179 6.1% Builtin: ArrayMap
57 31.8% LazyCompile: *EJSON.clone.v /bundle/bundle/programs/server/packages/ejson.js:646:15
52 91.2% Function: ~Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:697:26
4 7.0% Function: ~<anonymous> /bundle/bundle/programs/server/packages/mongo.js:1879:36
1 1.8% Function: ~changeField /bundle/bundle/programs/server/packages/ddp-server.js:287:25
55 30.7% LazyCompile: *changeField /bundle/bundle/programs/server/packages/ddp-server.js:287:25
55 100.0% LazyCompile: *added /bundle/bundle/programs/server/packages/mongo.js:3650:23
31 17.3% LazyCompile: *Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:697:26
31 100.0% Builtin: ArrayForEach
20 11.2% LazyCompile: *Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:320:28
20 100.0% LazyCompile: *added /bundle/bundle/programs/server/packages/mongo.js:3650:23
5 2.8% Builtin: ArrayForEach
5 100.0% LazyCompile: *_sendAdds /bundle/bundle/programs/server/packages/mongo.js:1913:23
4 2.2% RegExp: ([!#\\$%&'\\*\\+\\-\\.\\^_`\\|~0-9a-z]+)(?:=(?:([!#\\$%&'\\*\\+\\-\\.\\^_`\\|~0-9a-z]+)|"((?:\\\\[\\x00-\\x7f]|[^\\x00-\\x08\\x0a-\\x1f\\x7f"])*)"))? {9}
4 100.0% Function: ~Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:697:26
4 2.2% LazyCompile: *<anonymous> /bundle/bundle/programs/server/packages/mongo.js:1879:36
4 100.0% Function: ~SQp._run packages/meteor.js:851:21
3 1.7% LazyCompile: *callback zlib.js:447:20
3 100.0% Builtin: ArrayForEach
60 2.0% Function: ~<anonymous> /bundle/bundle/programs/server/packages/ddp-server.js:740:22
```

github.com/meteor/meteor

Event loop is blocked while unsubscribing

opened 11:02AM - 20 Apr 18 UTC

closed 06:27PM - 26 Feb 20 UTC

koszta

Project:DDP

While during subscription everything is okay even if you have a big publication,… it might take a while, but the data comes gradually, nothing is getting blocked in the meteor server.

The problem comes when you are unsubscribing and it's removing the added documents. In this case 2000 documents with just 1 dummy field can already block the event loop for 200ms+! Which is even worse with bigger publications. During this everything is blocked (no subscriptions receiving updates, no methods are running, no HTML or static files getting served, no API calls getting processed).

We noticed this problem due to the Kubernetes liveness probes constantly killing the meteor containers.

Reproducible with Meteor version 1.6.1.1 on Linux and MacOS as well.

Created a reproduction repo: https://github.com/fuww/blocked-unsub See the README for instructions.

```
I20180420-12:27:19.932(2)? [BLOCKED] Blocked for 226.95668499946595ms, operation started here:
I20180420-12:27:19.932(2)? at Zlib.emitInitNative (internal/async_hooks.js:133:43),
at InflateRaw.Zlib (zlib.js:231:18),
at new InflateRaw (zlib.js:576:8),
at Object.InflateRaw (zlib.js:575:12),
at ServerSession.Session._getInflate (/Users/peterpalkoszta/.meteor/packages/ddp-server/.2.1.2.1xtgy5z.fui7++os+web.browser+web.cordova/npm/node_modules/permessage-deflate/lib/session.js:130:31),
at ServerSession.Session.processIncomingMessage (/Users/peterpalkoszta/.meteor/packages/ddp-server/.2.1.2.1xtgy5z.fui7++os+web.browser+web.cordova/npm/node_modules/permessage-deflate/lib/session.js:26:22),
at Functor.call (/Users/peterpalkoszta/.meteor/packages/ddp-server/.2.1.2.1xtgy5z.fui7++os+web.browser+web.cordova/npm/node_modules/websocket-extensions/lib/pipeline/functor.js:46:32),
at Cell._exec (/Users/peterpalkoszta/.meteor/packages/ddp-server/.2.1.2.1xtgy5z.fui7++os+web.browser+web.cordova/npm/node_modules/websocket-extensions/lib/pipeline/cell.js:36:29),
at Cell.incoming (/Users/peterpalkoszta/.meteor/packages/ddp-server/.2.1.2.1xtgy5z.fui7++os+web.browser+web.cordova/npm/node_modules/websocket-extensions/lib/pipeline/cell.js:22:8),
at pipe (/Users/peterpalkoszta/.meteor/packages/ddp-server/.2.1.2.1xtgy5z.fui7++os+web.browser+web.cordova/npm/node_modules/websocket-extensions/lib/pipeline/index.js:39:28),
at Pipeline._loop (/Users/peterpalkoszta/.meteor/packages/ddp-server/.2.1.2.1xtgy5z.fui7++os+web.browser+web.cordova/npm/node_modules/websocket-extensions/lib/pipeline/index.js:44:3),
at Pipeline.processIncomingMessage (/Users/peterpalkoszta/.meteor/packages/ddp-server/.2.1.2.1xtgy5z.fui7++os+web.browser+web.cordova/npm/node_modules/websocket-extensions/lib/pipeline/index.js:13:8),
at Extensions.processIncomingMessage (/Users/peterpalkoszta/.meteor/packages/ddp-server/.2.1.2.1xtgy5z.fui7++os+web.browser+web.cordova/npm/node_modules/websocket-extensions/lib/websocket_extensions.js:133:20),
at Hybi._emitMessage (/Users/peterpalkoszta/.meteor/packages/ddp-server/.2.1.2.1xtgy5z.fui7++os+web.browser+web.cordova/npm/node_modules/websocket-driver/lib/websocket/driver/hybi.js:445:22),
at Hybi._emitFrame (/Users/peterpalkoszta/.meteor/packages/ddp-server/.2.1.2.1xtgy5z.fui7++os+web.browser+web.cordova/npm/node_modules/websocket-driver/lib/websocket/driver/hybi.js:405:19),
at Hybi.parse (/Users/peterpalkoszta/.meteor/packages/ddp-server/.2.1.2.1xtgy5z.fui7++os+web.browser+web.cordova/npm/node_modules/websocket-driver/lib/websocket/driver/hybi.js:141:18),
at IO.write (/Users/peterpalkoszta/.meteor/packages/ddp-server/.2.1.2.1xtgy5z.fui7++os+web.browser+web.cordova/npm/node_modules/websocket-driver/lib/websocket/streams.js:80:16),
at Socket.ondata (_stream_readable.js:639:20),
at emitOne (events.js:121:20),
at Socket.emit (events.js:211:7),
at addChunk (_stream_readable.js:263:12),
at readableAddChunk (_stream_readable.js:250:11),
at Socket.Readable.push (_stream_readable.js:208:10),
at TCP.onread (net.js:607:20)
```

luistensai · May 21, 2018, 4:36pm

I’ve been observing Kadira a bit better lately and fixed some issues, mostly observer reuse ones, that made the app behave a bit better, but issues still continue.
I’ve noticed the cpu spikes appear when we have lots of oplog notifications. EG:

What does this mean? It looks like there are updates modifying a lot of objects at the same time and affecting all the subscriptions. Is there any way I can find where this is being done?
Thanks everyone for helping!

nathan_muir · May 21, 2018, 11:04pm

@luistensai Yep, the issue is here in the OplogDriver

When documents are updated in MongoDB, the driver must decide whether it can apply the update directly from the oplog, or whether it needs to re-fetch the updated document.

The issue arises when an update causes Meteor to re-fetch thousands of documents simultaneously (rather than serially or in batches), which causes a huge spike in fibers and CPU usage.

When you call collection.update( ... ), there are two main cases that decide whether a document “needs to be fetched”:

1. Minimongo could not apply the modifier (complex/unsupported updates).
2. The update was on fields that you were publishing by.

eg.

// given a few thousand published documents
// (Array.from is used because forEach skips the empty slots of new Array(2000))
Array.from({ length: 2000 }, (_, n) => collection.insert({ field1: 0, n }));
Meteor.publish(null, () => collection.find({ field1: { $lte: 1 } }));

// If there are subscribers, every document affected by this update will need to be re-fetched
collection.update({}, { $set: { field1: 10 } }, { multi: true });

Unfortunately the latter one is harder to avoid, eg, if you’re publishing archived posts to user A, and then user B archives many thousands of posts, it will trigger this issue.
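
To make the second case a bit more concrete, here is a simplified sketch, not Meteor's actual oplog driver code, of checking whether an update modifier writes to a field that a publication's selector filters on. The function names are hypothetical, only plain field selectors are handled (no `$or`/`$and`), and paths are compared by their top-level field, which is deliberately conservative.

```javascript
// Simplified illustration only; this is NOT how Meteor implements the check.
// Returns the top-level field names that a modifier such as { $set: {...} } writes to.
function fieldsTouchedByModifier(modifier) {
  const touched = new Set();
  for (const operands of Object.values(modifier)) {
    for (const path of Object.keys(operands)) {
      touched.add(path.split('.')[0]);
    }
  }
  return touched;
}

// Hypothetical check: does the update write to a field the publication selects on?
function updateTouchesSelector(selector, modifier) {
  const touched = fieldsTouchedByModifier(modifier);
  return Object.keys(selector).some((path) => touched.has(path.split('.')[0]));
}

const publishedSelector = { field1: { $lte: 1 } };

// Same shape as the update in the example above: it changes a selected field.
const risky = updateTouchesSelector(publishedSelector, { $set: { field1: 10 } });

// An update to an unrelated field does not touch the selector.
const harmless = updateTouchesSelector(publishedSelector, { $set: { title: 'x' } });
```

Running a check like this over the multi-updates in a codebase is one way to spot which ones overlap with published selectors.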

luistensai · May 22, 2018, 12:34pm

So, I think we can probably avoid this in our app, as we’re currently not publishing more than a hundred records to each user, and we only have around 200 concurrent users. Many users belong to the same group and share same subset of records in the publish/subscribe. Users can only update documents one by one, but our rest api might be updating several documents at once.

My question is: every time I update records in a collection, will it generate an oplog notification for each updated record on the active subscriptions? Will those notifications go only to the subscriptions whose queries match the updated object?

I really doubt we need to update 10M documents at the same time (please check the oplog notifications chart I’ve pasted), so the bug must be in our code, generating some sort of chain reaction of updates and collapsing our servers…

edit: Our active subscriptions count and subscription rates look constant all the way, we only experience cpu problems when we receive a LOT of oplog update notifications.

thanks @nathan_muir for the explanation!

hluz · May 23, 2018, 2:11am

That’s why I asked this above …

luistensai · May 23, 2018, 2:24pm

Yes, sorry, the problem is we don’t actually know whether a bulk update is being done, but I’m nearly 99% sure there is one. I say this because our oplog spiked from 6k operations to 600k operations over windows of the same length at different times.
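
One way to narrow this down is to tally the oplog entries from a spike window by collection and operation type. The sketch below assumes the entries have already been read from the replica set's `local.oplog.rs` collection for that window and that each carries the usual `ns` and `op` fields; `countOpsByNamespace` is a hypothetical helper and the sample entries are made up.

```javascript
// Tally oplog-style entries by namespace and operation type.
// Each entry is assumed to carry `ns` ("db.collection") and `op` ("i", "u", "d", ...).
function countOpsByNamespace(entries) {
  const counts = new Map();
  for (const { ns, op } of entries) {
    const key = `${ns} ${op}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // Highest counts first, so the noisiest collection is at the top.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Made-up sample standing in for entries read during a spike.
const sampleEntries = [
  { ns: 'app.orders', op: 'u' },
  { ns: 'app.orders', op: 'u' },
  { ns: 'app.users', op: 'i' },
];
const ranked = countOpsByNamespace(sampleEntries);
```

Whichever collection dominates the ranking is the place to start looking for the code path that issues the updates.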

Thanks for all your help!

doctorpangloss · May 25, 2018, 6:13pm

I’m a little late to this conversation, but as a tip, your problem is probably a method that does something of the form:

Collection1.find().forEach(function(doc1) {
  Collection2.update({joinedField: doc1.field}, ...);
});

or

Collection1.find().forEach(function(doc1) {
  Collection1.update({_id: {$in: doc1.arrayField}}, ...);
});

or, most likely from the numbers you’re seeing in your oplog:

Collection1.find().forEach(function(doc1) {
  for (var i = 0; i < doc1.fieldB; i++) {
    Collection2.update({joinedField: doc1.field}, ...);
  }
});

Think about how you can restructure these to use an update with {multi:true}. Start by investigating your slowest methods in Kadira. If you can’t do a multi-update, make sure when you do these joins, you don’t accidentally replace the document; or, like in the third example, quadratically increase the number of documents you modify.
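
As a sketch of that restructuring for the first pattern: collect the distinct join values once and issue a single update with `{multi: true}`. `updateJoinedDocs` and the stub collection are hypothetical; `target` is assumed to expose a Meteor-style `update(selector, modifier, options)`, and the modifier must use operators such as `$set` so documents are not replaced.

```javascript
// Hypothetical restructuring: one multi-update instead of one update per source document.
function updateJoinedDocs(sourceDocs, target, modifier) {
  // Collect the distinct join values once.
  const values = [...new Set(sourceDocs.map((doc) => doc.field))];
  if (values.length === 0) return 0;
  // A single call; with { multi: true } it applies to every matching document.
  return target.update({ joinedField: { $in: values } }, modifier, { multi: true });
}

// Stand-in collection that records calls, for illustration only.
const updateCalls = [];
const stubTarget = {
  update(selector, modifier, options) {
    updateCalls.push({ selector, modifier, options });
    return 0; // a real server-side collection would return the number of affected documents
  },
};

updateJoinedDocs(
  [{ field: 'a' }, { field: 'b' }, { field: 'a' }],
  stubTarget,
  { $set: { flagged: true } }
);
```

Three source documents produce one update call here, with the duplicate join value `'a'` collapsed.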

This is based on the reading of your oplog notifications, which generate a huge number of updates. Naturally, a publish and subscribe, as long as it’s not radically buggy, won’t update your database for you.

My bet is, actually, in your client code, you have a Collection.update that gets called, eventually, by rendering code, and coincidentally it is doing an extremely slow and noisy join across hundreds of concurrent users for you.