Meteor APM, why is fetch so long?


#1

Hello,

I have been using Meteor with Galaxy for a very long time, and I always add APM to get an overview of my app.
This is working great so far.

Looking into the details, I can see a find on a single user, followed by a fetch, that is taking ages.

I have no idea what could make fetching a single user by _id take so long.

Have you experienced this kind of problem? Do you have any idea why it is taking so long?

Thanks,

PS: If you are wondering why there is a fetch before every other operation on a collection, it is because I am using a multitenancy package in my app.


#2

It could be that you don’t have an index on the _groupId field in the collections that work in a multi-tenancy fashion, so those fetches will be slow.

If you are using SimpleSchema to define your user schema (or any other collection that has the _groupId field), you could try this to add an index:

_groupId: {
  index: 1,
  type: String
},

Otherwise, you will have to use something like:

Meteor.users._ensureIndex({_groupId: 1});
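If you go that route, here is a minimal sketch of where the index creation could live (server-only code; the file name is just an example, and createIndex via rawCollection is the driver-level equivalent of _ensureIndex):

// server/indexes.js — run once at startup
Meteor.startup(() => {
  Meteor.users.rawCollection().createIndex({ _groupId: 1 });
});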

#3

Thanks @tcastelli

I already have this index on all my collections, and the query is by _id, which is indexed by default. Or maybe I am wrong?

What I do not get is the difference between find and fetch.
My guess is that it is not a database problem, because mLab reports all the slow queries I have and this one is not among them.

Here is the code of the publication:

Meteor.publish("validationsForStudent", function publish(
  studentId,
  skillId,
  blockId
) {
  check(studentId, String);
  check(skillId, String);
  check(blockId, String);
  this.unblock();
  if (!this.userId) {
    return this.ready();
  }
  return [
    Validations.find({ studentId, skillId, blockId }),
    Skills.find({ _id: skillId }),
    Blocks.find({ _id: blockId })
  ];
});

After pasting this, I realized that I do not need to run through the multitenancy hook in this case; I could write:

Meteor.publish("validationsForStudent", function publish(
  studentId,
  skillId,
  blockId
) {
  check(studentId, String);
  check(skillId, String);
  check(blockId, String);
  this.unblock();
  if (!this.userId) {
    return this.ready();
  }
  return [
    Validations.direct.find({ studentId, skillId, blockId }),
    Skills.direct.find({ _id: skillId }),
    Blocks.direct.find({ _id: blockId })
  ];
});

because there is no ambiguity in these finds; the hook is only relevant when I write

Collections.find({})

This has no link to my initial question, but it might be of interest to some people.

But I still do not get why I am doing a fetch in a publication that is supposed to return a cursor.


#4

Yes, you are right, I didn’t pay attention to that. Since the query is by _id, and _id always has an index, it should always be fast. Fetch time, however, is determined by the size of the document(s) fetched and the connection to the database. I also guess that if the fields can be served from memory (covered by an index), it should be faster than fetching from disk.

Since both things are covered, I’m not sure what is making your fetch slow (the document size is small), so maybe it is related to connection issues at that moment in time.

Have you tried fetching that document again manually and comparing times, to see if the time is always the same?
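For example, something like this from meteor shell (the user id is a placeholder):

// rough manual timing of the same lookup, outside of any publication
var start = Date.now();
Meteor.users.findOne("THE_SLOW_USER_ID");
console.log(Date.now() - start, "ms");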


#5

No, the time is not always the same, and I can tell that by looking at the image above: each time I run a find on a collection, I first find and fetch on the users collection.
And the times are different:

  • 1001 ms
  • 100 ms
  • 97 ms

That is very strange.

Any explanation?


#6

Again here:

It spends 3291 ms out of 3993 ms just fetching the same user several times.


#7

If your server has high CPU usage then Kadira will report incorrect numbers on query response time. Kadira just measures the time from (db query start) -> (db response processed). If the CPU is high, then it might take a while for the db response to be properly processed, even if the db responds quickly.


#8

I see similar things on my server with low CPU, so I don’t think CPU is the issue.


#9

I know about this caveat.

I thought the db part was counted in the find part of the graph.
I still have difficulties understanding the difference between find and fetch. Can you help me with that?


#10

The find portion doesn’t actually contact the db; it just constructs the db cursor, so find should never take much time. All the time will be in fetch.
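In other words, roughly (server-side sketch, someUserId is a placeholder):

const cursor = Meteor.users.find({ _id: someUserId }); // "find": only builds the cursor, no db round trip yet
const docs = cursor.fetch();                            // "fetch": actually runs the query and pulls the documents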

Is your server under high load while these subscription traces are recorded? That is pretty much the only thing I can think of that would explain the high latency and high variance in latency of the queries.


#11

Either the user document is very large, you did not enable Oplog, or there’s something missing from the code you shared.

A possible error is that a single client has subscribed to the users collection many times (even just a few dozen times, e.g. one subscription per “row” by accident on your client). This inadvertently causes a quadratic increase in the time to process a simple lookup, because each subscription returning results has to be merged with the other results on that collection, which may require iterating through all of that user’s results.
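For illustration, the accidental per-row version of that mistake could look like this (hypothetical Blaze template, reusing the subscription name from above):

Template.studentRow.onCreated(function () {
  // One subscription per rendered row: N rows means N overlapping subscriptions,
  // and every change has to be merged across all of them.
  this.subscribe('validationsForStudent', this.data.studentId, this.data.skillId, this.data.blockId);
});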


#12

“Is your server under high load while these subscription traces are recorded?”

I would say no, but that is from looking at the Kadira CPU graph, which I have read is not accurate in a Galaxy environment, so it is quite hard to be certain about that.
When I look at the Galaxy graph, it is also difficult to tell, because only the “last 5 minutes” graph is precise; otherwise, values are interpolated and it is hard to spot peaks.

What I am beginning to question is that I unblock all my subs and my methods, so there might be a CPU spike which cannot be seen from Kadira.

Do you think it is a good idea to unblock everything? My goal is to decrease “wait time” by doing that.
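To be concrete, by “unblock everything” I mean doing this in every method and publication (hypothetical method name):

Meteor.methods({
  touchValidation(validationId) {
    check(validationId, String);
    this.unblock(); // do not make the next call from this client wait for this one
    // ... the actual work ...
  }
});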


#13

I have Oplog enabled, and the users collection is the original one; I just added a roles field and a _groupId field to handle multitenancy.

Nevertheless, thanks for reminding me about the potential quadratic issue.

So far my conclusion has been that I have CPU spikes, maybe due to too much unblocking (methods and subs). It is hard to tell, because Kadira does not register them and the Galaxy graphs are not helpful there either.


#14

I have started seeing the same thing on my end. User find+fetch taking 2000ms, everything slowing down.

I can confirm that it came from our server having unexpected CPU spikes and high RAM usage. Here’s a screenshot from Kadira, with a manual restart at ~10:00 that clearly resets RAM and CPU usage.


#15

I think you are looking in the wrong place. If I am reading your screenshot correctly, you did 3 find/fetches for the same user ID: 2 were slow (>1 s), 1 was 128 ms.

Who is your DB host? That is the place I think you should be looking.

Second point, and equally important: why are you doing 3 queries for the same document?! Perf rule #2: don’t do anything twice that you can do once. (Perf rule #1: don’t do anything once that you don’t have to.)

The source code you showed is not the same as the trace, so I have to speculate here. If you are doing Meteor.user(), that is bad. If you are, do it once… and store it in a local var.

Better, do Meteor.users.find({_id:…},{fields:{}}) to limit your payload.
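Something along these lines, for example (the field list is just an illustration):

// in a method or publication: fetch the user once, keep it in a local var,
// and only ask for the fields you actually need
const user = Meteor.users.findOne(this.userId, {
  fields: { roles: 1, _groupId: 1 }
});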


#16

Hey @oemig, thanks for your answer. I tried to explain the multiple user fetches in my first post:

I have realized that in most cases I do not need to pass through the hooks of this package.

In terms of database, I am using an mLab shared plan in the same region as the Galaxy server. What metrics should I look at to understand if the bottleneck is on mLab’s side?
mLab uses Telemetry, and I am not very familiar with it.

Thanks again for taking the time to answer


#17

@sabativi, did you make any progress on this issue?

I have the same problem here. The doc size is just “27” and I am only fetching a single field (_id).
I don’t know what to do :frowning:

And my app is using the partitioner package too.
The collection has all the indexes it needs.


#18

Hey,

Yes and no.

Yes, because in 95% of cases I was able to bypass the find hook defined inside the package by just calling Collection.direct.find.

No, because I still have the 5% of use cases where this happens.

The next step for me is to try userCache.
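In the meantime, here is a hand-rolled sketch of the same idea (this is not the userCache package API, just an illustration of caching the user lookup for a short time on the server; field names are examples):

const cachedUsers = new Map();

function getCachedUser(userId) {
  if (!cachedUsers.has(userId)) {
    cachedUsers.set(userId, Meteor.users.findOne(userId, { fields: { roles: 1, _groupId: 1 } }));
    // drop the entry shortly afterwards so role/group changes are picked up again
    Meteor.setTimeout(() => cachedUsers.delete(userId), 5000);
  }
  return cachedUsers.get(userId);
}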