Is there a maximum number/size of documents that a Meteor publication can handle?

My app has a kind of “forum” feature and a “user” publication to support fetching data about the members of a forum. There is one very large forum with more than 3,000 participants. When a client tries to subscribe to all 3,000+ ids using an $in: [ids] query, the server goes nuts: the CPU pegs at 100% and eventually the instance is killed and replaced, only for the client to kill the new instance again.

I use the same publication with documents numbering in the hundreds and there is hardly any CPU load, so it seems that past a certain point Meteor chokes for one reason or another.

Is there an upper limit on the number or size of documents a Meteor publication can handle?

It’s a function of the amount of data going over the wire, not so much Meteor itself.

I usually handle really large numbers of rows with continuous paging: show the first 50, then provide a button at the bottom to load 50 more at a time, and so on (see the sketch below).

Check out - https://www.discovermeteor.com/blog/template-level-subscriptions/

Someone here might have a better example, but that link is where I got started with this concept and with Meteor.
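
To make the idea concrete, here is a rough sketch of a paginated publication driven by a template-level subscription. The collection, publication name, and field names are assumptions for illustration, not code from this thread:

    // server (names are hypothetical)
    import { Meteor } from 'meteor/meteor';
    import { check } from 'meteor/check';
    import { Forums } from '/imports/api/forums'; // hypothetical collection holding memberIds

    Meteor.publish('forumMembers.page', function (forumId, limit) {
      check(forumId, String);
      check(limit, Number);
      const forum = Forums.findOne(forumId);
      if (!forum) {
        return this.ready();
      }
      // publish at most `limit` members (hard-capped at 500)
      return Meteor.users.find(
        { _id: { $in: forum.memberIds } },
        { limit: Math.min(limit, 500) }
      );
    });

    // client: template-level subscription that grows the limit on "load more"
    import { Template } from 'meteor/templating';
    import { ReactiveVar } from 'meteor/reactive-var';

    Template.forumMembers.onCreated(function () {
      this.limit = new ReactiveVar(50);
      this.autorun(() => {
        // re-runs whenever the limit (or the template's data context) changes
        this.subscribe('forumMembers.page', Template.currentData().forumId, this.limit.get());
      });
    });

    Template.forumMembers.events({
      'click .js-load-more'(event, instance) {
        instance.limit.set(instance.limit.get() + 50);
      },
    });

Because the subscription is tied to the template instance, it is cleaned up automatically when the template is destroyed, and each click simply re-runs the subscription with a bigger limit.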

Sure, I am just going to paginate to solve this issue, but I find it curious that at a certain point Meteor’s behavior changes and it simply hangs. This publication works for hundreds and even a couple of thousand docs, but once 3k is passed Meteor hangs at 100% CPU and never finishes the publication.

1 Like

I think the issue here is most likely to be around how Meteor handles publications internally.

Basically, the server retains a copy of each connected client’s published documents in memory. This enables merge-box functionality and allows Meteor to be stateful as far as client connections are concerned. The upshot is that the bigger the publication and the greater the number of connected clients, the more memory the server needs.

There are several ways to mitigate this, including:

  • Reduce the number of documents published.
  • Scale-up - use a bigger server or instance, and/or:
  • Tune NodeJS’s garbage collector - its default configuration is a "one size fits all" and not ideal for every workload, and/or:
  • Scale-out - spread your connected clients over multiple NodeJS processes. You could use several processes on the same server if threads are available. Each process needs to be on a different port (you can use e.g. nginx to balance these) and you will need some form of sticky sessions, and/or:
  • Go stateless - use REST or Meteor methods. These do not impact server memory, but may impact CPU (see the sketch after this list).
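
As a rough sketch of that last option, with hypothetical method, collection, and field names (this is not anyone’s actual code from the thread), fetching member data through a method keeps it out of the merge box entirely:

    // server: stateless fetch - nothing is retained per client in server memory
    import { Meteor } from 'meteor/meteor';
    import { check } from 'meteor/check';

    Meteor.methods({
      'forumMembers.fetch'(memberIds, skip = 0) {
        check(memberIds, [String]);
        check(skip, Number);
        return Meteor.users.find(
          { _id: { $in: memberIds } },
          { skip, limit: 100, fields: { username: 1 } }
        ).fetch();
      },
    });

    // client: plain request/response over DDP, no reactivity
    const memberIds = ['user-id-1', 'user-id-2']; // hypothetical ids
    Meteor.call('forumMembers.fetch', memberIds, 0, (err, members) => {
      if (!err) {
        console.log('got', members.length, 'members');
      }
    });

The trade-off is that the result is not reactive; the client has to call the method again whenever it wants fresh data.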
4 Likes

The publication system doesn’t work well with a very large number of docs, and scaling your server will only temporarily mitigate the problem. Is there a good reason to subscribe to every single user in the first place? I cannot think of a good reason for this, so I’m curious.

If you only want to get the number of users in a forum, for example, you’d be much better off just calling a Meteor method that does the counting. Refresh it with polling if you really need it to be kept up to date.
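
A minimal sketch of that pattern, with hypothetical method and field names (it assumes user documents carry a forumIds array):

    // server: count on demand instead of publishing every user document
    import { Meteor } from 'meteor/meteor';
    import { check } from 'meteor/check';

    Meteor.methods({
      'forum.memberCount'(forumId) {
        check(forumId, String);
        return Meteor.users.find({ forumIds: forumId }).count();
      },
    });

    // client: poll the count every 30 seconds and keep it in a ReactiveVar
    import { ReactiveVar } from 'meteor/reactive-var';

    const memberCount = new ReactiveVar(0);

    const refreshCount = (forumId) => {
      Meteor.call('forum.memberCount', forumId, (err, count) => {
        if (!err) memberCount.set(count);
      });
    };

    refreshCount('some-forum-id'); // hypothetical id
    Meteor.setInterval(() => refreshCount('some-forum-id'), 30000);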

1 Like

As an example for your first question: in my case I had an admin portal and wanted to view all users in the database. Adding pagination to my pub/sub worked for me.

Another example is an admin statistics page; in that case your second point is what I went with, and a method was the best solution for my case.

There is no reason to subscribe to all of those users in the first place; it was just a bit of unoptimized code, and I know I can work around it. I am just curious about the change in behavior that seems to take place when publishing a large number of docs; perhaps it is a bug in Meteor? My instances are running on Kubernetes pods configured with up to 2 GB RAM and 2 vCPUs each, with --max_old_space_size=2048 on node, so there should be plenty of resources available for larger publications. The behavior I’m seeing, however, is that even with only one client connected, if that client subscribes to this large publication Meteor seems to go into an infinite loop and stops responding to requests.

2GB is not that much, and 2 vCPUs won’t help unless you’re running scale-out. Have you profiled the memory usage? You may find that you start paging memory to and from disk when you get close to the 2GB limit. That will seriously hit the CPU and will likely crash the node process at some point.

1 Like

Memory usage is below 1 GB when the hang happens; 2 GB is the lower allocation limit and my pods have no upper limit. My Meteor pods are behind an nginx reverse proxy with sticky sessions on Kubernetes. With a newly started pod, if a single client connects and requests that subscription, it hangs at 100% CPU with no memory pressure.

Do you see the same in development (using the meteor run command on your laptop)?

For sure there’s a hard limit somewhere before you approach the halting probability, defined by the Chaitin/Kolmogorov constant :)

I believe there are some open issues and pull requests that might shed some light on causes and potential future fixes, so I’ll post links to those.

This next one is closed, but I think it could possibly be relevant.

3 Likes

There’s no inherent or architectural reason this issue should occur. Based on a reading of the issues that @copleykj helpfully found, there could be regressions in the way fibers interacts with newer versions of Node. But that’s probably a red herring.

To answer your specific question @imagio, the number of documents is probably a red herring too. What’s likely happening is that somewhere between 2,000 and 3,000 documents, you retrieve a user document that is:

  • Extremely large: it will appear to be at 100% CPU because of the way the instrumentation works, but it’s actually transferring multiple megabytes of data over a network connection with surprisingly low bandwidth (on the order of 50 kB/s). Or maybe the document has a link to a forum image or something like that, which you “helpfully” download on the server, and which happens to be a huge file. Or you’re doing a “join”-like operation and accidentally retrieving way more data than you think you are.
  • Not valid: you use something like SimpleSchema or another validator, which may accidentally end up in an infinite loop trying to validate a malformed document. That stuff can be really glitchy sometimes.

Paging won’t fix this. Try downloading your production database (using mongodump and mongorestore, perhaps using Mongo’s Compass application to set up the SSH tunnel to your protected DB for you) and inspecting it.
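
If you want to hunt for an oversized document in the restored copy, a quick pass in the mongo shell works; the collection name and the 1 MB threshold here are just assumptions for illustration:

    // mongo shell: print user documents larger than ~1 MB
    db.users.find().forEach(function (doc) {
      var size = Object.bsonsize(doc);
      if (size > 1024 * 1024) {
        print(doc._id + ' is ' + size + ' bytes');
      }
    });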

From an application architecture point of view, if your clients need all the user documents (which naturally they don’t), create a single document in your database containing everything about all the users a client needs. If your reaction is, “That would be a huge document!”, well, then your original architecture is going to be super slow anyway :) Even if you use paging! A simple rule of thumb is to keep as close a correspondence as possible between a Mongo document and the things the web client needs to see on screen. I’m otherwise not really able to comment on whether you’re using fields to filter out stuff you don’t need or whatever.

2 Likes

Not discounting the valid suggestions above, but since no one has mentioned it: don’t forget the basics like proper indexing. Do a query explain() and check whether it’s doing a full collection scan. NoSQL is fast(er) but still needs indexing. In Meteor you can do (in server startup code):

Forum.rawCollection().createIndex({ fieldYouAreQuerying: 1 }); // fieldYouAreQuerying is a placeholder; createIndex replaces the deprecated ensureIndex
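
And for the explain() check mentioned above, something along these lines in the mongo shell shows whether the winning plan is an index scan or a full collection scan (collection and field names are placeholders):

    // mongo shell: a COLLSCAN stage in the winning plan means a full collection scan
    var plan = db.forums.find({ fieldYouAreQuerying: 'some-value' }).explain('executionStats');
    printjson(plan.queryPlanner.winningPlan);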

If the DB is local (on the same CPU), then it could be your DB driver/engine maxing out rather than the Meteor app?

I have had pubs of more than ~3k without any issues. Not that it’s some magic number or anything.

Also, use the fields: {} query option to limit which fields you are syncing to the client, to save some bandwidth.
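
For instance (the publication and field names are assumptions, not from this thread):

    // only publish the fields the client actually needs
    import { Meteor } from 'meteor/meteor';
    import { check } from 'meteor/check';

    Meteor.publish('forumMembers.basic', function (memberIds) {
      check(memberIds, [String]);
      return Meteor.users.find(
        { _id: { $in: memberIds } },
        { fields: { username: 1, 'profile.avatarUrl': 1 } }
      );
    });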

1 Like

I have verified that I am not retrieving huge documents, and I have limited the publication to the ~25 fields I need. The database is on a separate server with ample bandwidth, I have proper indexes on my collections, and I am not doing any sort of join-like behavior, just a .find() with an $in clause of several thousand ids. After reading through those bug threads it seems very likely that I am running into the fiber spike and/or blocking unsubscribe issues.

That’s pretty crazy and pretty disappointing.

What’s the point of upgrading node if it constantly introduces regressions? What good is someone’s experience if it all gets foiled by totally arbitrary, random, nonsensical bugs?