Hypothetical long-running Fibers question

Before I begin, I know this can be handled in different ways, e.g., invoking an external script with a low CPU priority, spinning up a separate server, or even using AWS Lambda (or equivalent). I’m interested in this more from a hypothetical standpoint. I’m also aware that fibers don’t block the event loop.

Let’s say I have a Meteor method call which takes a “long” time to run, for example 35-45 seconds. The user needs to be told when the method completes (if they are still online), but it doesn’t really matter how long it takes: it could take 30 seconds uninterrupted, or it could be delayed by up to 5 minutes if required.

Could one use Fiber.yield/run to frequently pause and resume a fiber, and thus ensure responsiveness of other Meteor methods being executed on the same server?

A use case here would be a server running at ~50% CPU, still responsive, when a user kicks off a job that is not time critical but which, if left to run unchecked, would push the CPU to 95-100%, at which point all other users would notice their pages becoming less responsive. I ask because 99% of the time our CPU load is below 25%, but occasionally it spikes to around 60% when one of these jobs starts. 60% is fine, but if lots of people trigger jobs like this at the same time, it would cause trouble.

I’m thinking of a package that allows a Meteor method to “queue” up a non-time-sensitive job, with the server pausing and resuming jobs to maintain a target CPU load, e.g., above 60% pause all jobs, below 60% start resuming jobs. It could then also consider the user who is logged in (e.g., a paying user’s job would get paused last).

2 questions:

  1. Are there already packages which accomplish this?
  2. Am I worrying about nothing? How unresponsive does an app that is running at 99% CPU (i.e., not quite full capacity, but really close) become, in the case where all the CPU usage is coming from fibers?

I guess the obvious question is: couldn’t you offload the jobs to a separate server? I think Node.js in general is not designed to handle heavy load; it’s designed to take many requests via the event loop and dispatch them quickly.

From what I’ve seen, once the CPU spikes above 70% the app starts to hang, and if it stays like that for long it’ll crash since it can’t run other critical functions. If you have many users running many heavy jobs, then you need to add more machines.

Yes, we are running about 20 currently, and for the most part yes, we’ll look at running more servers - but as I said, this is more of a hypothetical. I’m intrigued whether fibers would allow this, because it would potentially allow for CPU-limiting individual users/groups of users.

I see, so I think you want to have custom application logic to conditionally execute/pause the job based on the available CPU cycles, and you’re wondering if you can control that using the Fiber API?

Exactly - I know you can yield a fiber (though I don’t know if you can do this from outside the running fiber, which would kill this idea outright), but I’m thinking of an interface like this:

LimitedTaskRunner.run(fn, cpuThresholdLimit)

or even

LimitedTaskRunner.run(fn, dontStartIfCpuIsAbove, pauseIfCpuIsAbove)

So we don’t even start a job if the CPU is highish (e.g., 50%), but if we do start it, we can stop it if things get too heavy (e.g., 70%). This might stop us from thrashing the CPU when we’re already close to our limits. One could even add some machine learning to something like this, to determine which functions (with which arguments) are likely to be heavy, so you can estimate how much a given call will impact current CPU usage.

Seems like you can’t yield from outside the fiber, so we’d probably have to implement a function like Fiber.maybeYield, which can be called at points where the function is allowed to be paused; it checks the CPU usage and either yields or continues. So this wouldn’t be fully generic, but I think any method that is heavy would have potential yield points, either in a loop or elsewhere in the code. It would, however, make it more difficult to reuse existing functionality.
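
For concreteness, here is a minimal sketch of what such a maybeYield could look like - everything here is hypothetical, using os.loadavg() (divided by core count) as a crude stand-in for “current CPU usage” and a made-up 60% default threshold, with the standard Fiber.current / setTimeout / Fiber.yield pattern to park and resume:

const Fiber = require('fibers');
const os = require('os');

// Hypothetical helper: call this at safe points inside a long-running fiber.
// While the 1-minute load average (divided by core count) is above the
// threshold, park the fiber and re-check every 500 ms; the event loop and
// other fibers keep running while this one is parked.
function maybeYield(threshold) {
  threshold = threshold || 0.6;
  const load = () => os.loadavg()[0] / os.cpus().length;
  while (load() > threshold) {
    const fiber = Fiber.current;
    setTimeout(() => fiber.run(), 500);
    Fiber.yield();
  }
}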

Yeah, well I admittedly don’t have much experience in that area, but I find the idea interesting, so just sharing thoughts.

It sounds like what is being asked here is a generic framework that manages job execution based on available resources (CPU, idle time patterns, among others) at run-time. That framework makes sense under tight resource constraints per VM, where you have jobs that can yield. I’m just wondering if the added complexity is worth the benefit. Is there a reason not to put those jobs on separate VMs?

Hmm, I’m not sure how much complexity it would cause (after the initial package implementation). For example, a function like this:

function myLongFunction() {
  MyCollection.find().forEach((item) => {
    // do something expensive
    LimitedTaskRunner.maybeYield();
  });
}

Meteor.methods({
  myLongFunction() {
    LimitedTaskRunner.run(myLongFunction, 60);
  }
});

I’m not overly familiar with custom VMs, but how easy is it to spin up a new VM and wait for the result? I guess you could have an entirely separate Meteor server, put your common code in a package, and have the secondary Meteor server exist purely to run these long-running functions - it can run at 100% without impacting regular performance, but it wouldn’t be able to limit a function’s execution based on the current user. For example, the package could expose an API like this:

LimitedTaskRunner.yieldIf((userId, currentCpuUsage) => {

});

This way you can easily define custom logic as to whether or not to yield on a per-user basis. This on its own is useful (e.g., limit free users but don’t limit paid users, or limit free users and limit paid users less, or check how many functions a specific user is currently running and yield or not). What would be even more useful is a way of detecting how much CPU time a user has used; given that Kadira is able to compute some of these values, I think we could also sum up the total CPU time and yield only if a user isn’t being excessive.
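
Purely for illustration, the per-user policy might then look something like this - the profile.plan field and the 60%/85% thresholds are invented for the example, and LimitedTaskRunner is still hypothetical:

LimitedTaskRunner.yieldIf((userId, currentCpuUsage) => {
  const user = Meteor.users.findOne(userId, {fields: {'profile.plan': 1}});
  const isPaid = user && user.profile && user.profile.plan === 'paid';
  // Free users yield once CPU passes 60%; paid users only above 85%.
  return currentCpuUsage > (isPaid ? 0.85 : 0.6);
});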

A tiny bit more detail on my thought process: up until now, the project I’m working on has always been used by companies, who pay a reasonably large amount; there is a limited number of them, and they are typically online at different times of the day/month/year. But we just opened up the system to turnkey users (e.g., put in a credit card and go), and I worry that individual users could end up consuming a lot of resources (or a lot relative to the amount they pay).

I am completely open to other suggestions too.

Yes, I personally would do it that way, without cluttering the app logic with performance management. Create a micro-service for those jobs; you can use Meteor (with something like the Steve Jobs package) or pure Node, and you can utilize all the CPUs and create a queue for the jobs. That way you can monitor this server on its own, scale it vertically, etc.

I can see how it would work well on a Meteor server - just a remote method call and wait for the response - but with a Node process I would think the extra code necessary to spin up new processes (and potentially new servers) would be quite complicated (and potentially expensive if your primary server was currently idle anyway), and it wouldn’t handle the case of per-user (or possibly per-connection) resources.

If the only concern is CPU and memory (which right now it is), something like AWS Lambda would work well, though all code which is currently Meteor-dependent would need to be converted to pure Node (not super hard, but not trivial either). This still wouldn’t give you per-user/connection rules, but it would be cheaper and probably easier than spinning up a new server.

We currently use both Lambda and spot servers to handle very specific workloads that are REALLY heavy and CPU-bound, but I think in most cases it would be quite expensive.

The other consideration is that CPU usage isn’t the only metric of interest here (whether per user, per connection, per DDP call, or globally). Another useful metric would be DB calls - this is somewhere the DDP rate limiter doesn’t really help: a method call that triggers a single DB update, or returns a single DB document, is treated the same as a call which updates 1000 documents, or returns 1000 documents. We have some methods that allow “bulk actions” to be taken over a range of documents; rate limiting doesn’t help here.

I think I might have a crack at implementing this as a package. I don’t think it would be particularly hard just for CPU usage initially.

Don’t overcomplicate things.

You can create a document that represents the progress of your long-running task. Start the task in a Meteor.defer(). Update this document from time to time, which will also yield. Subscribe to changes in this document. Add createdAt and modifiedAt fields, and decide, internally to your application, how long after the document was last modified to wait before deciding that the job is dead (i.e., a “fail-safe” for the long-running process).

The job will die if you take down the node process, which is true for any architecture. Use the job document to recover what was started, and maybe restart it somehow. Use it for storing progress. Don’t use other packages; this stuff is incredibly straightforward.
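
A rough sketch of that fail-safe, assuming a Jobs collection with a (hypothetical) status field, a modifiedAt timestamp, and an arbitrary 60-second staleness window:

const STALE_AFTER_MS = 60 * 1000;

// Any "running" job whose document hasn't been touched for 60 seconds is
// presumed dead; the client's subscription will see the status change.
Meteor.setInterval(() => {
  const cutoff = new Date(Date.now() - STALE_AFTER_MS);
  Jobs.find({status: 'running', modifiedAt: {$lt: cutoff}}).forEach((job) => {
    Jobs.update(job._id, {$set: {status: 'dead'}});
  });
}, 10 * 1000);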

Thanks for the suggestion, but this won’t cover the ability to limit different users differently, nor the ability to scale the throttling based on current usage.

Think about what you’re saying. Throttling usage on a web server? You should try to run at 100% CPU and RAM, otherwise you’re wasting your precious AWS dollars :slight_smile:

On a more technical note: don’t try to write a scheduler. Don’t try to schedule fibers or servers or processes. Throttle based on job size, which you can determine before running the job - that is, don’t start big jobs for users who don’t pay. Or set a time limit. Your load will be proportional to the number of jobs in flight, which will be proportional to the average lifetime of the jobs, which will be proportional to the size of the jobs.
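
As a sketch of that kind of pre-flight, size-based throttle (the method name bulkAction, the 500-document cap, and the profile.plan field are all placeholders):

Meteor.methods({
  bulkAction(selector) {
    // Hypothetical: estimate the size of the job before starting it.
    const docCount = MyCollection.find(selector).count();
    const user = Meteor.users.findOne(this.userId);
    const isPaid = user && user.profile && user.profile.plan === 'paid';
    if (!isPaid && docCount > 500) {
      throw new Meteor.Error('job-too-large',
        'Bulk actions over 500 documents require a paid plan.');
    }
    // ...run the job...
  }
});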

If this is a numeric task, or something that resembles a tree search, it’s beneficial to use a separate process if reducing your transient memory usage is absolutely essential. Alternatively, if you want to run it in the same process as the web app, remember that Node (like Java) stores everything on the heap: you have to use object pooling to keep transient memory low. However, I’d warn that there are a lot of pitfalls to doing so in Node.

I would love to run at 100%, but I don’t want other users to be stuck waiting for someone else’s non-time-sensitive job to finish. I invite you to try this simple test:

const Future = Npm.require('fibers/future');

Meteor.methods({
  longFunction() {
    this.unblock();
    const future = new Future();
    Meteor.setTimeout(() => {
      let res = 0;
      for (let i = 0; i <= 1000000000; i++) {
        res += Math.random() * Math.random();
      }
      future.return();
    }, 5);
    return future.wait();
  }
});

Open a tab, call that method, and try opening another tab - your tab won’t load until 15-30 seconds later, when this method finishes. This is of course an exaggerated example, but it illustrates things nicely. Some methods can be delayed to ensure responsiveness of the app as a whole.

Now, this isn’t to say this is the best approach in general - it’s likely that spinning up an entirely separate process or server will be better - but I’m interested to see if this is possible, and what the drawbacks are.

Yes, do this:

Meteor.methods({
  longFunction() {
    let jobId = Jobs.insert({progress: 0});
    Meteor.defer(() => {
      let res = 0;
      for (let i = 0; i <= 1000000000; i++) {
        res += Math.random() * Math.random();
        if (i % 1000 == 0) {
          // This will yield in the way you need it to.
          Jobs.update(jobId, {$set: {progress: i}});
        }
      }
    });
    return jobId;
  }
});

It will yield and immediately resume - that isn’t what I want it to do. I only want it to yield if it needs to, and I only want it to resume if the server has capacity to.

What do you mean by immediately resume?

It’s not going to keep processing until the database update has completed. During that time, it may process any number of other waiting tasks on the event loop.

Fiber_Id         Progress         Unparked_By        Parked_By
0                    1000         Database return    Database call
1                       ~         Method call        Method return to DDP
0                    2000         Database return    Database call
2                       ~         Method call        Method return to DDP
0                    3000         Database return    Database call

What do you think the operating system scheduler does?

What does it mean, yield if it needs to? Just think about it. Supposing these were different processes, how do you think the OS knows it “needs to” yield?

Real processes/threads are parked (yielded) all the time!

Resume if the server “has capacity”? I know you want just a simple answer here, but again, by what meaning?

I think if you’re not sure what the answers to these questions are, by all means, start a Node process. Operating system scheduling works for everyone, and it’ll work for you. If you want to park and unpark fibers, by all means, do that too. If you want to have 100% control of when a thread/fiber (called strands in the literature) runs and doesn’t run, you need a thisStrand.parkAndUnpark(otherStrandToResume) primitive. Most applications use queues as a high-level object to achieve this.

The most important thing to realize is that if you run the code I gave you, everything’s going to be fine and dandy. Under the hood, calls to a Mongo.collection park the caller until the database is done writing. Between the database update and its return value, the event loop will keep getting processed. If you want to like, idle more, just call Fiber.sleep. If you want to “yield more often,” reduce the modulo operand (the part after the %), but that’s pretty wasteful. I don’t see why you’d want to do that, since if it could do more work, it will!

I’m not really looking for an answer - unless the answer is “here is a library that already does it” or “here is someone who tried it and all the reasons it cannot work”.

The operating system scheduler is irrelevant here - unless I’m totally misunderstanding how Node works, everything runs in a single process. Not just one process, but a single thread - so given that pretty much the only thing running on my server is Meteor, the operating system scheduler isn’t useful here. Feel free to correct me on this - it would be a massive weight off my shoulders, but I’m about 99% sure I’m correct.

“Yield if it needs to” means exactly that - yield only if some condition is met. If you look at your example, the condition is trivial: yield every 1000 iterations, and resume as soon as the DB update is complete (a couple of ms later? Less?) - and let’s ignore the load that 1,000,000 unnecessary updates every 30 seconds would add to the oplog and MongoDB, yikes.

“Resume when the server has capacity” also means exactly that - when the CPU usage of the server drops below a certain threshold, resume. This way, when the server is idle the job runs as fast as it can; when the server is busy, the job runs more slowly - but the server is still responsive. The purpose of this is not for REALLY long-running jobs that should run in their own process, but for jobs that take 2-3 seconds: one or two users running that job is no big deal, but 20 or 30 users running it at once is a bigger problem - though potentially not one worth spinning up a new server or new process to handle.

It is not just CPU usage I am interested in - DB calls are another metric of interest. Throttling DDP invocations only goes so far: depending on the arguments issued, a DDP call could return 1, 50, or 5000 documents, or it could return just 1 but require reading 50 or 1000 documents to get the information needed for that one.

“If you want to, like, idle more” - yes, I do, but I want to idle more by an amount that changes based on usage, using values that I don’t want to rewrite/redefine in every function where I want to use it - one option here is abstracting this type of functionality into a package…

Sure, this is really easy.

var Fiber = require('fibers');

// Park the current fiber and schedule it to resume after ms milliseconds;
// the event loop and other fibers keep running in the meantime.
function sleep(ms) {
	var fiber = Fiber.current;
	setTimeout(function() {
		fiber.run();
	}, ms);
	Fiber.yield();
}

Meteor.methods({
  longFunction() {
    let jobId = Jobs.insert({progress: 0});
    Meteor.defer(() => {
      let res = 0;
      for (let i = 0; i <= 1000000000; i++) {
        res += Math.random() * Math.random();
        if (i % 1000 == 0) {
          // This will yield in the way you need it to.
          Jobs.update(jobId, {$set: {progress: i}});
          // Now idle, longer when more jobs are in flight.
          sleep(Jobs.find().count() * 1000);
        }
      }
      Jobs.remove(jobId);
    });
    return jobId;
  }
});

You’ve misunderstood me - I was just confirming that your code would pause for a few ms.

I’ll try once more - at build time, I do not know how long I will need to yield for. I will only know that at the time the method is called, and each time it is called, the amount may change - depending on how much load the server is under.

All I’m suggesting is writing functionality that will look at the current CPU load, and determine
a) if I should yield and
b) when I should resume.
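
Something like this, reusing the sleep() helper from above, is all I have in mind - with os.loadavg() as a crude load signal and made-up thresholds:

const os = require('os');

// Hypothetical: decide at run time (a) whether to yield and (b) when to
// resume. Below ~60% load, don't yield at all; above it, sleep longer the
// busier the server is, re-checking the load each time the fiber wakes up.
function adaptiveYield() {
  const load = () => os.loadavg()[0] / os.cpus().length;
  while (load() > 0.6) {
    const overload = load() - 0.6;
    sleep(Math.min(5000, 250 + overload * 10000));
  }
}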