ObserveChange very slow to trigger

Are you sure your observer is using the oplog and not polling? If it’s polling, it can take up to 10 seconds to trigger. You can use Kadira to check this, but if you’re querying by _id you should be using the oplog (unless you’re also doing something unusual with sort/limit), as long as your MongoDB is a replica set (standalones don’t have oplogs by default).

While the overall approach could probably be improved, 5 seconds for observer triggering is excessive.


Thanks for your feedback.

I checked, and indeed it seems it was not using the oplog; I have no idea why. It says:

|coll:|liveSessions|
| --- | --- |
|selector:|{"_id":"qX9adEreuKzCvczhE"}|
|func:|observeChanges|
|cursor:|true|
|oplog:|false|
|wasMultiplexerReady:|false|
|queueLength:|0|
|elapsedPollingTime:|0|
|noOfCachedDocs:|1|
|noOplogCode:|OLDER_VERSION|
|noOplogReason:|Your Meteor version does not have oplog support.|
|noOplogSolution:|Upgrade your app to Meteor version 0.7.2 or later.|

But I am on METEOR@1.10.2 locally. Is there anything to do to activate it?

Also on our dev environment it says:

|coll:|liveSessions|
| --- | --- |
|selector:|{"_id":"9HSaxJJnDvHjdjcdi"}|
|func:|observeChanges|
|cursor:|true|
|oplog:|false|
|wasMultiplexerReady:|false|
|queueLength:|0|
|elapsedPollingTime:|0|
|noOfCachedDocs:|1|
|noOplogCode:|NO_ENV|
|noOplogReason:|You haven't added oplog support for your the Meteor app.|
|noOplogSolution:|Add oplog support for your Meteor app. see: http://goo.gl/Co1jJc|

Do we have to change MONGO_URL to MONGO_OPLOG_URL, or do we need both? Is any other config needed? I thought the oplog was always part of pub/sub. I have other apps using pub/sub which are extremely reactive, and I never had to set up anything special… I’m getting confused.

The good news, on the other hand, is that I tried redis:oplog, and it seems to be night and day, at least locally. I’ll set it up on our development environment on Monday, but I’d still be curious to know why the oplog was not working so far.

You require both: the oplog URL points to the local oplog database. Beyond that, as long as your Mongo database is a replica set, you should have an oplog. Redis oplog is great, but not usually necessary until you have multiple servers observing independent datasets; nothing wrong with implementing it earlier, though!
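For reference, a minimal sketch of the two environment variables (hostnames, database name, and replica-set name here are made up; adjust them to your deployment):

```shell
# Hypothetical hosts/db/replica-set names; adjust to your deployment.
export MONGO_URL="mongodb://mongo1:27017,mongo2:27017/myapp?replicaSet=rs0"
# The oplog lives in the replica set's "local" database:
export MONGO_OPLOG_URL="mongodb://mongo1:27017,mongo2:27017/local?replicaSet=rs0"
# Then start your Meteor app with both variables set.
```

With MONGO_OPLOG_URL set, Meteor tails the oplog instead of falling back to the ~10s poll-and-diff.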

Any idea why it’s not working locally though ?

Also, it seems that redis:oplog does not need the replica set and so on, so in a way it’s much easier to implement.

Am I right in understanding that I don’t need a second MongoDB, just to create the replica set as mentioned in the procedure?

I’ve never had a project that didn’t need a replica set. You want it for redundancy in case one server goes down, so the extra effort is required either way.

In dev, your DB is a standalone, so you need to configure an oplog. It’s easy enough: you basically run `rs.initiate()` on the DB.
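For a local dev setup, the usual recipe looks something like this (the dbpath is an example, and this assumes the `mongod` and `mongo` binaries are on your PATH; it needs a live server, so treat it as a sketch):

```shell
# Run your dev mongod with a replica-set name (example dbpath):
mongod --dbpath /data/db --replSet rs0

# Then, once, in another terminal, initialize the single-node replica set:
mongo --eval 'rs.initiate()'
```

After this, the standalone becomes a one-member replica set with an oplog, and you can point MONGO_OPLOG_URL at its `local` database.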

Regarding why other projects have good reactivity without having done this: Meteor has optimistic reactivity. I think if you’re just publishing cursors as normal, the reactivity is triggered without waiting for the Mongo oplog. Not sure why that wouldn’t be working with a server-side observe, though (unless it’s running on a different server).


Why do you have a job running every 250ms? Why not react to user action?

For anything time-sensitive like that, my bet is on Streamer or something like Redis Vent. You want to avoid touching the DB, and stream things while updating the DB asynchronously, kind of like optimistic server streaming. But the first thing I would look into is that job.

Because user action can be concurrent with the timer, and I had cases where that was not handled well: the last person was replying at the same time the timer was triggering, the database updates conflicted, and the result did not reflect what it should have. Having a job running every 250ms seems a good way to keep a single source of truth. Each user can only push their own action, and the server reacts for the “team” actions and the timer actions.

If you have another way to check that all 5 people answered, and then update some part only when the last one answered, while reliably avoiding any concurrency issue, I am more than open to it.

Not saying it’s the best solution, but it seems to be a working approach, and cluster uses cores other than the main one for the task, even though we’re still in talks with nschwarz about optimization.

edit: my other post details the flow I am struggling to solve. I’m very open to any feedback you may have on how it should be approached.

I think there might be a better way to meet those specs than the 250ms job. There seems to be a lot of overhead in this loop.

I’m thinking aloud here, so feel free to ignore.

For the timer, we keep a start date in the DB, and when the session starts, the countdown starts on the clients based on that start date, so all are in sync. When this timer expires on the client, a method is triggered to get the result, so no reactivity is needed here.
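As a minimal sketch of that idea (field names like `startAt`/`durationMs` are made up), each client derives the remaining time from the shared server-side start date, so a reconnecting client lands on the correct countdown:

```javascript
// Hypothetical session shape: { startAt: <ms epoch>, durationMs: <ms> }.
// Every client computes the remaining time from the shared start date,
// so disconnects and reconnects don't desynchronize the team.
function remainingMs(session, now = Date.now()) {
  const elapsed = now - session.startAt;
  return Math.max(0, session.durationMs - elapsed);
}

// Example: a 10s round that started 3s ago has roughly 7s left.
const session = { startAt: Date.now() - 3000, durationMs: 10000 };
console.log(remainingMs(session)); // ≈ 7000
```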

For the teammate answers, we keep track in the DB of who submitted what, in an array associated with the session. When an answer is submitted, a method is called which updates the teammate answers array and then checks if all the teammates have responded. If they all have, it updates a flag in the session and locks any further submissions. Clients have a subscription to this flag and show the results page to move to the next round.
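The submit-and-check step could look roughly like this pure function (names are illustrative; in Meteor the update and the all-answered check would live in one method, and ideally in one atomic DB update):

```javascript
// Illustrative only: apply one player's answer and decide whether the
// round is complete. Duplicate or late submissions are ignored.
function applyAnswer(session, playerId, answer) {
  if (session.allAnswered || session.answers.some(a => a.playerId === playerId)) {
    return session; // locked, or this player already answered
  }
  const answers = [...session.answers, { playerId, answer }];
  return { ...session, answers, allAnswered: answers.length === session.teamSize };
}

let s = { teamSize: 2, answers: [], allAnswered: false };
s = applyAnswer(s, "A", 42);
s = applyAnswer(s, "B", 7);
console.log(s.allAnswered); // true: the flag flips on the last submission
```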

What would be the limitation of this approach in contrast to the job approach you currently have?

Thanks for the feedback. Some good ideas.

The timer cannot rely on the client; how do you handle disconnection, for example? We have a requirement that people can leave the session and come back. But the timer alone could be handled by a simpler jobs solution on the server side, I agree.

Here, I have to do more research on how to handle multiple instances. My problem is that we query the session at a given time, modify it, and then check if everyone has replied. BUT we could have multiple instances of the server running, so 2 people from a team could reply simultaneously on 2 servers. When people reply at the same time, each of them, at the moment they reply, is not the last one to reply (since they are replying simultaneously), so in the end no team answer is sent because neither of them considered themselves the last one. Not sure how to handle this in a 100% safe way.

Regarding overhead: if the job is triggered on a core you don’t use anyway… is this really overhead?

I don’t think it is CPU-usage overhead. It is more the processing latency: you are wasting 250ms plus any additional processing time, which is the main problem here. How are you triggering and managing those jobs?

For the disconnect, how do you currently handle that? You would still need to detect the client disconnect and update something in the DB. So how is that being managed now?

Regarding the concurrency protection, why not lock at the DB level, since the DB is shared among all instances?

DB transactions. Both MongoDB and Redis support transactions.

Maybe you are right about the server timer; it needs pub/sub, given that it shares a real-time session.

I’m just trying to see if there is a way to optimize this process :thinking:

Not sure what you mean by locking at the DB level, though the idea sounds great. Any doc or example of what you mean exactly?

Right now the problem is that, even with one DB and 2 instances:

If I have 2 instances running simultaneously, let’s say:

Instance 1 is managing player A answering
Instance 2 is managing player B answering

They answered exactly at the same time.

According to both Instance 1 and Instance 2, neither is the last one to reply when I query the current situation, so I’d need an external process to check after both of them have added their reply. Not sure how to do this at the DB level.
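One common fix is to make the “count and check” a single atomic operation, so that exactly one writer observes the final count. Here is a sketch with an in-memory stand-in (field names are made up; in MongoDB this would be a single `findOneAndUpdate` with `$inc`, reading back the updated document):

```javascript
// In-memory stand-in for an atomic increment-and-read. With MongoDB, one
// findOneAndUpdate({ _id }, { $inc: { answerCount: 1 } }, { returnDocument: "after" })
// gives the same guarantee across instances: the increment and the read-back
// are one atomic step, so exactly one submitter sees answerCount === teamSize.
const session = { teamSize: 2, answerCount: 0, resultsSent: 0 };

function submitAnswer(playerId) {
  const countAfter = ++session.answerCount; // atomic in the DB version
  if (countAfter === session.teamSize) {
    // Only the submitter whose write completed the round runs this,
    // no matter which server instance handled it.
    session.resultsSent += 1;
  }
}

submitAnswer("A"); // handled by instance 1
submitAnswer("B"); // handled by instance 2, "simultaneously"
console.log(session.resultsSent); // 1: exactly one writer was "last"
```

Because the DB serializes the increments, the race between your two instances disappears without any external checker process.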

Regarding the 250ms:
The 250ms can actually be 100ms; it’s the refresh rate of GitHub - nathanschwarz/meteor-cluster: worker pool for meteor using node js native `cluster` module.
Having 250ms of latency is not really problematic; even one second for the whole process wouldn’t be the end of the world, because we handle it on the client side with a loader, and people know it’s a team game, so they understand you have to wait for the others. But 5s, as it is now, is unbearable.

We handle disconnects in such a way that there is always a “captain” per team who can force moving to the next step, so that players have time to reconnect (thanks to their session or localStorage, they’re assigned back to their team). It still has flaws, but that’s the next piece of work we have to do, at lower priority.

I don’t think I’ve had to deal with this until now. Any good resources you could recommend on the subject? I’ll check the documentation. What do transactions allow you to do?

I still think you are not factoring in the processing time of this package, which is not the same as the refresh-rate interval. This package is for handling heavy jobs in separate processes; I don’t think it is designed for handling near-real-time jobs with very low latency.

I’ve not used it, but my feeling is that it has processing latency and that this is the culprit in your problem, since this is where you diverge from a typical pub/sub. That is my intuition; I might be wrong, just trying to help.

As for the concurrency, you mentioned the following in your other post:

Where is the job collection being stored? Is it in memory on that synchronization/pulse server?


I’m in direct discussion with Nathan, the maker of the package, and he advised me to use it like this. I’ll actually have a call with him tomorrow, so I’ll have more insight. I may not be using it fully properly, which is another problem, but the process is actually pretty smooth.

Not sure what you mean by the processing time of the package, but when running an empty loop, the package takes 2ms and does not block the main queue. https://blog.meteor.com/scale-your-app-with-multi-core-7fba03192ea2

The jobs live in a specific collection in MongoDB;
it has some info needed for the logic, and that’s it.

I loop through the jobs to do my actions. When a job is completed, I delete it from the collection,
so generally the jobs collection doesn’t have more than 20 or 30 items; it’s very fast to loop through. Actually, everything was working smoothly except the refresh, so maybe with redis:oplog we’ll be good to go.

Sorry, I reread the thread more carefully, and your issue is with the observer latency, as you pointed out.

And echoing what others stated: without the oplog, the default is polling, which adds latency. And there could be locking/delays happening at the DB as well. Both issues can be bypassed using Redis oplog, Vent, or Streamer. Sorry again, I was just trying to see if there was a better approach before optimizing the current one. :man_shrugging:


Don’t be sorry, you raised very interesting points. Thanks a lot for taking the time.

The oplog was clearly an issue. We still have some load issues using the cluster library, but hopefully they will be worked out in the following days. There surely are also processes that can be improved in our approach, and I appreciated your brainstorming!

I am sure there are different ways of achieving the final solution, as usual in coding, and it’s nice to have input from others on how they would deal with it.


They allow you to execute a group of queries while holding a lock on the underlying documents (as @alawi mentioned above). That removes your problem of simultaneous updates and finds from different sessions overlapping.
