We’re using a pub/sub in a place where it cannot be replaced by a Method. Reactivity is essential here; it is actually the core of this part of our platform. When people connect they subscribe to a very simple publication:
As you can see, there is no index issue: we’re just getting one document from the collection and need to keep in sync with it. The number of users can range from a few to hundreds, but that doesn’t seem to matter, as problems already appear with only 1 user.
We have a flow such as:
1. A user submits an answer.
2. When all users of a team have submitted their answers, a team answer is automatically submitted (using cluster from nschwarz:cluster to manage this).
3. Users get the updated Session information and move on.
I set up plenty of timestamps to get an idea of what could be causing the slow performance. It seems that the pub reactivity (detected thanks to observeChanges) is the culprit, but I don’t know why.
Let’s say user A submits his answer at t=0 and he is the last one of his team:
- Submitting the team answer is triggered between t=20ms and t=270ms, thanks to a permanently running job that checks the situation of a team every 250ms.
- The function that modifies the db finishes between t=150ms and t=400ms.
- The observer is triggered at t=5000ms!!
- Changes are on the client side at t=5500ms.

I cannot explain why the observer takes so much time to be triggered. It’s also very random: sometimes it’s almost instant but, too many times, it takes this long.
Is it possible that this is a db issue? Is that job running every 250ms querying the db?
We are using redis-oplog, so I am more familiar with it; it lets you fine-tune the reactivity during updates.
Lastly, you might need to think of a solution that bypasses MongoDB. Save the answers in an in-memory db like Redis and use a package like meteor-streams (it seems there is a more up-to-date one) to manage reactivity. Save to MongoDB only once all answers have been provided by the team; at that point it is no longer required for reactivity.
Yes, the job running every 250ms is querying MongoDB. Most of the time it is one query and then moving on; sometimes there is work to be done. Metrics on the db don’t show any saturation (20% usage of CPU or memory on our db instance), but maybe it still affects the reactivity.
RedisOplog will need us to have a Redis instance, I guess? We’re not very familiar with Redis even though we started having a look at it. It seems that could be an interesting approach indeed, but I’m not sure we have the time to implement it.
Does redisOplog have any downsides? It seems like a magical solution to improve reactivity. Does it set up a Redis layer between the user and MongoDB to manage the updates? Would I have any other code to change if I set up redis-oplog to see whether it has a favorable impact?
Is it good practice to put logic in observeChanges? I was thinking I could stop the 250ms job that checks the database and instead use this.changed in the observeChanges callback to check whether all users of a team have replied. That would move the logic, which now runs on a separate worker thanks to the cluster, to the main process probably (not sure I properly understand how this works), but trigger it only from observeChanges. I am scared this could be even more resource intensive or blocking, as I’d have to loop quite a lot in the method as part of the observer and there are a lot of changes happening, so it could lead to much worse results than my current solution.
Are you sure your observer is using the oplog and not polling? If it’s polling it could take up to 10 seconds to trigger. You can use Kadira to check this, but if you’re querying by _id you should be using the oplog (unless you’re also doing weird things with sort/limit), so long as your MongoDB is a replica set (standalones don’t have oplogs by default).
While the overall approach could probably be improved, 5 seconds for observer triggering is excessive.
I checked and indeed it seems it was not using oplog, I have no idea why. It says:
| | |
| --- | --- |
|noOplogReason:|Your Meteor version does not have oplog support.|
|noOplogSolution:|Upgrade your app to Meteor version 0.7.2 or later.|
But I am on version METEOR@1.10.2 locally. Is there anything to do to activate it?
Also on our dev environment it says:
| | |
| --- | --- |
|noOplogReason:|You haven't added oplog support for your the Meteor app.|
|noOplogSolution:|Add oplog support for your Meteor app. see: http://goo.gl/Co1jJc|
Do we have to change MONGO_URL to MONGO_OPLOG_URL, or do we need both? Is any other config needed? I thought oplog was always part of pub/sub. I have other apps with pub/sub which are extremely reactive and I never had to set up anything special… I’m getting confused.
The good news, on the other hand, is that I tried redis-oplog and it seems to be night and day, at least locally. I’ll set it up on our development environment from Monday, but I would still be curious to know why oplog was not working so far.
You require both. The oplog URL points to the local db, which is where the oplog lives; beyond that, so long as your Mongo database is a replica set you should have an oplog. Redis oplog is great, but not usually necessary until you have multiple servers observing independent datasets. Nothing wrong with implementing it earlier though!
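A small startup check along these lines might help catch a missing variable early. This is only a sketch: `checkOplogEnv` is a hypothetical helper, not part of Meteor, and the host and database names in the example are made up.

```javascript
// Returns a list of human-readable problems with the oplog-related
// environment variables; an empty list means the basic setup looks right.
// (Hypothetical helper, not a Meteor API.)
function checkOplogEnv(env) {
  const problems = [];
  if (!env.MONGO_URL) {
    problems.push("MONGO_URL is not set");
  }
  if (!env.MONGO_OPLOG_URL) {
    problems.push("MONGO_OPLOG_URL is not set");
  } else if (!/^mongodb(\+srv)?:\/\/[^/]+\/local(\?|$)/.test(env.MONGO_OPLOG_URL)) {
    // The oplog is stored in the "local" database of a replica set member.
    problems.push("MONGO_OPLOG_URL should point at the 'local' database");
  }
  return problems;
}

// Example with made-up hosts and a made-up app database name.
const problems = checkOplogEnv({
  MONGO_URL: "mongodb://db1:27017,db2:27017/myapp?replicaSet=rs0",
  MONGO_OPLOG_URL: "mongodb://db1:27017,db2:27017/local?replicaSet=rs0",
});
console.log(problems.length); // 0
```

In a Meteor app this could be called with `process.env` at server startup, logging whatever it returns.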
I’ve never had a project that didn’t need a replica set. You want it for redundancy in case one server goes down, so the extra effort is required either way.
In dev, your db is a standalone, so you need to configure an oplog. It’s easy enough: you basically run rs.initiate() on the db.
Regarding why other projects have good reactivity without having done this: Meteor has optimistic reactivity. I think if you’re just publishing cursors as normal, the reactivity is triggered without waiting for the Mongo oplog. Not sure why that wouldn’t be working with a server-side observe though (unless it’s running on a different server).
Why do you have a job running every 250ms? Why not react to user action?
For anything time-sensitive like that, I’d bet on Streamer or something like redis vent. You want to avoid touching the DB, and stream things while updating the DB async, kind of like optimistic server streaming. But the first thing I would look into is that job.
Because user action can be concurrent with the timer, and I had cases where that was not handled well. The last person was replying and the timer was triggering at the same time; the database updates conflicted and the result did not reflect what it should have. Having a job running every 250ms seems a good way to keep a single point of truth. Each user can only push their own action, and the server reacts for the “team” actions and the timer actions.
If you have another way to check that all 5 persons answered and then update some part only when the last one answered, while reliably avoiding any concurrency issue, I am more than open to it.
Not saying it’s the best solution, but it seems to be a working approach, and cluster uses cores other than the main one for the task, even though we’re still in talks with nschwarz about optimization.
Edit: my other post details the flow I am struggling with. I’m very open to any feedback you may have on how it should be approached.
I think there might be a better way to meet those specs than the 250ms job. There seems to be a lot of overhead in this loop.
I’m thinking aloud here, so feel free to ignore.
For the timer, we have a start date in the DB, and when the session starts, the counter is started on the clients based on that start date so all are in sync. When this timer expires on the client, a method is triggered to get the result, so no reactivity is needed here.
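As a rough illustration of the timer idea: if the start date lives in the session document, anyone holding that document can compute the remaining time instead of running an independent countdown. The field names `startedAt` and `durationMs` are assumptions for this sketch, not taken from the thread.

```javascript
// Remaining time of a round, derived only from the stored start date.
// `startedAt` (Date) and `durationMs` (number) are assumed field names.
function remainingMs(session, now = Date.now()) {
  const elapsed = now - session.startedAt.getTime();
  return Math.max(0, session.durationMs - elapsed);
}

function isExpired(session, now = Date.now()) {
  return remainingMs(session, now) === 0;
}

// A client that reconnects mid-round just recomputes from the same start date.
const session = { startedAt: new Date("2020-06-01T12:00:00Z"), durationMs: 60000 };
const reconnectTime = Date.parse("2020-06-01T12:00:45Z");
console.log(remainingMs(session, reconnectTime)); // 15000
```

Client clocks can drift, so the server would probably still run the same check with its own clock before accepting that a round is over.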
For the teammate answers, we keep track in the DB of who submitted what, in an array associated with the session. When an answer is submitted a method is called, which updates the teammate answers array and then checks whether all the teammates have responded. If they have, it updates a flag in the session and locks any further submissions. Clients have a subscription to this flag and show the results page to move to the next round.
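The method body described above could look roughly like this. It is a sketch against a plain in-memory object standing in for the session document; `teamMembers`, `answers` and `allAnswered` are hypothetical field names.

```javascript
// In-memory stand-in for the session document; all field names are assumptions.
function createSession(teamMembers) {
  return { teamMembers, answers: [], allAnswered: false };
}

// What the method would do for one submission.
// Returns true only for the submission that completes the team.
function submitAnswer(session, userId, answer) {
  if (session.allAnswered) {
    throw new Error("Submissions are locked for this round");
  }
  if (!session.teamMembers.includes(userId)) {
    throw new Error("User is not part of this team");
  }
  if (session.answers.some((a) => a.userId === userId)) {
    throw new Error("User has already answered");
  }
  session.answers.push({ userId, answer });
  if (session.answers.length === session.teamMembers.length) {
    session.allAnswered = true; // the flag clients would subscribe to
    return true;
  }
  return false;
}

const round = createSession(["alice", "bob"]);
console.log(submitAnswer(round, "alice", "42")); // false
console.log(submitAnswer(round, "bob", "41"));   // true
console.log(round.allAnswered);                  // true
```

As written this is only safe while a single process handles every submission; with several server instances the check and the flag update would need to be one atomic database operation.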
What would be the limitation of this approach in contrast to the job approach you currently have?
The timer cannot rely on the client: how do you handle disconnection, for example? We have requirements that people can leave the session and come back. But I agree the timer alone could be handled by a simpler jobs solution on the server side.
Here, I have to do more research on how to handle multiple instances. My problem is that we query the session at a given time, modify it and then check if everyone has replied, BUT we could have multiple instances of the server running, so 2 people from a team could be replying simultaneously on 2 servers. When they reply at the same time, each of them sees that they are not the last to reply, so in the end no team answer is sent because neither considered themselves the last one. Not sure how to handle this in a 100% foolproof way.
I don’t think it is the CPU usage overhead. It is more the processing latency: you are wasting 250ms plus any additional processing time, which is the main problem here. How are you triggering and managing those jobs?
For the disconnect, how do you currently handle that? You would still need to detect the client disconnect and update something in the DB. So, how is that being managed now?
Regarding the concurrency protection, why not lock at the DB level, since the DB is shared among all instances?
Not sure what you mean by locking at the DB level, though the idea would be great. Any doc or example of what you mean exactly?
Right now the problem is that, even with one db and 2 instances:
If I have 2 instances running simultaneously, let’s say:
- Instance 1 is managing player A answering
- Instance 2 is managing player B answering

They answered at exactly the same time.
According to Instance 1 and 2, neither is the last one to reply when I query the current situation, so I’d need an external process to check after both of them have added their reply. Not sure how to do this at the db level.
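One possible reading of "locking at the DB level" for this scenario, sketched under assumptions: each instance first writes its own player's answer with an atomic per-field update, then re-reads the session, and finally tries to flip a flag with a conditional update that only matches while the flag is still unset. In MongoDB terms that would be an `updateOne` whose filter includes `teamAnswerSubmitted: false`; the winner is the caller whose update reports one modified document. The store below is an in-memory stand-in, not a MongoDB client, and every name in it is hypothetical.

```javascript
// In-memory stand-in for the shared session document (not a MongoDB client).
// Each method models one atomic single-document operation.
function createStore(teamSize) {
  const doc = { teamSize, answers: {}, teamAnswerSubmitted: false };
  return {
    // Models: updateOne({ _id }, { $set: { ["answers." + userId]: answer } })
    setAnswer(userId, answer) {
      doc.answers[userId] = answer;
    },
    // Models: findOne({ _id })
    read() {
      return { ...doc, answers: { ...doc.answers } };
    },
    // Models: updateOne({ _id, teamAnswerSubmitted: false },
    //                   { $set: { teamAnswerSubmitted: true } })
    // Returns the number of modified documents: 1 for the winner, 0 otherwise.
    claimTeamAnswer() {
      if (doc.teamAnswerSubmitted) return 0;
      doc.teamAnswerSubmitted = true;
      return 1;
    },
  };
}

// Runs on an instance after it has written its own player's answer.
// Returns true if this instance should submit the team answer.
function tryFinishTeam(store) {
  const snapshot = store.read();
  const complete = Object.keys(snapshot.answers).length === snapshot.teamSize;
  return complete && store.claimTeamAnswer() === 1;
}

// Players A and B answer "at the same time" on two instances:
// both writes land before either instance re-reads the session.
const store = createStore(2);
store.setAnswer("A", "yes"); // instance 1
store.setAnswer("B", "no");  // instance 2
const instance1Wins = tryFinishTeam(store);
const instance2Wins = tryFinishTeam(store);
console.log(instance1Wins, instance2Wins); // true false
```

Because every answer is written before the session is re-read, whichever instance reads last sees all answers, so at least one instance finds the team complete; the conditional update then ensures at most one of them submits. This assumes reads go to the primary and see the instance's own preceding write.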
We handle disconnects by always having a “captain” per team who can force the move to the next step, so that players have time to reconnect (thanks to their session or localStorage they’re assigned back to their team). It still has flaws, and that’s the next work we have to do, but it is lower priority.
I still think you are not factoring in the processing time of this package, which is not the same as the refresh rate interval. This package is for handling heavy jobs on separate processes; I don’t think it is designed for handling near-real-time jobs with very low latency.
I’ve not used it, but my feeling is that it has processing latency and that it is the culprit in your problem, since this is where you diverge from a typical pub/sub. That is my intuition though; I might be wrong, just trying to help.
As for the concurrency, so you mentioned the following in your other post:
Where is the job collection being stored? Is it in memory at that synchronization/pulse server?